
How to debug cluster where instance status is stuck on "scheduling"?

asked 2013-09-24 15:05:27 -0600 by vladber

updated 2013-09-24 18:03:16 -0600 by smaffulli

In my OpenStack cluster I have 3 control nodes, 2 load balancers, and 2 compute nodes.

When I run the nova boot command or launch a new instance from the GUI, the status of the newly created instance always remains "scheduling". I stopped the rabbitmq-server on 2 of the control nodes, but this didn't help.

What may cause this issue? What are the relevant logs to trace it?


Comments

Did you find any good ways of debugging this problem?

Gamekiller77 ( 2014-03-25 08:37:13 -0600 )

2 answers


answered 2014-11-05 12:44:27 -0600 by andybrucenet

Hi! I run into this frequently... I really should write a blog post about it. I have a robust HA install with 2 of everything. Here's what I do to debug the problem:

  1. My haproxy.cfg explicitly specifies "backup" for my secondary nodes. I don't want any confusion about where messages are delivered. Here's an example:
      listen keystone_api 172.24.8.21:5000
        balance source
        option tcpka
        option tcplog
        mode tcp
        maxconn 10000
        server lvoskeyst100 172.24.8.49:5000 check inter 2000 rise 2 fall 5
        server lvoskeyst200 172.24.8.50:5000 check inter 2000 rise 2 fall 5 backup
    
  2. My Keystone controllers are installed with RabbitMQ and nothing else, so I restart both sets of services on the primary Keystone node:
    service openstack-keystone stop
    service rabbitmq-server stop
    service rabbitmq-server start
    service openstack-keystone start
    
    Note that I explicitly stop / restart each service.
  3. I have the Nova, Horizon, and Heat controllers all on the same node. I'll then restart those services on the primary node (a sketch of the restart commands follows after this list):
    Stopping openstack-nova-api:                               [  OK  ]
    Starting openstack-nova-api:                               [  OK  ]
    Stopping openstack-nova-cert:                              [  OK  ]
    Starting openstack-nova-cert:                              [  OK  ]
    Stopping openstack-nova-consoleauth:                       [  OK  ]
    Starting openstack-nova-consoleauth:                       [  OK  ]
    Stopping openstack-nova-scheduler:                         [  OK  ]
    Starting openstack-nova-scheduler:                         [  OK  ]
    Stopping openstack-nova-conductor:                         [  OK  ]
    Starting openstack-nova-conductor:                         [  OK  ]
    Stopping openstack-nova-novncproxy:                        [  OK  ]
    Starting openstack-nova-novncproxy:                        [  OK  ]
    
  4. Next, I manually reset the state of whatever "stuck" VM I have:
    nova reset-state --active dd2bbc8c-c2fa-4894-8c98-3c858e604d46
    
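For step 3, that output comes from restarting each Nova service in turn; a minimal sketch of the commands (assuming the RHEL/CentOS-style service names shown in the output above):

    # Restart each Nova control-plane service on the primary node, one at a time.
    for svc in openstack-nova-api openstack-nova-cert openstack-nova-consoleauth \
               openstack-nova-scheduler openstack-nova-conductor openstack-nova-novncproxy; do
        service $svc restart
    done

For step 4, nova list will show the UUID of the stuck instance if you don't have it handy.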

The above, I confess, doesn't fix the root cause, which is certainly RabbitMQ! But it does get me back up and running every time so far :)


Comments

Hi, did you ever find the root cause for this? In my setup it happens approximately every 60 minutes, so your solution isn't really workable for me :)

mathias ( 2015-01-09 09:26:16 -0600 )

answered 2014-11-06 09:13:51 -0600 by jfarschman

Morning,

I think Andy is correct when he says the root cause "is certainly RabbitMQ". I faced a similar problem where a firewall would DROP packets without notifying the endpoints that the connection was dropped. This was done because the connection had been idle for too long. Better behavior would be to let both parties know the connection was dropped and should be re-established, but drop-without-notifying was the behavior.
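
One way to spot this kind of silent drop is to watch for AMQP errors in the Nova logs and compare with what the broker sees (a sketch; the log file names are an assumption based on an RDO-style install and vary by distro):

    # Look for AMQP/RabbitMQ errors around the time of the stuck boot.
    grep -i -E 'amqp|rabbit|timed out' /var/log/nova/scheduler.log /var/log/nova/conductor.log
    # And check which connections the broker currently sees.
    rabbitmqctl list_connections user peer_host peer_port state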

So my RabbitMQ connections were being dropped every 2 hours, and restarting RabbitMQ was all I needed to do to recover each time.

My solution was to ask the firewall team to change that behavior, but in the meantime I also changed sysctl.conf to send TCP keepalive packets. Below is a little snippet from my Puppet config.

"net.ipv4.tcp_keepalive_time" => { value => 60}, "net.ipv4.tcp_keepalive_intvl" => { value => 5}, "net.ipv4.tcp_keepalive_probes" => { value => 6},

Hope this helps.

