RabbitMQ in HA mode: services do not reconnect to other nodes when a host goes down

asked 2016-04-21 05:39:51 -0500

bochi-michael

I deployed an OpenStack HA cluster with 3 nodes and 3 RabbitMQ servers.

Problem 1:

Take a service such as nova-compute: if the RabbitMQ server it is connected to is powered off, the service gets stuck. It does not reconnect to another node and stops sending messages. All compute nodes are shown as "down" in nova service-list and all neutron agents are down, while cinder-volume, nova-conductor, nova-scheduler and others are still up.

I checked the TCP connections; nova-compute still has connections to port 5672 of the powered-off controller:

tcp 0 0 ESTABLISHED 5371/python
tcp 0 0 ESTABLISHED 5371/python
tcp 0 0 ESTABLISHED 5371/python

I tried reducing the TCP keepalive settings as follows and repeating the test, but it did not help; the connections to the dead controller remain.

net.ipv4.tcp_keepalive_time = 10
net.ipv4.tcp_keepalive_probes = 2
net.ipv4.tcp_keepalive_intvl = 5
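
One thing worth noting about these sysctls: on Linux they only apply to sockets that have SO_KEEPALIVE enabled, so a client that never sets that option sends no probes regardless of the values above. Below is a minimal Python sketch of enabling keepalive on a single socket, assuming Linux. The helper name `enable_keepalive` is made up for illustration and the code is not taken from nova or oslo.messaging.

```python
import socket


def enable_keepalive(sock, idle=10, interval=5, probes=2):
    """Turn on TCP keepalive for one socket, mirroring the sysctl values above.

    Hypothetical helper for illustration; the TCP_KEEP* options are
    Linux-specific, so each one is guarded with hasattr().
    """
    # Without SO_KEEPALIVE the kernel never sends probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of tcp_keepalive_time, _intvl and _probes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    # A non-zero value means keepalive is enabled on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

Whether the AMQP client used by nova sets this option is something I have not verified, so this only illustrates why changing the sysctls alone may have no visible effect.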

How can I make the clients reconnect to the other servers when one is dead?

rabbit config in nova.conf:

rabbit_hosts =,,
rabbit_userid = openstack
rabbit_password = RABBIT_PASS
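
To make the expected behaviour concrete: with several hosts listed in rabbit_hosts, I would expect the client to move on to the next host when one is unreachable. A rough Python sketch of that failover using plain TCP sockets follows. The function name `connect_first_available` is hypothetical, there is no AMQP handshake, and this is not how oslo.messaging is implemented; it only illustrates the idea.

```python
import socket


def connect_first_available(hosts, timeout=3.0):
    """Return a connected socket to the first reachable (host, port) pair.

    Hypothetical helper: plain TCP only, no AMQP handshake.
    """
    last_error = None
    for host, port in hosts:
        try:
            # The timeout bounds how long a dead host can block the attempt.
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            # Remember the failure and fall through to the next host.
            last_error = exc
    raise ConnectionError("no broker reachable: %r" % (last_error,))
```

The important detail is the connect timeout: without one, an attempt against a powered-off host can hang for a long time before the next host is tried.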

Problem 2:

When I power the controller back on, nova-compute reconnects to another RabbitMQ server and sends messages, but the compute services are still shown as "down". I checked the log and it reports a timeout error: "MessagingTimeout: Timed out waiting for a reply to message ID"

I have configured a RabbitMQ cluster and queue mirroring, so why do I get this error?

root@controller1:~# rabbitmqctl cluster_status
Cluster status of node rabbit@controller1 ...

root@controller1:~# rabbitmqctl list_policies
Listing policies ...
/ ha-all all ^(?!amq\.).* {"ha-mode":"all"} 0
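
For reference, the pattern in the ha-all policy uses a negative lookahead, so it should match every name that does not start with "amq.". A small Python check of that reading is below. The queue names are made-up examples, and Python's `re` module stands in for RabbitMQ's own regex engine, which I assume treats this pattern the same way.

```python
import re

# Pattern copied from the ha-all policy above.
HA_ALL_PATTERN = re.compile(r"^(?!amq\.).*")


def is_mirrored(queue_name):
    """Return True if the policy pattern matches the given name."""
    return HA_ALL_PATTERN.match(queue_name) is not None


if __name__ == "__main__":
    # Example names only; they are not taken from my cluster.
    for name in ("compute", "reply_abc123", "amq.gen-xyz"):
        print(name, is_mirrored(name))
    # compute True
    # reply_abc123 True
    # amq.gen-xyz False
```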
