
RabbitMQ failover issue Kilo

asked 2015-10-19 10:23:05 -0500

Attila Szlovencsak

Hi!

I have an issue with HA (mirrored) RabbitMQ queues under Kilo (2015.1.1).

If a RabbitMQ node fails, some of the services (like nova-scheduler) manage to fail over to another node:

2015-10-19 16:47:41.741 28727 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on controller1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2015-10-19 16:47:42.763 28727 DEBUG amqp [-] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL.  See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2013 GoPivotal, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'connection.blocked': True, u'authentication_failure_close': True, u'basic.nack': True, u'consumer_priorities': True, u'consumer_cancel_notify': True, u'publisher_confirms': True}, u'platform': u'Erlang/OTP', u'version': u'3.2.4'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US'] _start /usr/lib/python2.7/dist-packages/amqp/connection.py:754
2015-10-19 16:47:42.780 28727 DEBUG amqp [-] Open OK! _open_ok /usr/lib/python2.7/dist-packages/amqp/connection.py:640
2015-10-19 16:47:42.782 28727 DEBUG amqp [-] using channel_id: 1 __init__ /usr/lib/python2.7/dist-packages/amqp/channel.py:80
2015-10-19 16:47:42.790 28727 DEBUG amqp [-] Channel open _open_ok /usr/lib/python2.7/dist-packages/amqp/channel.py:438
2015-10-19 16:47:43.053 28727 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on controller2:5672

But on the same node, nova-api fails with the following log

2015-10-19 16:47:42.014 11763 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on controller1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2015-10-19 16:47:42.271 11763 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-10-19 16:47:42.899 11764 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on controller2:5672
2015-10-19 16:47:42.906 11795 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-10-19 16:47:43.052 11764 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-10-19 16:47:43.151 11796 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-10-19 16:47:43.541 11796 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on controller3:5672
2015-10-19 16:47:43.543 11763 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-10-19 16:47:43.569 11795 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on controller3:5672
2015-10-19 16:47:43.607 11763 INFO oslo_messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on controller3:5672
2015-10-19 16:47:44.163 11795 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111] ECONNREFUSED
2015-10-19 16:47:44.326 11764 INFO oslo_messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 111]

I wonder why this happens, since these two services use the same nova.conf ...


1 answer


answered 2015-10-20 09:25:02 -0500

Attila Szlovencsak

updated 2015-10-27 14:49:34 -0500

To answer my own question... :)

It seems to be a timing problem inside Kombu. Browsing the code, I found the following lines in oslo_messaging/_drivers/impl_rabbit.py:

# XXX(nic): when reconnecting to a RabbitMQ cluster
# with mirrored queues in use, the attempt to release the
# connection can hang "indefinitely" somewhere deep down
# in Kombu.  Blocking the thread for a bit prior to
# release seems to kludge around the problem where it is
# otherwise reproduceable.
if self.driver_conf.kombu_reconnect_delay > 0:
    time.sleep(self.driver_conf.kombu_reconnect_delay)

So increasing the default value (1.0) seems to solve the problem.

nova.conf

[oslo_messaging_rabbit]
...
kombu_reconnect_delay=3.0
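
For reference, the rest of the [oslo_messaging_rabbit] section should already list all brokers and enable mirrored queues; a sketch of a typical Kilo setup (hostnames and credentials below are placeholders, not copied from my actual config) looks like this:

[oslo_messaging_rabbit]
rabbit_hosts = controller1:5672,controller2:5672,controller3:5672
rabbit_userid = openstack
rabbit_password = RABBIT_PASS
rabbit_ha_queues = true
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
kombu_reconnect_delay = 3.0

Remember to restart the affected services (e.g. service nova-api restart and service nova-scheduler restart on Ubuntu) so the new delay is picked up.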

And now in nova-api.log

2015-10-20 15:55:41.260 8909 ERROR oslo_messaging._drivers.impl_rabbit [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] AMQP server on controller1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2015-10-20 15:55:44.266 8909 ERROR oslo_messaging._drivers.impl_rabbit [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] AMQP server on controller1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2015-10-20 15:55:47.272 8909 DEBUG amqp [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL.  See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2013 GoPivotal, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'connection.blocked': True, u'authentication_failure_close': True, u'basic.nack': True, u'consumer_priorities': True, u'consumer_cancel_notify': True, u'publisher_confirms': True}, u'platform': u'Erlang/OTP', u'version': u'3.2.4'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US'] _start /usr/lib/python2.7/dist-packages/amqp/connection.py:754
2015-10-20 15:55:47.275 8909 INFO oslo_messaging._drivers.impl_rabbit [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] Reconnected to AMQP server on controller3:5672

If you have many nova-api processes (one parent and several child workers), you will see lots of "Trying to reconnect" messages until every child process has failed over. During that period, you may see longer delays (around 30 s) in compute REST API calls that need AMQP (like nova pause <vm_id>).
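
One way to watch the workers failing over (just an illustration; the command below is standard rabbitmqctl, not taken from my logs) is to list the client connections on a surviving node and check that they are all re-established from the controller hosts:

rabbitmqctl list_connections user peer_host peer_port state

Run it on controller2 or controller3 after stopping controller1; once everything has reconnected, the number of connections should roughly match the number of nova-api and nova-scheduler worker processes.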

