To answer my own question... :)

It seems to be a timing problem inside Kombu. Browsing the code, I found the following lines in oslo_messaging/_drivers/impl_rabbit.py:

# XXX(nic): when reconnecting to a RabbitMQ cluster
# with mirrored queues in use, the attempt to release the
# connection can hang "indefinitely" somewhere deep down
# in Kombu.  Blocking the thread for a bit prior to
# release seems to kludge around the problem where it is
# otherwise reproduceable.
if self.driver_conf.kombu_reconnect_delay > 0:
    time.sleep(self.driver_conf.kombu_reconnect_delay)

So increasing the delay above its default value of 1.0 second seems to solve the problem.
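
For what it's worth, the same reconnect behaviour can be exercised outside of nova with Kombu's public API. This is only a minimal sketch to poke at broker failover: the broker URL, credentials and the 3.0 second delay are my own assumptions, and it is not the oslo.messaging code path itself.

import time

from kombu import Connection

KOMBU_RECONNECT_DELAY = 3.0  # plays the role of kombu_reconnect_delay

def on_error(exc, interval):
    print('Broker unreachable: %s, retrying in %s seconds' % (exc, interval))

# Assumed broker URL/credentials; point this at one of the cluster nodes.
conn = Connection('amqp://guest:guest@controller1:5672//')
try:
    # Keep retrying until a broker answers, backing off between attempts.
    conn.ensure_connection(errback=on_error, max_retries=None,
                           interval_start=1, interval_step=2,
                           interval_max=30)
    print('Connected to %s' % conn.as_uri())
finally:
    # Same idea as the kludge above: pause briefly before releasing the
    # connection so the release does not hang on a half-dead channel.
    if KOMBU_RECONNECT_DELAY > 0:
        time.sleep(KOMBU_RECONNECT_DELAY)
    conn.release()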

In nova.conf:

[oslo_messaging_rabbit]
...
kombu_reconnect_delay=3.0
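
To confirm that the override is really what nova will read, the option can be loaded back with oslo.config. A minimal sketch, assuming nova.conf lives at /etc/nova/nova.conf; nova itself still has to be restarted before the new value takes effect:

from oslo_config import cfg

# Register the option the same way it is declared (a float, default 1.0)
# and read it back from nova.conf.
opts = [cfg.FloatOpt('kombu_reconnect_delay', default=1.0)]

conf = cfg.ConfigOpts()
conf.register_opts(opts, group='oslo_messaging_rabbit')
conf(args=[], default_config_files=['/etc/nova/nova.conf'])

print(conf.oslo_messaging_rabbit.kombu_reconnect_delay)  # expect 3.0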

And now in nova-api.log:

2015-10-20 15:55:41.260 8909 ERROR oslo_messaging._drivers.impl_rabbit [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] AMQP server on controller1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2015-10-20 15:55:44.266 8909 ERROR oslo_messaging._drivers.impl_rabbit [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] AMQP server on controller1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2015-10-20 15:55:47.272 8909 DEBUG amqp [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] Start from server, version: 0.9, properties: {u'information': u'Licensed under the MPL.  See http://www.rabbitmq.com/', u'product': u'RabbitMQ', u'copyright': u'Copyright (C) 2007-2013 GoPivotal, Inc.', u'capabilities': {u'exchange_exchange_bindings': True, u'connection.blocked': True, u'authentication_failure_close': True, u'basic.nack': True, u'consumer_priorities': True, u'consumer_cancel_notify': True, u'publisher_confirms': True}, u'platform': u'Erlang/OTP', u'version': u'3.2.4'}, mechanisms: [u'AMQPLAIN', u'PLAIN'], locales: [u'en_US'] _start /usr/lib/python2.7/dist-packages/amqp/connection.py:754
2015-10-20 15:55:47.275 8909 INFO oslo_messaging._drivers.impl_rabbit [req-b5dee03a-0fd7-420b-a962-c222362d8a7f 27e5ecf3a76c4db494f424bdb9188de1 78620f45f5e740c7b302e63717201a66 - - -] Reconnected to AMQP server on controller3:5672

If you have many nova-api instances (one master and several child processes), you will see the reconnect messages above repeated many times, until every child process has failed over. During that period you may see longer delays (around 30 seconds) in compute REST API calls that need AMQP (like nova pause <vm_id>).
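
One way to make that stall visible is to time an AMQP-backed call in a loop while you stop a RabbitMQ node. A rough sketch using the nova CLI; the instance UUID is a placeholder and the usual OS_* credentials are assumed to be exported:

import subprocess
import time

VM_ID = '00000000-0000-0000-0000-000000000000'  # placeholder instance UUID

while True:
    start = time.time()
    # pause/unpause both go through the compute RPC, i.e. through RabbitMQ,
    # so the round-trip time reflects how long the AMQP failover takes.
    subprocess.call(['nova', 'pause', VM_ID])
    subprocess.call(['nova', 'unpause', VM_ID])
    print('pause/unpause round trip: %.1f seconds' % (time.time() - start))
    time.sleep(5)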