Why are nova and neutron services going down from time to time?
OS: Red Hat 6.5
OpenStack: Havana
I have seen a problem where the nova-compute service is reported as down: the nova-compute process is still running on the node, but the nova service-list command shows its state as down. Sometimes only a few nodes are affected, but most of the time they all go down at the same time.
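To make it easier to record which services flip to down and when, the table printed by nova service-list can be parsed with a few lines of Python. This is only a minimal sketch: the sample table, its host names, and the helper names parse_table and down_services are made up for illustration, and the column names (Binary, Host, State) are assumed to match what the client prints on this release.

```python
# Sketch: list the services that a "nova service-list" table reports as down.
# SAMPLE is hypothetical output; real column names may differ by release.

SAMPLE = """
+----------------+-----------+----------+---------+-------+----------------------------+
| Binary         | Host      | Zone     | Status  | State | Updated_at                 |
+----------------+-----------+----------+---------+-------+----------------------------+
| nova-conductor | ctrl01    | internal | enabled | up    | 2014-03-05T17:18:40.000000 |
| nova-compute   | compute01 | nova     | enabled | down  | 2014-03-05T17:10:02.000000 |
| nova-compute   | compute02 | nova     | enabled | down  | 2014-03-05T17:10:05.000000 |
+----------------+-----------+----------+---------+-------+----------------------------+
"""


def parse_table(text):
    """Turn an ASCII table into a list of dicts keyed by the header row."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data-looking line is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def down_services(text):
    """Return (binary, host) pairs whose State column reads 'down'."""
    return [
        (row["Binary"], row["Host"])
        for row in parse_table(text)
        if row.get("State") == "down"
    ]


if __name__ == "__main__":
    for binary, host in down_services(SAMPLE):
        print("%s on %s is reported down" % (binary, host))
```

In practice the text would come from the real command output rather than SAMPLE, for example captured to a file by a cron job so the timestamps can be compared with the conductor log.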
When this happens, the conductor log shows the following error message. After restarting openstack-nova-conductor, everything goes back to normal. It seems to be an issue with qpid.
conductor.log
2014-03-05 17:18:51.896 42263 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2014-03-05 17:18:51.896 42263 TRACE root Traceback (most recent call last):
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/nova/openstack/common/excutils.py", line 78, in inner_func
2014-03-05 17:18:51.896 42263 TRACE root return infunc(*args, **kwargs)
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 709, in _consumer_thread
2014-03-05 17:18:51.896 42263 TRACE root self.consume()
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 700, in consume
2014-03-05 17:18:51.896 42263 TRACE root it.next()
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 617, in iterconsume
2014-03-05 17:18:51.896 42263 TRACE root yield self.ensure(_error_callback, _consume)
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 551, in ensure
2014-03-05 17:18:51.896 42263 TRACE root return method(*args, **kwargs)
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/nova/openstack/common/rpc/impl_qpid.py", line 608, in _consume
2014-03-05 17:18:51.896 42263 TRACE root nxt_receiver = self.session.next_receiver(timeout=timeout)
2014-03-05 17:18:51.896 42263 TRACE root File "<string>", line 6, in next_receiver
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 660, in next_receiver
2014-03-05 17:18:51.896 42263 TRACE root if self._ecwait(lambda: self.incoming, timeout):
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 50, in _ecwait
2014-03-05 17:18:51.896 42263 TRACE root result = self._ewait(lambda: self.closed or predicate(), timeout)
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 566, in _ewait
2014-03-05 17:18:51.896 42263 TRACE root result = self.connection._ewait(lambda: self.error or predicate(), timeout)
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages/qpid/messaging/endpoints.py", line 209, in _ewait
2014-03-05 17:18:51.896 42263 TRACE root self.check_error()
2014-03-05 17:18:51.896 42263 TRACE root File "/usr/lib/python2.6/site-packages ...
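To correlate these errors with the times the services show as down, the conductor log can be scanned for the "Unexpected exception occurred N time(s)" lines. The following is a minimal sketch, assuming the log lines keep the layout shown above (timestamp, PID, level, logger, request tag, message); the function name find_retry_errors is hypothetical, and the log path is whatever file is passed on the command line.

```python
# Sketch: pull the "Unexpected exception occurred" entries out of a log
# that uses the same line layout as the conductor.log excerpt above.
import re
import sys
from datetime import datetime

RETRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "  # timestamp
    r"(\d+) ERROR \S+ \[[^\]]*\] "                    # pid, level, logger, tag
    r"Unexpected exception occurred (\d+) time\(s\)"  # retry counter
)


def find_retry_errors(lines):
    """Return a list of (timestamp, pid, retry_count) for matching lines."""
    hits = []
    for line in lines:
        match = RETRY_RE.match(line)
        if match is None:
            continue  # TRACE lines and unrelated messages are ignored
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        hits.append((stamp, int(match.group(2)), int(match.group(3))))
    return hits


if __name__ == "__main__":
    # Usage: python scan_log.py <path to conductor log>
    with open(sys.argv[1]) as handle:
        for stamp, pid, count in find_retry_errors(handle):
            print("%s pid=%d retry=%d" % (stamp.isoformat(), pid, count))
```

If the first retry timestamp consistently precedes the moment nova service-list starts reporting down, that would support the suspicion that the messaging connection is the trigger, though it does not by itself identify the cause.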