Rescheduling instance fails if the first host that's chosen is in a failed state

asked 2015-07-21 15:19:01 -0500

I'm running into a really strange error. I'm running the Mellanox neutron-mlnx-agent on all the compute hosts alongside eswitchd. Once in a while the eswitchd daemon fails, causing the neutron-mlnx-agent to fail as well (this has nothing to do with the error I'm trying to fix right now). If I try to launch an instance, and the node where the neutron-mlnx-agent has failed is the first host the scheduler picks, launching the instance there will obviously fail. It fails with the following error message (taken directly from nova-compute.log):

 Unexpected vif_type=binding_failed

This is all expected behaviour. OpenStack will now try to reschedule this instance on other hosts before giving up. This is also expected behaviour. However, the instance fails with the same error on every other host it is rescheduled to, even though all the other hosts are healthy. If I remove the unhealthy node from the OpenStack hosts, the scheduler is able to launch an instance just fine on the nodes where it previously failed. In case this is too confusing, I'll try to give an example:

Three total compute hosts:
A -- neutron-mlnx-agent is in a failed state
B -- healthy node
C -- healthy node

I submit a request to launch an instance. The scheduler will try the following order:

launch instance on A -> fails with 'Unexpected vif_type=binding_failed'
reschedule instance on B -> fails with 'Unexpected vif_type=binding_failed'
reschedule instance on C -> fails with 'Unexpected vif_type=binding_failed'

At this point I take host A out of the OpenStack available hosts and attempt to launch another instance. The scheduler will do the following:

 launch instance on B -> Successful

It's almost as if that error is passed along with the request to reschedule, and the nova-compute daemon on the new host just places the instance directly in an error state. I'm wondering if anyone here has run into a similar issue, or if you have any pointers as to what I should be looking into.
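One way to test that hypothesis might be to check, between scheduling attempts, whether the Neutron port created for the instance still reports a failed binding and which host it is bound to. The sketch below assumes port dictionaries shaped like the ones the Neutron API returns to an admin, with the `binding:vif_type` and `binding:host_id` attributes; the helper name and the sample data (port IDs and host names) are hypothetical.

```python
def ports_with_failed_binding(ports):
    """Return (port id, bound host) pairs for ports whose binding failed."""
    return [
        (port["id"], port.get("binding:host_id"))
        for port in ports
        if port.get("binding:vif_type") == "binding_failed"
    ]


# Hypothetical sample data; in practice this would come from the Neutron API.
ports = [
    {"id": "port-1", "binding:host_id": "compute-a",
     "binding:vif_type": "binding_failed"},
    {"id": "port-2", "binding:host_id": "compute-b",
     "binding:vif_type": "ovs"},
]

print(ports_with_failed_binding(ports))
```

If the port still shows `binding_failed` after the instance has been handed to host B or C, that would point at the port binding rather than at the agents on those hosts.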

Update* I've started looking more into the compute logs on servers B/C in debug mode. Before it actually tries to launch the image, nova attempts to update its cache. The following line from the logs is relevant:

2015-07-22 11:45:37.456 199089 DEBUG [-] Updating cache with info: [VIF({'profile': {}, 'ovs_interfaceid': None, 'network': Network({'bridge': None, 'subnets': [Subnet({'ips': [FixedIP({'meta': {}, 'version': 4, 'type': 'fixed', 'floating_ips': [], 'address': u''})], 'version': 4, 'meta': {}, 'dns': [IP({'meta': {}, 'version': 4, 'type': 'dns', 'address': u''})], 'routes': [], 'cidr': u'', 'gateway': IP({'meta': {}, 'version': 4, 'type': 'gateway', 'address': u''})})], 'meta': {'injected': False, 'tenant_id': u'504e44b46dfd41c89beabd06a67f7f3d'}, 'id': u'898898fa-8867-4cd2-a224-5a08ddaafdeb', 'label': u'hsc_net'}), 'devname': u'tapcd40b4f3-fe', 'vnic_type': u'normal', 'qbh_params': None, 'meta': {}, 'details': {}, 'address': u'fa:16:3e:4a:7f:ab', 'active': False, 'type': u'binding_failed', 'id': u'cd40b4f3-fe24-4abc-b520-934baee750ac', 'qbg_params': None})] update_instance_cache_with_nw_info /usr/lib/python2 ...
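
To find every occurrence of this in a long debug log, a small filter along the following lines could help. It is only a sketch: the function name is made up, the first sample line is a shortened version of the line above, and the second sample line is invented for contrast.

```python
import re

# Matches the VIF 'type' field as it appears in the debug line above.
FAILED_VIF = re.compile(r"'type': u?'binding_failed'")


def failed_cache_updates(log_lines):
    """Return timestamps of cache updates whose VIF type is binding_failed."""
    hits = []
    for line in log_lines:
        if "Updating cache with info" in line and FAILED_VIF.search(line):
            # The first two whitespace-separated fields are the date and time.
            hits.append(" ".join(line.split()[:2]))
    return hits


sample = [
    "2015-07-22 11:45:37.456 199089 DEBUG [-] Updating cache with info: "
    "[VIF({'active': False, 'type': u'binding_failed'})]",
    "2015-07-22 11:45:38.001 199089 DEBUG [-] some unrelated line",
]

print(failed_cache_updates(sample))
```

Comparing these timestamps on hosts B and C with the time of the first failure on host A would show whether the cached VIF is already marked `binding_failed` when the rescheduled request arrives.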
Looks strange to me also. Following it.

While the instance is being rescheduled on B/C, do you see anything in the Mellanox neutron-mlnx-agent log?

Ranjit ( 2015-07-22 01:32:22 -0500 )

@Ranjit I updated my question at the end. Also there are no entries in that log that stand out.

Florin ( 2015-07-22 10:59:42 -0500 )

1 answer

answered 2015-07-23 05:33:55 -0500

Can you set the options below in nova.conf and restart the necessary services?

vif_plugging_is_fatal: false
vif_plugging_timeout: 0
I tried this earlier and it didn't make a difference. Thanks for the suggestion though.

Florin ( 2015-07-23 10:11:44 -0500 )

