Nova-scheduler overloads compute host when deploying multiple instances
Hello all,
I'm running into an issue with the nova-scheduler service (Filter Scheduler, enabled by default) and the Hyper-V compute driver. I am deploying several instances with a single 'nova boot' command across five nova-compute hosts. The scheduler starts handing out the instances to separate hosts and they start to come up. After a while I notice that some instances never boot on the host they were scheduled to. Those instances have been rescheduled 3 times by the scheduler, so I would expect them to be reported to the controller in an 'ERROR' state. Instead, on the controller they hang in a 'Building' state, even though they are not actually building on any of the compute nodes.
What puzzles me is the following (using compute-1, compute-2, and compute-3 as the compute node names): compute-1 and compute-2 have enough resources to handle the instances that fail to boot on compute-3, but the instances never get scheduled there. Alternatively, the reschedule attempts always land on compute-3, where scheduling fails (since it is out of resources), and the instance fails out there after the 3rd attempt. I checked the scheduler and compute logs and found that some of the resources reported by the compute services are negative (such as a negative disk value). After the scheduler weighs the hosts, it always chooses compute-3, even though it isn't the best option.
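To make the suspicion concrete, here is a toy model of filter-then-weigh host selection. It is only a sketch and not Nova's implementation: the function name, the dictionary keys, and all the numbers are hypothetical. It shows how a host that is really exhausted can still win the weighing if the figures the scheduler holds for it are wrong, and how it drops out once the figures are accurate.

```python
# Toy model of filter-then-weigh host selection. NOT Nova's code;
# every name and number below is hypothetical.

def pick_host(hosts, req_ram_mb, req_disk_gb):
    """Return the name of the passing host with the most free RAM, or None."""
    # Filter step: drop hosts that cannot fit the request. A negative
    # free value fails this comparison automatically.
    candidates = [
        h for h in hosts
        if h["free_ram_mb"] >= req_ram_mb and h["free_disk_gb"] >= req_disk_gb
    ]
    if not candidates:
        return None
    # Weigh step: prefer the host with the most free RAM.
    return max(candidates, key=lambda h: h["free_ram_mb"])["name"]


hosts = [
    {"name": "compute-1", "free_ram_mb": 8192, "free_disk_gb": 100},
    {"name": "compute-2", "free_ram_mb": 6144, "free_disk_gb": 80},
    # Assumed scenario: compute-3 is actually full, but the scheduler's
    # view of it is wrong and still looks roomy.
    {"name": "compute-3", "free_ram_mb": 16384, "free_disk_gb": 200},
]
print(pick_host(hosts, 2048, 20))  # compute-3 wins on the wrong figures

# With accurate figures (negative free disk) compute-3 is filtered out.
hosts[2] = {"name": "compute-3", "free_ram_mb": 512, "free_disk_gb": -12}
print(pick_host(hosts, 2048, 20))  # compute-1
```

If something like this is happening, the interesting question is which side holds the wrong numbers: the report sent by the compute node, or the scheduler's bookkeeping between reports.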
Part of the problem seems to be that the scheduler is somehow getting faulty resource information from the Hyper-V compute nodes. I enabled verbose logging on the controller, and the scheduler log shows the following during the periodic resource updates:
2013-04-15 16:06:05.460 6613 DEBUG nova.openstack.common.rpc.amqp [-] received {u'contextroles': [], u'contextrequestid': u'req-4ace290b-c5a9-4bf8-a706-2af1bbe37b50', u'contextquotaclass': None, u'contextprojectname': None, u'contextservicecatalog': [], u'contextusername': None, u'contextauthtoken': '<sanitized>', u'args': {u'servicename': u'compute', u'host': u'CN10.private.cloud.com', u'capabilities': [{u'hostmemoryfreecomputed': 4668, u'diskavailable': 241, u'supportedinstances': [[u'i686', u'hyperv', u'hvm'], [u'x8664', u'hyperv', u'hvm']], u'hostmemoryoverhead': 191912, u'hostip': u'127.0.0.1', u'hypervisorhostname': u'CN10', u'hostmemoryfree': 4668, u'disktotal': 558, u'hostmemorytotal': 196580, u'diskused': 317}]}, u'contexttenant': None, u'contextinstancelockchecked': False, u'contexttimestamp': u'2013-04-15T21:06:12.356000', u'contextisadmin': True, u'version': u'2.4', u'contextprojectid': None, u'contextuser': None, u'contextreaddeleted': u'no', u'contextuserid': None, u'method': u'updateservicecapabilities', u'contextremoteaddress': None} safelog /usr/lib/python2.6/site-packages/nova/openstack/common/rpc/common.py:272
2013-04-15 16:06:05.461 6613 DEBUG nova.openstack.common.rpc.amqp [-] unpacked context: {'readdeleted': u'no', 'projectname': None, 'userid': None, 'roles': [], 'timestamp ...
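For anyone who wants to check reports like this one mechanically, here is a small sanity check. It is a hypothetical helper, not part of Nova. The key names are spelled exactly as they appear in the excerpt above and may differ in a live payload, and the one-unit rounding tolerance on the disk figures is my own assumption.

```python
# Hypothetical helper, not part of Nova: sanity-check one capabilities dict.
# Key names are copied from the log excerpt as pasted.

NUMERIC_KEYS = ("diskavailable", "diskused", "disktotal",
                "hostmemoryfree", "hostmemorytotal")


def check_capabilities(caps):
    """Return a list of human-readable problems found in the report."""
    problems = []
    for key in NUMERIC_KEYS:
        value = caps.get(key)
        if value is None:
            problems.append("missing %s" % key)
        elif value < 0:
            problems.append("%s is negative: %d" % (key, value))
    if problems:
        # Consistency checks are meaningless on missing or negative values.
        return problems
    # Allow a difference of 1 for rounding (an assumption, not a Nova rule).
    if abs(caps["disktotal"] - caps["diskused"] - caps["diskavailable"]) > 1:
        problems.append("disk total/used/available do not add up")
    if caps["hostmemoryfree"] > caps["hostmemorytotal"]:
        problems.append("free memory exceeds total memory")
    return problems


# Values copied from the CN10 report above.
cn10 = {
    "diskavailable": 241, "diskused": 317, "disktotal": 558,
    "hostmemoryfree": 4668, "hostmemorytotal": 196580,
}
print(check_capabilities(cn10))  # []
```

The CN10 report passes these checks (558 - 317 = 241, and free memory is below total), so if negative values are reaching the scheduler they would have to show up in a different report or be computed on the scheduler side.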