Instances stuck in spawning randomly

asked 2018-10-19 02:42:03 -0500

Hi, I have an OpenStack Pike environment with 3 compute nodes running Ubuntu 16.04. I run into problems when launching multiple instances at once on the same compute node, and the issue can be reproduced on all three nodes at any time. For example, when I try to spawn 10 instances at once, one or two of them get stuck in the spawning state. I can't delete the stuck instances, even after running reset-state against them, and from that moment no new instances can be launched on the hypervisor showing this behaviour. Looking at the hypervisor:
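
For reference, this is roughly how I launch the batch and try to clean up a stuck instance (the flavor, image, network and host names below are just examples from my environment):

    # boot 10 instances at once, pinned to one compute node
    openstack server create --flavor m1.small --image ubuntu-16.04 \
      --network private --availability-zone nova:compute-01 \
      --min 10 --max 10 test-batch

    # attempted recovery of a stuck instance - has no effect
    nova reset-state --active <instance-uuid>
    openstack server delete <instance-uuid>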

  • I have installed libvirtd (libvirt) 3.6.0
  • nova-compute stops logging to nova-compute.log
  • I can't list the running instances with the 'virsh list' command; it gets stuck and prints no output
  • in the libvirtd.service status output, the instances stuck in spawning show up as a libvirtd process ("12237 /usr/sbin/libvirtd -l") instead of a qemu process ("18095 /usr/bin/qemu-system-x86_64 -name guest=instance-00000b93,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-243-instance-00000b93/master-key.aes...."); the exact commands I run are shown after this list
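
These are the diagnostic commands I run on the affected hypervisor (the log path is the default one on Ubuntu 16.04):

    # hangs indefinitely, prints nothing
    virsh list --all

    # no new entries appear while an instance is stuck in spawning
    tail -f /var/log/nova/nova-compute.log

    # the stuck instance shows up as an extra "/usr/sbin/libvirtd -l" process
    # in the CGroup section instead of a qemu-system-x86_64 process
    systemctl status libvirtd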

The only way I've found to "recover" the hypervisor is to kill the libvirtd process associated with the stuck instances. After I do that, nova-compute resumes logging and everything goes back to normal. The stuck instances go into the error state and can then be removed and redeployed.
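
Roughly, the recovery steps look like this (12237 is the example PID from the list above):

    # find the extra libvirtd child process tied to the stuck instance
    systemctl status libvirtd

    # kill it; nova-compute resumes logging right after
    kill -9 12237

    # the instance goes to ERROR and can now be deleted and redeployed
    openstack server delete <instance-uuid>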

Has anyone come across the same behaviour? Where should I look?

Many thanks! Cristian
