Problem evacuating a VM with Ceph as the storage backend

asked 2018-07-04 10:32:56 -0600

We have deployed OpenStack Queens using kolla-ansible, and all Nova and Cinder functionality with Ceph as the storage backend works fine except for evacuation.

When we power off a compute node to test evacuation, Nova detects that the VM's disk is on shared storage and reuses the same disk on the new host to recover the VM. However, the VM fails to boot because of a kernel panic caused by failures on a read-only file system.

We traced the problem down to the Ceph level and found that the volume (image) in Ceph has an exclusive lock held by an address on the failed host, so the VM on the new host cannot use it.

We can work around the problem by removing the lock before starting the evacuation.
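For reference, the workaround can be scripted roughly as below. This is only a sketch of what we do by hand with `rbd lock ls` and `rbd lock rm`, not something Nova or Cinder does. The helper names are made up for illustration, and the two JSON shapes handled by `parse_lockers` are an assumption about how different Ceph releases print `rbd lock ls --format json`, so check the output on your own cluster first.

```python
import json
import shlex
import subprocess


def parse_lockers(raw_json):
    """Normalize `rbd lock ls --format json` output to (lock_id, locker) pairs.

    Assumption: depending on the Ceph release, the JSON is either a list of
    objects with "id" and "locker" keys, or an object keyed by lock ID whose
    values contain a "locker" key.
    """
    data = json.loads(raw_json) if raw_json.strip() else []
    if isinstance(data, dict):
        return [(lock_id, info["locker"]) for lock_id, info in data.items()]
    return [(entry["id"], entry["locker"]) for entry in data]


def build_unlock_commands(image_spec, lockers):
    """Build one `rbd lock rm <image-spec> <lock-id> <locker>` per lock."""
    return [["rbd", "lock", "rm", image_spec, lock_id, locker]
            for lock_id, locker in lockers]


def remove_locks(image_spec, dry_run=True):
    """List the locks on an image and remove them (or only print the commands)."""
    listing = subprocess.run(
        ["rbd", "lock", "ls", image_spec, "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    commands = build_unlock_commands(image_spec, parse_lockers(listing))
    for cmd in commands:
        if dry_run:
            # Lock IDs can contain spaces, so quote each argument when printing.
            print(" ".join(shlex.quote(part) for part in cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

We would call it as `remove_locks("vms/<instance-uuid>_disk")`, where the pool name and image name are placeholders for whatever the deployment actually uses. With `dry_run=True` it only prints the commands. A lock should only be removed when the failed host is really down, since breaking the lock of a client that is still writing can corrupt the image.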

I should mention that when I attach an additional volume to a VM, it does not get an exclusive lock.

I tried to trace the process through the Nova and Cinder code, but I couldn't find where this exclusive lock is taken. I'm pretty sure, though, that the rbd driver (in both Cinder and Nova) doesn't handle the unlocking.

So my question is: is there something wrong in my configuration, or is this a bug in the OpenStack code?
