nova rebuild fails when the instance has a Ceph snapshot

Hi Team,

We run OpenStack with a Ceph storage backend, and we use Ceph snapshots as our backup strategy for the root drive and any additional attached drives. What we have observed is that whenever we run nova rebuild on a server that has a snapshot in Ceph, the rebuild completes without error but no actual rebuild takes place. After the rebuild we still have the same corrupted VM, or, when we rebuild the VM to a fresh OS (for example from Ubuntu to CentOS), we still end up with the same Ubuntu. What all these cases have in common is that Ceph holds a snapshot of the instance's root drive. For instances without a Ceph snapshot, the rebuild works like a charm.

Has anyone faced this kind of issue? Please advise on how to resolve it.

Regards, Ram.

I haven't done much rebuilding, so my experience is limited here. But have you turned on debug logs for nova? I'd expect to see the commands it's trying to execute; maybe there's a hint about what went wrong. If I have the time I'll try to reproduce that.

Your description is accurate, I was able to reproduce this. I'll try to find out more.

Alright, I got it. When you rebuild an instance, the underlying rbd image has to be deleted so that the same ID can be reused. You can see this for a really short time if you run watch -n 0.2 rbd info pool/image_disk while an instance without a snapshot is rebuilding. But since it's not possible to delete an instance that has rbd snapshots (via Horizon or the CLI), the rebuild fails and the instance is reverted to a working state.
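To check up front whether an instance would hit this, something like the following might help. It is only a sketch: the pool name is whatever your deployment uses, the "<uuid>_disk" image name follows the pool/image_disk pattern mentioned above, and root_disk_spec and has_snapshots are made-up helper names, not part of rbd or nova. It also assumes that plain-text "rbd snap ls" prints nothing for an image without snapshots.

```shell
#!/usr/bin/env bash
# Sketch only: pool name and "<uuid>_disk" naming are assumptions
# about the deployment; both helper names are hypothetical.

# Build the RBD image spec (pool/image) for an instance's root disk.
root_disk_spec() {
    local pool="$1" instance_uuid="$2"
    printf '%s/%s_disk' "$pool" "$instance_uuid"
}

# Succeed (exit status 0) if the image has at least one snapshot.
# Assumes "rbd snap ls" produces no output when there are none.
has_snapshots() {
    local spec="$1"
    [ -n "$(rbd snap ls "$spec")" ]
}
```

For example, has_snapshots "$(root_disk_spec vms <instance-uuid>)" && echo "rebuild would be blocked" would flag an affected instance, with "vms" standing in for your actual pool.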

Many thanks for your comments. I wrote a job for instances that have Ceph snapshots: it first shuts off the VM, deletes the snapshots on the Ceph backend, and then triggers a rebuild from OpenStack. That works.
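For anyone wanting to script the same workaround, here is a rough sketch of what such a job could look like. This is not the actual job described above; the function name, the pool argument and the "<uuid>_disk" image naming are assumptions. Note that "rbd snap purge" removes all snapshots of the image, so the backups are gone afterwards, and it will fail on protected snapshots, which would need "rbd snap unprotect" first.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helper, assuming the root disk is stored
# as "<pool>/<instance uuid>_disk".

rebuild_after_purging_snapshots() {
    local pool="$1" instance_uuid="$2" image="$3"
    local spec="${pool}/${instance_uuid}_disk"

    # 1. Shut the instance off and wait until nova reports SHUTOFF.
    openstack server stop "$instance_uuid" || return 1
    until [ "$(openstack server show -f value -c status "$instance_uuid")" = "SHUTOFF" ]; do
        sleep 2
    done

    # 2. Delete every snapshot of the root disk (destroys the backups).
    rbd snap purge "$spec" || return 1

    # 3. Trigger the rebuild with the requested image.
    openstack server rebuild --image "$image" "$instance_uuid"
}
```

It would be called as rebuild_after_purging_snapshots <pool> <instance-uuid> <image>. The wait loop has no timeout, so a real job would probably want one.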

Is there a simple way to find the total space consumed by snapshots alone in Ceph?

Thanks for the explanation. This appears to exist in OpenStack Queens with Ceph Luminous. Not sure whether it will be fixed in later releases.

Although we're already running Nautilus (and this also happens there), our cloud still runs Ocata, so it's hard to say whether this still applies. We plan to upgrade OpenStack to a current release, and I'm curious whether this has been fixed, since Ceph has become one of the most used storage backends for OpenStack.

