Certain VMs fail to migrate

asked 2018-09-17 09:53:34 -0600

mojavezax

updated 2018-09-14 13:00:55 -0600

Hi All,

Got a sore head after banging it against the wall for over a week. I've seen many similar posts, but none of them are helpful.

I'm using Pike with Ceph Luminous as the back-end storage, all on CentOS 7.5. Everything is at the latest patch level.

I have 3 compute hosts. Some of my VMs migrate between all of them just fine; others won't migrate at all. It may be the larger ones that won't migrate, but one of the affected VMs has only 2 GB of RAM and a 16 GB disk, so it isn't large at all.

After I initiate the migration, I see a message in the nova-compute.log on the target node (called Krypton) that lists the CPU capabilities. After one minute I see the error "Initialize connection failed for volume ..." and an HTTP 500 error. That's followed by a few more messages, then a mile-long stack trace from oslo_messaging.rpc.server.

At that same second, log messages start appearing in the nova-compute.log on the source node (called Xenon), starting with "Pre live migration failed at http://krypton.example.com: MessagingTimeout: Timed out waiting for a reply to message ID ..." Then a stack trace from nova.compute.manager appears. Lastly, I see an INFO message saying "No calling threads waiting for msg_id ..."

Any ideas what's causing this? Are there settings I need to adjust? Any suggestions on how to further debug this?

Thanks!

Command:

openstack server migrate --live krypton.example.com b5f912f5-3c49-466b-ad43-525c0476dbf9

Krypton:/var/log/nova/nova-compute.log (target node):

2018-09-14 11:31:39.939 169568 INFO nova.virt.libvirt.driver [req-6c25aada-06e2-4ab7-bd67-e8e2cf49cf29 b4d3c8b03a8d432c999e101f22f8e19e c17f7f6ae0f44372a25439fe22357500 - default default] Instance launched has CPU info: {"vendor": "Intel", "model": "Broadwell-IBRS", "arch": "x86_64", "features": ["pge", "avx", "xsaveopt", "clflush", "sep", "rtm", "tsc_adjust", "tsc-deadline", "dtes64", "stibp", "invpcid", "tsc", "fsgsbase", "xsave", "smap", "vmx", "erms", "xtpr", "cmov", "hle", "smep", "ssse3", "est", "pat", "monitor", "smx", "pbe", "lm", "msr", "adx", "3dnowprefetch", "nx", "fxsr", "syscall", "tm", "sse4.1", "pae", "sse4.2", "pclmuldq", "cx16", "pcid", "fma", "vme", "popcnt", "mmx", "osxsave", "cx8", "mce", "de", "rdtscp", "ht", "dca", "lahf_lm", "abm", "rdseed", "pdcm", "mca", "pdpe1gb", "apic", "sse", "f16c", "pse", "ds", "invtsc", "pni", "tm2", "avx2", "aes", "sse2", "ss", "ds_cpl", "arat", "bmi1", "bmi2", "acpi", "spec-ctrl", "fpu", "ssbd", "pse36", "mtrr", "movbe", "rdrand", "x2apic"], "topology": {"cores": 8, "cells": 2, "threads": 2, "sockets": 1}}
2018-09-14 11:32:23.276 169568 WARNING nova.compute.resource_tracker [req-6c25aada-06e2-4ab7-bd67-e8e2cf49cf29 b4d3c8b03a8d432c999e101f22f8e19e c17f7f6ae0f44372a25439fe22357500 - default default] Instance b5f912f5-3c49-466b-ad43-525c0476dbf9 has been moved to another host xenon.example.com(xenon.example.com). There are allocations remaining against the source host that might need to be removed: {u'resources': {u'VCPU': 4, u'MEMORY_MB': 8192, u'DISK_GB': 40}}.
2018-09-14 11:32:23.301 169568 INFO nova.compute.resource_tracker [req-6c25aada-06e2-4ab7-bd67-e8e2cf49cf29 b4d3c8b03a8d432c999e101f22f8e19e c17f7f6ae0f44372a25439fe22357500 - default default] Final resource view: name=krypton.example.com phys_ram=196510MB used_ram=14848MB phys_disk=18602GB used_disk=88GB total_vcpus=32 used_vcpus=9 pci_stats=[]
2018-09-14 11:32:40.954 169568 ERROR nova.volume.cinder [req-6c25aada-06e2-4ab7-bd67-e8e2cf49cf29 b4d3c8b03a8d432c999e101f22f8e19e c17f7f6ae0f44372a25439fe22357500 - default default] Initialize connection failed for volume e4e411d7-59e7-463b-8598-54a4838aa898 on host krypton.example.com. Error: The server has either erred or ...

Comments

It looks as though the HTTP 500 is generated by Cinder. See if you find references to the volume e4e411d7-59e7-463b-8598-54a4838aa898 in the Cinder log. The request ID req-fdf84a9b-60d6-496e-a710-7265dce7c79d might also be useful when going through the Cinder logs.

Bernd Bausch ( 2018-09-17 10:19:08 -0600 )
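
One quick way to run that check, as a sketch: assuming the default RDO log locations under /var/log/cinder/ on CentOS (adjust the paths if your deployment logs elsewhere), grep the Cinder API and volume logs for the volume ID and the request ID:

grep e4e411d7-59e7-463b-8598-54a4838aa898 /var/log/cinder/api.log /var/log/cinder/volume.log
grep req-fdf84a9b-60d6-496e-a710-7265dce7c79d /var/log/cinder/api.log /var/log/cinder/volume.log

If Cinder really is the source of the HTTP 500, an ERROR line with a traceback should appear in one of those logs around the timestamp of the failed initialize_connection call (11:32:40 above).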

However, I wonder where the earlier message "the instance has been moved to another host" comes from. Why has it been moved? The Cinder problem could be a consequence of that move, but perhaps Cinder's messages give some clue about what happened.

Bernd Bausch ( 2018-09-17 10:20:56 -0600 )
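
A quick way to see what Nova and Placement currently believe about that instance (a sketch; admin credentials are assumed, and the second command needs the osc-placement client plugin):

openstack server show b5f912f5-3c49-466b-ad43-525c0476dbf9 -c OS-EXT-SRV-ATTR:host -c status
openstack resource provider allocation show b5f912f5-3c49-466b-ad43-525c0476dbf9

If allocations show up against both xenon and krypton, a leftover allocation from an earlier failed migration attempt could explain the "has been moved to another host" warning in the resource tracker.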