
OpenStack failure caused by a compute node running out of memory

Swap usage on the compute node reached 100%, so the operating system invoked the OOM killer, which killed the instance's qemu-kvm process. When I later tried to start the instance again, the instance's files could no longer be found in Ceph storage.

messages log:
Jul 1 22:06:04 node7 kernel: [52503] 0 52503 26974 24 11 0 0 sleep
Jul 1 22:06:04 node7 kernel: Out of memory: Kill process 59364 (qemu-kvm) score 60 or sacrifice child
Jul 1 22:06:04 node7 kernel: Killed process 59364 (qemu-kvm) total-vm:18006708kB, anon-rss:16825612kB, file-rss:0kB
Jul 1 22:06:04 node7 journal: internal error: End of file from monitor
Jul 1 22:06:06 node7 kernel: qbrf555ac02-51: port 2(tapf555ac02-51) entered disabled state
Jul 1 22:06:06 node7 kernel: device tapf555ac02-51 left promiscuous mode
Jul 1 22:06:06 node7 kernel: qbrf555ac02-51: port 2(tapf555ac02-51) entered disabled state
Jul 1 22:06:07 node7 NetworkManager[1374]: <info> (tapf555ac02-51): device state change: activated -> unmanaged (reason 'removed') [100 10 36]
Jul 1 22:06:07 node7 NetworkManager[1374]: <warn> (qbrf555ac02-51): failed to detach bridge port tapf555ac02-51
Jul 1 22:06:07 node7 systemd-machined: Machine qemu-198-instance-000002a0 terminated.
Jul 1 22:06:07 node7 NetworkManager[1374]: <warn> (tapf555ac02-51): failed to disable userspace IPv6LL address handling
Jul 1 22:06:07 node7 dbus-daemon: dbus[1286]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Jul 1 22:06:07 node7 dbus[1286]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Jul 1 22:06:07 node7 systemd: Starting Network Manager Script Dispatcher Service...
Jul 1 22:06:07 node7 kvm: 50 guests now active
Jul 1 22:06:07 node7 journal: End of file while reading data: Input/output error
Jul 1 22:06:07 node7 dbus-daemon: dbus[1286]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 1 22:06:07 node7 dbus[1286]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
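To make the kernel's "Killed process" line easier to read, here is a minimal Python sketch that extracts the PID, process name, and memory figures and converts them from kB to GiB. The function name `parse_oom_kill` and the regular expression are illustrative assumptions; they are written against the exact line format shown in the log above and may not match other kernel versions.

```python
import re

# Pattern written for the "Killed process" line format seen in this log
# (assumption: other kernel versions may print additional fields).
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, "
    r"anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)

KB_PER_GIB = 1024 ** 2


def parse_oom_kill(line):
    """Return details of an OOM kill as a dict, or None if the line does not match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "total_vm_gib": int(match.group("total_vm")) / KB_PER_GIB,
        "anon_rss_gib": int(match.group("anon_rss")) / KB_PER_GIB,
        "file_rss_gib": int(match.group("file_rss")) / KB_PER_GIB,
    }


sample = (
    "Jul 1 22:06:04 node7 kernel: Killed process 59364 (qemu-kvm) "
    "total-vm:18006708kB, anon-rss:16825612kB, file-rss:0kB"
)
info = parse_oom_kill(sample)
print(f"{info['name']} (pid {info['pid']}) held {info['anon_rss_gib']:.2f} GiB of anonymous memory")
```

For the line in this log, the killed qemu-kvm process was holding roughly 16 GiB of resident anonymous memory, which is consistent with a single large guest being chosen as the OOM victim.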

nova-compute log:
2017-07-01 22:06:22.710 18891 INFO nova.compute.manager [-] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] VM Stopped (Lifecycle Event)
2017-07-01 22:06:22.950 18891 INFO nova.compute.manager [req-4da3e116-8d96-4d52-9468-0ee18e59ed89 - - - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] During _sync_instance_power_state the DB power_state (1) does not match the vm_power_state from the hypervisor (4). Updating power_state in the DB to match the hypervisor.
2017-07-01 22:06:23.063 18891 WARNING nova.compute.manager [req-4da3e116-8d96-4d52-9468-0ee18e59ed89 - - - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
2017-07-01 22:06:23.280 18891 INFO nova.compute.manager [req-4da3e116-8d96-4d52-9468-0ee18e59ed89 - - - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Instance is already powered off in the hypervisor when stop is called.
2017-07-01 22:06:23.387 18891 INFO nova.virt.libvirt.driver [req-4da3e116-8d96-4d52-9468-0ee18e59ed89 - - - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Instance already shutdown.
2017-07-01 22:06:23.401 18891 INFO nova.virt.libvirt.driver [-] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Instance destroyed successfully.
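The numeric power states in these messages (1 in the database, 4 from the hypervisor) are easier to follow with names attached. The sketch below assumes the values match the constants in Nova's `nova/compute/power_state.py`; the dictionary and the helper `describe_mismatch` are illustrative only and should be checked against the installed release.

```python
# Assumed to mirror the constants in nova/compute/power_state.py;
# verify against the Nova release actually deployed.
POWER_STATES = {
    0: "NOSTATE",
    1: "RUNNING",
    3: "PAUSED",
    4: "SHUTDOWN",
    6: "CRASHED",
    7: "SUSPENDED",
}


def describe_mismatch(db_state, hypervisor_state):
    """Render a DB/hypervisor power state pair in readable form."""
    db_name = POWER_STATES.get(db_state, "UNKNOWN")
    hv_name = POWER_STATES.get(hypervisor_state, "UNKNOWN")
    return f"DB says {db_name}, hypervisor says {hv_name}"


print(describe_mismatch(1, 4))
```

Read this way, the log shows Nova noticing that an instance it believed to be running had already been shut down on the hypervisor side, which matches the qemu-kvm process having been killed by the kernel.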

When I tried to start the instance again, nova-compute logged the following errors:
2017-07-03 09:21:44.348 18891 ERROR nova.virt.libvirt.driver [req-b9cd04f0-59ee-46ea-bc3f-8d5adcc57dae 3fc63b48ddd548a39eccc02594a928e8 9b21771470814e599c107f326f102374 - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Failed to start libvirt guest
2017-07-03 09:21:44.545 18891 INFO os_vif [req-b9cd04f0-59ee-46ea-bc3f-8d5adcc57dae 3fc63b48ddd548a39eccc02594a928e8 9b21771470814e599c107f326f102374 - - -] Successfully unplugged vif VIFBridge(active=True,address=fa:16:3e:d6:4f:b1,bridge_name='qbrf555ac02-51',has_traffic_filtering=True,id=f555ac02-517c-460f-9e8c-bd0e0e433ec4,network=Network(6fffe4d0-21e1-496e-a462-800f8230fbd7),plugin='ovs',port_profile=VIFPortProfileBase,preserve_on_delete=False,vif_name='tapf555ac02-51')
2017-07-03 09:21:50.077 18891 INFO nova.compute.resource_tracker [req-b7069818-7930-4cd8-bb89-834332ec5990 - - - - -] Auditing locally available compute resources for node node7
2017-07-03 09:21:51.700 18891 INFO nova.compute.resource_tracker [req-b7069818-7930-4cd8-bb89-834332ec5990 - - - - -] Total usable vcpus: 25, total allocated vcpus: 101
2017-07-03 09:21:51.702 18891 INFO nova.compute.resource_tracker [req-b7069818-7930-4cd8-bb89-834332ec5990 - - - - -] Final resource view: name=node7 phys_ram=262109MB used_ram=208500MB phys_disk=26544GB used_disk=1660GB total_vcpus=25 used_vcpus=101 pci_stats=[]
2017-07-03 09:21:51.754 18891 INFO nova.compute.resource_tracker [req-b7069818-7930-4cd8-bb89-834332ec5990 - - - - -] Compute_service record updated for node7:node7
2017-07-03 09:21:55.618 18891 WARNING nova.virt.libvirt.storage.rbd_utils [-] rbd remove 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0_disk in pool vms failed
2017-07-03 09:21:55.620 18891 WARNING oslo.service.loopingcall [-] Function 'nova.virt.libvirt.storage.rbd_utils._cleanup_vol' run outlasted interval by 10.03 sec
2017-07-03 09:22:05.651 18891 WARNING nova.virt.libvirt.storage.rbd_utils [-] rbd remove 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0_disk in pool vms failed
2017-07-03 09:22:05.652 18891 WARNING oslo.service.loopingcall [-] Function 'nova.virt.libvirt.storage.rbd_utils._cleanup_vol' run outlasted interval by 9.03 sec
2017-07-03 09:22:14.028 18891 WARNING nova.virt.libvirt.storage.rbd_utils [-] rbd remove 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0_disk in pool vms failed
2017-07-03 09:22:14.029 18891 WARNING oslo.service.loopingcall [-] Function 'nova.virt.libvirt.storage.rbd_utils._cleanup_vol' run outlasted interval by 7.38 sec
2017-07-03 09:22:44.085 18891 INFO nova.virt.libvirt.driver [req-b9cd04f0-59ee-46ea-bc3f-8d5adcc57dae 3fc63b48ddd548a39eccc02594a928e8 9b21771470814e599c107f326f102374 - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Deleting instance files /var/lib/nova/instances/7e7e826b-c7cf-4361-8aaa-c88a8976ecc0_del
2017-07-03 09:22:44.087 18891 INFO nova.virt.libvirt.driver [req-b9cd04f0-59ee-46ea-bc3f-8d5adcc57dae 3fc63b48ddd548a39eccc02594a928e8 9b21771470814e599c107f326f102374 - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Deletion of /var/lib/nova/instances/7e7e826b-c7cf-4361-8aaa-c88a8976ecc0_del complete
2017-07-03 09:22:44.509 18891 INFO nova.compute.manager [req-b9cd04f0-59ee-46ea-bc3f-8d5adcc57dae 3fc63b48ddd548a39eccc02594a928e8 9b21771470814e599c107f326f102374 - - -] [instance: 7e7e826b-c7cf-4361-8aaa-c88a8976ecc0] Successfully reverted task state from powering-on on failure for instance.
2017-07-03 09:22:44.604 18891 ERROR oslo_messaging.rpc.server [req-b9cd04f0-59ee-46ea-bc3f-8d5adcc57dae 3fc63b48ddd548a39eccc02594a928e8 9b21771470814e599c107f326f102374 - - -] Exception during message handling
2017-07-03 09:22:44.604 18891 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2017-07-03 09:22:44.604 18891 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
2017-07-03 09:22:44.604 18891 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2017-07-03 09:22:44.604 18891 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
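The resource view line in the nova-compute log above gives a quick way to gauge how heavily node7 is committed. The following sketch parses its key=value fields and computes the vCPU and RAM ratios. The helper names `parse_resource_view` and `numeric` are hypothetical, and the figures are, as far as I understand, what Nova has allocated to instances rather than what the host kernel measures, so they are only a rough indicator.

```python
import re


def parse_resource_view(line):
    """Return the key=value fields of a resource view line as a dict of strings."""
    return dict(re.findall(r"(\w+)=(\S+)", line))


def numeric(value):
    """Strip a trailing unit such as MB or GB and return the leading integer."""
    return int(re.match(r"\d+", value).group())


# Fields copied from the resource view line in the nova-compute log.
view = parse_resource_view(
    "name=node7 phys_ram=262109MB used_ram=208500MB phys_disk=26544GB "
    "used_disk=1660GB total_vcpus=25 used_vcpus=101 pci_stats=[]"
)

vcpu_ratio = numeric(view["used_vcpus"]) / numeric(view["total_vcpus"])
ram_ratio = numeric(view["used_ram"]) / numeric(view["phys_ram"])

print(f"vCPU allocation: {vcpu_ratio:.2f}x of usable vCPUs")
print(f"RAM allocation:  {ram_ratio:.1%} of physical RAM")
```

On these numbers the node has about four times as many vCPUs allocated as are usable and roughly 80% of physical RAM allocated, so any memory used by the host itself or by guest overhead leaves little headroom before swap is needed.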

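Any help would be appreciated: why did this happen? The messages log contains the entry "Jul 1 22:06:04 node7 journal: internal error: End of file from monitor". Could this internal error be the cause?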