What causes all VMs on a compute node to become inaccessible?
We have a small OpenStack (Icehouse) deployment with one controller node and 5 compute nodes. The nodes run Ubuntu 14.04. Networking is done with Neutron and the Open vSwitch plugin/agent. The hypervisor is KVM. Three times now we have run into a situation where all virtual machines running on a compute node became inaccessible: neither SSH, ping, nor the web applications running on them worked. Horizon reported all VMs as active and running, and showed all compute services and network agents as enabled and up. I looked at the logs of the different OpenStack components on the compute nodes but found nothing helpful, though I might have missed something.
Our OpenStack has been running for about three months, or at least part of it has, since some compute nodes were added later. The problem first affected two compute nodes at the beginning of this month, at the same time, though it might not have started at exactly the same moment on both. Usage of our OpenStack was quite low, so it might have taken some time before we noticed the problem. Then a few weeks later it happened again on one compute node, which was one of those affected before. Both times I got everything working again by first shutting down the affected VMs from Horizon, then rebooting the compute node, and then starting the VMs again.
Does anybody have any idea what could be causing this problem? Alternatively, do you have any suggestions for what I should do if/when this happens again, so that I can find the root cause? I realize that it may be challenging to find the problem now that everything is working, but I would like to be prepared if it happens again.
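To make the "be prepared" part concrete, this is roughly what I have in mind to run on an affected compute node before rebooting it next time. It is only a minimal sketch: the command list is my own guess at what might be relevant (Open vSwitch state, libvirt domain list, interfaces, firewall rules, kernel log), the output directory and the names `COMMANDS`, `run_command` and `collect` are made up for illustration, and most of the commands would need to be run as root. It only uses Python 3 standard library calls. Suggestions on what else to capture are welcome.

```python
import os
import subprocess
import time

# Assumption: these are the commands worth capturing; adjust for your setup.
COMMANDS = {
    "ovs_show": ["ovs-vsctl", "show"],
    "ovs_flows_br_int": ["ovs-ofctl", "dump-flows", "br-int"],
    "virsh_list": ["virsh", "list", "--all"],
    "ip_addr": ["ip", "addr"],
    "iptables": ["iptables-save"],
    "dmesg": ["dmesg"],
}


def run_command(argv, timeout=30):
    """Run one command and return its output, or a note on why it failed."""
    try:
        proc = subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except OSError as exc:
        # Covers a missing executable as well as permission problems.
        return "[could not run {}: {}]\n".format(argv[0], exc)
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()
        return "[timed out after {}s]\n".format(timeout)
    return "[exit code {}]\n{}".format(proc.returncode, output)


def collect(commands, out_dir):
    """Write the output of each command to <out_dir>/<name>.txt."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, argv in sorted(commands.items()):
        path = os.path.join(out_dir, name + ".txt")
        with open(path, "w") as handle:
            handle.write(run_command(argv))
        written.append(path)
    return written


if __name__ == "__main__":
    # Hypothetical location; anything that survives the reboot would do.
    target = "/var/tmp/node-snapshot-" + time.strftime("%Y%m%d-%H%M%S")
    for path in collect(COMMANDS, target):
        print(path)
```

The idea would be to run it once on a healthy node as a baseline and once on the broken node, then diff the two directories.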