
Openstack Neutron stability problems with OpenVSwitch

asked 2014-03-27 05:52:35 -0500

ccowley

updated 2014-03-28 15:57:32 -0500

smaffulli

I have a fairly simple Openstack setup for a PoC. 2 nodes, both running Nova, and everything else on node 1. It is running CentOS 6 and was set up using RDO. Importantly I am using Neutron for the networking, with GRE tenant networks set up from the RDO docs for an existing network.

Periodically (every few days, I reckon) I lose all communication with Openvswitch (and thus my instances). I know it is OVS, because I can SSH into node 2, then connect to node 1 via their private network. The most telling thing I see in the logs is this:

unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)

In addition OVS is using HUGE amounts of CPU (800% on my 16-core boxes), and when I try and do a clean shutdown, it just never happens because it cannot kill ovsdb-server.

I have done some Googling and found some old suggestions based on older Openstack releases where people had OVS/kernel version mismatches. As I am running the versions from RDO I reckon I can discount that (unless Red Hat have made a massive screw up).

Has anyone else seen this? Any suggestions?

PS: Do not tell me to recompile Openvswitch, for various reasons that is not happening in the immediate future.


1 answer


answered 2014-03-29 06:02:31 -0500

updated 2014-03-30 17:50:30 -0500

Which version of OpenStack and which version of the RDO repo are you using? I'm merely guessing with so little detail, but as you indicate, it looks like some kind of issue between Open vSwitch and your kernel, with a runaway OVS process. It could also be related to the database or the messaging agent.

Check your qpid logs (/var/log/messages) for anything that shows a reason for a disconnect at the time you lost communication with your instances. This could reveal why there may be messaging disconnects, and whether they are caused by a messaging connection failure (an external/tertiary cause) or, the other way around, by an OVS disconnect (likely an OVS/kernel build issue).

Since RDO is "...tested on a RHEL 6.4", I would be using CentOS 6.4 minimum, rather than 6 as you state. Even better use 6.5 as there are a number of components included in the kernel, rather than patched as required with RDO.

Additional troubleshooting on your behalf is difficult without logs and details of your config, but after you have assessed this, suffice to say that there are known Neutron configuration challenges to overcome with GRE and MTU settings.

In any case for a successful OpenStack build (no matter how basic, it is complicated), you need to start with a supported and up to date build of OS, kernel and OVS. How can you be sure that you can discount "OVS/kernel version mismatch", what versions are you using?

I'd suggest you configure with the latest CentOS 6.5 and RDO, then re-post if the issue persists (with updated details, logfiles, etc.), and additionally post on the RDO forum, as you will then get the distro-specific details that you may need.

EDIT: Check the dhcp_agent.ini and dnsmasq config via these articles for MTU settings; apparently 1454 should be about right for guest instances when running GRE:
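As a rough illustration of where a figure like 1454 can come from, the sketch below subtracts the GRE encapsulation overhead from the physical MTU. The breakdown (20 bytes outer IPv4, 8 bytes GRE header with key, 14 bytes inner Ethernet, plus a 4-byte safety margin) is an assumption chosen to match the 1454 figure, not something stated in the articles. The function names are hypothetical. The `dhcp-option-force=26,<mtu>` line is the dnsmasq option for DHCP option 26 (interface MTU), which is typically placed in a file referenced by `dnsmasq_config_file` in the DHCP agent config.

```shell
#!/bin/sh
# Sketch only: estimate a guest MTU for GRE tenant networks.
# The overhead breakdown below is an assumption, not taken from the original post.
gre_guest_mtu() {
    phys_mtu="${1:-1500}"   # MTU of the physical network carrying the tunnels
    outer_ip=20             # outer IPv4 header
    gre_hdr=8               # GRE base header (4) + key field (4)
    inner_eth=14            # encapsulated Ethernet header
    margin=4                # assumed safety margin (e.g. room for a VLAN tag)
    echo $(( phys_mtu - outer_ip - gre_hdr - inner_eth - margin ))
}

# Print the dnsmasq line that would push this MTU to guests via DHCP option 26.
dnsmasq_mtu_option() {
    echo "dhcp-option-force=26,$(gre_guest_mtu "$1")"
}
```

With the default physical MTU of 1500, `gre_guest_mtu` gives 1454; on a jumbo-frame network you would pass the larger MTU as the first argument.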

Don't forget there could still be issues with MTU and GRE depending on your kernel and OVS versions, so please update your post with the versions you have; that will also help others having similar issues. On both nodes, show the results of:

uname -a

rpm -qa | grep openvswitch
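If it helps to gather this in one go on each node, a small wrapper along these lines could be used. It is only a sketch: the function name `collect_versions` is hypothetical, and the extra `ovs-vsctl --version` and `modinfo openvswitch` checks are additions beyond the two commands above, included because the userspace tools and the kernel module can come from different builds.

```shell
#!/bin/sh
# Sketch only: print kernel and Open vSwitch version details for one node.
# Each tool is checked first, so the function still runs where one is missing.
collect_versions() {
    echo "kernel: $(uname -r)"
    if command -v rpm >/dev/null 2>&1; then
        # Installed openvswitch packages (RPM-based distros such as CentOS)
        rpm -qa | grep -i openvswitch || echo "openvswitch: no RPM found"
    fi
    if command -v ovs-vsctl >/dev/null 2>&1; then
        # Userspace OVS version
        ovs-vsctl --version
    fi
    if command -v modinfo >/dev/null 2>&1; then
        # Kernel module version and the kernel it was built for
        modinfo openvswitch 2>/dev/null | grep -E '^(version|vermagic):' || true
    fi
    return 0
}
```

Run `collect_versions` on both nodes and compare the output; a `vermagic` line that does not match the running kernel would point at the mismatch discussed above.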

Also take a look at your OVS GRE flows and run some tcpdumps in the relevant qrouter namespace while you are making your large 20G transfer. This guide from RDO will help; take a look at Joe Talerico's great explanation of GRE debugging on two nodes, from 60 minutes onwards:
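The checks described here might look roughly like the sketch below. It assumes the tunnel bridge is named `br-tun` (the usual Neutron OVS agent default), and the namespace and interface are placeholders you would take from `ip netns list` and your own network layout. The `run` and `gre_debug` helpers are hypothetical names; by default they only print the commands, so nothing is executed until you set `DRY_RUN=0` and run as root.

```shell
#!/bin/sh
# Sketch only: GRE debugging commands for a Neutron/OVS node.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

gre_debug() {
    ns="$1"      # router namespace, as listed by `ip netns list`
    iface="$2"   # physical interface carrying the GRE tunnel (assumption)
    run ip netns list
    run ovs-vsctl show
    # Flows on the tunnel bridge; br-tun is assumed to be the bridge name
    run ovs-ofctl dump-flows br-tun
    # Traffic as seen inside the router namespace (ping an instance meanwhile)
    run ip netns exec "$ns" tcpdump -n -c 50 -i any icmp
    # Encapsulated traffic on the wire; IP protocol 47 is GRE
    run tcpdump -n -c 50 -i "$iface" ip proto 47
}
```

For example, `gre_debug qrouter-<uuid> eth1` lists what would be run. Comparing the capture inside the namespace with the GRE capture on the physical interface shows whether packets are being dropped before or after encapsulation.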

And finally you also need to check you aren't being affected by Generic Receive Offload config as per post #24:
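One way to check the Generic Receive Offload state is to read it from `ethtool -k` output. The helper below is a sketch (the name `gro_state` is hypothetical) that extracts the `generic-receive-offload` value from whatever is piped into it.

```shell
#!/bin/sh
# Sketch only: report the GRO setting from `ethtool -k <iface>` output on stdin.
# Prints "on", "off", or "unknown" if the feature line is not present.
gro_state() {
    awk -F': *' '
        $1 == "generic-receive-offload" {
            split($2, words, " ")   # drop trailing notes such as "[fixed]"
            print words[1]
            found = 1
        }
        END { if (!found) print "unknown" }
    '
}
```

Usage would be `ethtool -k eth0 | gro_state`, where `eth0` is a placeholder for the interface carrying the tunnel traffic. If it reports `on`, `ethtool -K eth0 gro off` turns it off until the next reboot, which is enough to test whether GRO is a factor.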



Sorry, I should have been more specific: I am on 6.5 and the latest RDO already.

I think I have narrowed it down to Glance using up all the disk/network when deploying a large (20GB Windows) instance. MTU config could explain that. I hope to investigate more over the next few weeks.

ccowley ( 2014-03-29 08:28:09 -0500 )
