Strange self-healing Icehouse Neutron/OVS performance issue

I have a very strange network performance issue with our Icehouse installation using Neutron with OVS. We have 11 compute nodes with a combined controller/network node and a storage node for Cinder, all running on CentOS 6. External switches are all gigabit.

When a new instance is launched, often the network performance to nodes outside of OpenStack is in the 5 - 25Mb/s range (per iperf). Performance between instances is 500Mb/s - 1Gb/s consistently, whether to another instance on the same or a different compute node. This happens with instances on multiple compute nodes.

Where this gets strange is that if a new instance is left for a while (overnight, or maybe a day or more - I haven't been able to gather accurate figures), the performance will jump to normal, expected levels (i.e. up to 1Gb/s to the external node, as it already is to the internal ones). Once the problem on a given instance has resolved itself, it doesn't seem to ever manifest itself again.

A long time ago I forced the MTU to 1400 to allow for the GRE tunnel overhead (which resolved an earlier performance issue), and I've tried turning off offloading on the various NICs in play, to no avail. Running tcpdump against the tap devices of a "good" and a "bad" instance while iperf is running shows nothing obviously wrong (to me).
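For reference, here is the arithmetic behind the 1400 figure as a rough sketch. It assumes a 1500-byte MTU on the physical network, an outer IPv4 header without options, and a GRE header that carries the optional 4-byte key field; the constant and function names are mine, and I haven't confirmed the exact header layout on the wire.

```python
# Per-packet overhead added by GRE encapsulation, in bytes.
# These values are assumptions about this deployment, not measurements.
OUTER_IPV4_HEADER = 20      # outer IPv4 header, no options
GRE_BASE_HEADER = 4         # mandatory GRE header fields
GRE_KEY_FIELD = 4           # optional key field (assumed present)
INNER_ETHERNET_HEADER = 14  # the instance's untagged Ethernet header


def max_inner_mtu(physical_mtu: int = 1500) -> int:
    """Largest instance MTU that fits in one physical frame after encapsulation."""
    overhead = (
        OUTER_IPV4_HEADER
        + GRE_BASE_HEADER
        + GRE_KEY_FIELD
        + INNER_ETHERNET_HEADER
    )
    return physical_mtu - overhead


if __name__ == "__main__":
    limit = max_inner_mtu(1500)
    print(f"Maximum instance MTU without fragmentation: {limit}")
    print(f"Headroom with the MTU forced to 1400: {limit - 1400} bytes")
```

Under those assumptions 1400 leaves comfortable headroom, which is why I don't currently suspect fragmentation on the tunnels.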

Here are the results of some tests I performed between 3 test instances and a server outside of OpenStack (jerez). As can be seen, test1 (on the nova7 compute node) is behaving quite well, but test2 (also on nova7) and test3 (on nova8) are extremely slow when communicating outside of OpenStack.

Source/Dest     test1 (nova7)   test2 (nova7)   test3 (nova8)   jerez
test1 (nova7)   N/A             584.000         759.667         896.333
test2 (nova7)   512.333         N/A             702.333         21.800
test3 (nova8)   717.000         739.667         N/A             15.133
jerez           394.000         1.640           1.943           N/A

Note: Speeds are in Mb/s; each figure is the average of 3 iperf tests.
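To make the pattern in the table easier to see, here is a small sketch that flags the slow paths. It reads each row as the source and each column as the destination, and the 100 Mb/s cut-off is an arbitrary threshold chosen only for illustration; the names RESULTS and slow_paths are hypothetical.

```python
# Averaged iperf results from the table above, in Mb/s.
# Keys are (source, destination), reading rows as source and columns as destination.
RESULTS = {
    ("test1", "test2"): 584.000,
    ("test1", "test3"): 759.667,
    ("test1", "jerez"): 896.333,
    ("test2", "test1"): 512.333,
    ("test2", "test3"): 702.333,
    ("test2", "jerez"): 21.800,
    ("test3", "test1"): 717.000,
    ("test3", "test2"): 739.667,
    ("test3", "jerez"): 15.133,
    ("jerez", "test1"): 394.000,
    ("jerez", "test2"): 1.640,
    ("jerez", "test3"): 1.943,
}


def slow_paths(results, threshold=100.0):
    """Return (source, destination, speed) for paths below threshold, slowest first."""
    slow = [(src, dst, speed) for (src, dst), speed in results.items() if speed < threshold]
    return sorted(slow, key=lambda item: item[2])


if __name__ == "__main__":
    for src, dst, speed in slow_paths(RESULTS):
        print(f"{src} -> {dst}: {speed:.3f} Mb/s")
```

Every path it flags has jerez at one end and test2 or test3 at the other, in both directions; all instance-to-instance paths and both test1/jerez paths are above the threshold.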


I'd appreciate any suggestions on how to resolve or debug this, as I'm stumped at this point and my users are becoming disgruntled!

