
Strange self-healing Icehouse Neutron/OVS performance issue

asked 2015-04-08 08:03:11 -0500

pgSnake

Hi

I have a very strange network performance issue with our Icehouse installation using Neutron with OVS. We have 11 compute nodes with a combined controller/network node and a storage node for Cinder, all running on CentOS 6. External switches are all gigabit.

When a new instance is launched, often the network performance to nodes outside of OpenStack is in the 5 - 25Mb/s range (per iperf). Performance between instances is 500Mb/s - 1Gb/s consistently, whether to another instance on the same or a different compute node. This happens with instances on multiple compute nodes.

Where this gets strange is that if a new instance is left for a while (overnight, or maybe a day or more - I haven't been able to gather accurate figures), the performance jumps up to normal, expected levels (i.e. up to 1Gb/s to the external node, as it already is to the internal ones). Once the problem has resolved itself on a given instance, it never seems to manifest again.

A while back I forced the MTU to 1400 to allow for the GRE tunnel overhead (which resolved an earlier performance issue), and I've tried turning off offloading on the various NICs in play, to no avail. Monitoring a "good" and a "bad" instance running iperf, with tcpdump against the tap devices, shows nothing obviously wrong (to me).
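
(ethtool is the usual tool for toggling those offloads; a minimal sketch of that kind of change follows, with illustrative interface names rather than the exact ones in play here.)

    #!/usr/bin/env python
    # Sketch: turn off common NIC offloads with ethtool on a set of interfaces.
    # Interface names are illustrative - substitute the physical NICs, bridge
    # and tap devices actually involved. Requires root and ethtool.
    import subprocess

    INTERFACES = ["eth0", "eth1"]      # hypothetical names
    OFFLOADS = ["tso", "gso", "gro"]   # segmentation/receive offloads

    for iface in INTERFACES:
        subprocess.call(["ethtool", "-k", iface])      # show current settings
        args = ["ethtool", "-K", iface]
        for feature in OFFLOADS:
            args += [feature, "off"]
        subprocess.call(args)                          # turn them off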

Here are the results of some tests I performed between three test instances and a server outside of OpenStack (jerez). As can be seen, test1 (on the nova7 compute node) is behaving quite well, but test2 (on nova7) and test3 (on nova8) are extremely slow when communicating outside of OpenStack.

Source/Dest     test1 (nova7)   test2 (nova7)   test3 (nova8)   jerez
test1 (nova7)   N/A             584.000         759.667         896.333
test2 (nova7)   512.333         N/A             702.333         21.800
test3 (nova8)   717.000         739.667         N/A             15.133
jerez           394.000         1.640           1.943           N/A

Note: Speeds in Mb/s, average of 3 iperf tests
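
(For anyone reproducing these numbers: each figure is a simple average of three TCP iperf runs per source/destination pair. A rough sketch of how such a matrix can be collected is below; it assumes iperf 2.x running in server mode on every target and passwordless SSH to every source host, which may not match the exact setup used here.)

    #!/usr/bin/env python
    # Sketch: collect an iperf TCP throughput matrix between a set of hosts.
    # Assumes "iperf -s" is already listening on every target and that this
    # machine can SSH to every source host without a password.
    import subprocess

    HOSTS = ["test1", "test2", "test3", "jerez"]   # names from the table above
    RUNS = 3

    def one_run(src, dst):
        """Single 10-second TCP test from src to dst; returns Mb/s."""
        out = subprocess.check_output(
            ["ssh", src, "iperf", "-c", dst, "-t", "10", "-y", "C"])
        # CSV output (-y C): the last field is the measured bits per second
        return float(out.decode().strip().split(",")[-1]) / 1e6

    for src in HOSTS:
        for dst in HOSTS:
            if src == dst:
                continue
            avg = sum(one_run(src, dst) for _ in range(RUNS)) / RUNS
            print("%s -> %s: %.3f Mb/s" % (src, dst, avg))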

I'd appreciate any suggestions on how to debug or resolve this, as I'm stumped at this point and my users are becoming disgruntled!

Thanks, Dave.


Comments

Hi Dave,

Is the iperf test done using TCP or UDP?

If you take a capture, do you see any issues along the path (instance, compute OVS, tunnel, network-node OVS)?

Charles Benon (2015-04-08 09:36:06 -0500)

Hi. The tests above used TCP. With UDP I seem to get around 1Mb/s everywhere, and I also sometimes see one or both of the following messages in the output: "WARNING: did not receive ack of last datagram after 10 tries." / "read failed: No route to host". That doesn't happen with TCP, though.

pgSnake (2015-04-08 10:03:03 -0500)

I didn't see any (obvious) issues capturing the tap devices on the nova node. I'll try to capture from elsewhere too and see if anything shows up. Thanks.

pgSnake (2015-04-08 10:04:24 -0500)

Update: the UDP iperf errors were caused by a firewall config issue. Still getting just 1Mb/s with UDP once that's fixed, though.

pgSnake (2015-04-08 10:28:34 -0500)
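
(One thing worth checking on the UDP figure: iperf only offers 1 Mbit/s in UDP mode unless a target bandwidth is given with -b, so a flat 1Mb/s result can simply be the default send rate rather than a path problem. A quick sketch, with an illustrative hostname:)

    #!/usr/bin/env python
    # Sketch: UDP iperf test with an explicit target bandwidth. Without -b,
    # iperf's UDP mode sends at only 1 Mbit/s by default. Hostname illustrative.
    import subprocess

    subprocess.call(["iperf", "-c", "jerez", "-u", "-b", "500M", "-t", "10"])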

What are the results if:

  • Same compute node - Same tenant network
  • Same compute node - Different tenant network
  • Different compute node - Same tenant network
  • Different compute node - Different tenant network
  • Instance run from local storage (not cinder)

Do you use cgroups?

Charles Benon (2015-04-08 11:03:17 -0500)

1 answer


answered 2015-04-22 04:06:04 -0500

pgSnake

For the benefit of others, this issue was eventually traced to a layer 3 switch outside of OpenStack that had overflowed its ARP cache. New entries were not being cached, which caused the slow performance; as old entries aged out, space eventually became free for the new ones, at which point performance returned to normal for that instance. The issue was resolved by reconfiguring the switch with a larger ARP cache.
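
(If anyone hits the same symptom on a Linux box acting as the L3 device, neighbour-table pressure can be estimated by comparing the number of ARP entries with the kernel's gc_thresh limits; a rough sketch is below. A hardware switch, as in this case, keeps its own vendor-specific ARP table limits and counters instead.)

    #!/usr/bin/env python
    # Sketch: compare the kernel ARP/neighbour table size with its GC thresholds.
    # Only meaningful on a Linux host routing the traffic; a hardware L3 switch
    # exposes its own (vendor-specific) ARP table statistics.
    def read_int(path):
        with open(path) as f:
            return int(f.read().strip())

    # /proc/net/arp has one header line, then one line per cached entry
    with open("/proc/net/arp") as f:
        entries = sum(1 for _ in f) - 1

    thresh = [read_int("/proc/sys/net/ipv4/neigh/default/gc_thresh%d" % n)
              for n in (1, 2, 3)]

    print("ARP entries: %d" % entries)
    print("gc_thresh1/2/3: %d / %d / %d" % tuple(thresh))
    if entries >= thresh[2]:
        print("Neighbour table is at its hard limit; new entries may be dropped.")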

