Duplicate and lost packets with DVR between instances that both have floating IPs

asked 2015-06-03 00:52:54 -0500

Setup is OpenStack Kilo on CentOS 7, using DVR with VXLAN:

# cat /etc/redhat-release 
CentOS Linux release 7.1.1503 (Core) 

# uname -r

# rpm -qa|grep neutron

Pinging between two instances on different compute nodes, using their floating IPs (both instances have one), shows high packet loss as well as duplicate replies:

[root@testy-test-3 ~]# ping -D -O
PING ( 56(84) bytes of data.
[1433306467.426536] 64 bytes from icmp_seq=1 ttl=60 time=1.71 ms
[1433306468.427374] 64 bytes from icmp_seq=2 ttl=60 time=0.682 ms
[1433306469.427393] 64 bytes from icmp_seq=3 ttl=60 time=0.632 ms
[1433306470.427428] 64 bytes from icmp_seq=4 ttl=60 time=0.669 ms
[1433306471.427428] 64 bytes from icmp_seq=5 ttl=60 time=0.669 ms
[1433306472.427368] 64 bytes from icmp_seq=6 ttl=60 time=0.631 ms
[1433306473.427391] 64 bytes from icmp_seq=7 ttl=60 time=0.666 ms
[1433306475.426710] no answer yet for icmp_seq=8
[1433306475.427644] 64 bytes from icmp_seq=9 ttl=60 time=0.865 ms
[1433306476.428339] 64 bytes from icmp_seq=10 ttl=60 time=0.614 ms
[1433306477.428342] 64 bytes from icmp_seq=11 ttl=60 time=0.630 ms
[1433306478.428254] 64 bytes from icmp_seq=12 ttl=60 time=0.510 ms
[1433306479.428308] 64 bytes from icmp_seq=13 ttl=60 time=0.571 ms
[1433306480.428366] 64 bytes from icmp_seq=14 ttl=60 time=0.645 ms
[1433306481.428444] 64 bytes from icmp_seq=15 ttl=60 time=0.697 ms
[1433306482.428349] 64 bytes from icmp_seq=16 ttl=60 time=0.620 ms
[1433306483.428365] 64 bytes from icmp_seq=17 ttl=60 time=0.623 ms
[1433306484.428282] 64 bytes from icmp_seq=18 ttl=60 time=0.551 ms
[1433306484.428307] 64 bytes from icmp_seq=18 ttl=60 time=0.564 ms (DUP!)
[1433306485.428417] 64 bytes from icmp_seq=19 ttl=60 time=0.664 ms
[1433306486.428307] 64 bytes from icmp_seq=20 ttl=60 time=0.584 ms
[1433306486.428334] 64 bytes from icmp_seq=20 ttl=60 time=0.598 ms (DUP!)
[1433306487.428529] 64 bytes from icmp_seq=21 ttl=60 time=0 ...
1 answer

answered 2015-06-05 23:39:24 -0500

ross-annetts

The problem turned out to be the bonded Ethernet interface used for br-ex on both compute nodes. Linux bonding in round-robin mode (balance-rr) was being used to connect the compute nodes to a pair of stacked switches. Changing the bonding mode to LACP (802.3ad) and enabling LACP on the switch side resolved the issue.
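For anyone checking their own nodes: the kernel reports the active mode of a Linux bond in its status file under /proc/net/bonding/. The following is only a minimal sketch. The function name bond_mode is made up for illustration, and bond0 is an assumed device name, since the bond is not named above.

```shell
#!/bin/sh
# Print the active bonding mode of a Linux bond.
# Assumption: the bond device is called bond0 (hypothetical; not named in the post).
# An alternative status file can be passed as the first argument.
bond_mode() {
    status_file="${1:-/proc/net/bonding/bond0}"
    # The status file contains a line such as:
    #   Bonding Mode: load balancing (round-robin)
    # Split on ": " and print everything after the label.
    awk -F': ' '/^Bonding Mode:/ { print $2 }' "$status_file"
}
```

Calling bond_mode on a round-robin bond should report "load balancing (round-robin)", and after the change it should report "IEEE 802.3ad Dynamic link aggregation". On CentOS 7 with network-scripts the mode is usually set through BONDING_OPTS in the bond's ifcfg file, for example BONDING_OPTS="mode=802.3ad miimon=100" (the miimon value here is illustrative, not taken from the setup above).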

