Cannot traceroute from VM1 to VM2, but can traceroute to external IP?

I've inherited an Icehouse setup and have noticed slow network access when copying files to/from certain VMs, while other VMs (in the same and different tenants) seem unaffected. To debug the issue I've been running commands such as traceroute on the affected VMs.

Inside a slow VM

Traceroute internal IP

$ traceroute 192.168.111.43
traceroute to 192.168.111.43 (192.168.111.43), 30 hops max, 60 byte packets
 1  * * *
 2  * * *
 3  * * *
 4  * * *
 5  * * *
 6  * * *
 7  * * *
 8  * * *
 9  * * *
10  * * *
11  * *^C
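
Since traceroute on Linux defaults to UDP probes, I should probably also rule out simple filtering (e.g. by security groups) before blaming the data path itself. Hypothetical follow-up checks from the same VM (not part of the output above):

$ ping -c 4 192.168.111.43        # plain ICMP echo to the other VM
$ traceroute -I 192.168.111.43    # ICMP-based traceroute, in case UDP probes are dropped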

Traceroute external IP

$ traceroute 208.43.102.250
traceroute to 208.43.102.250 (208.43.102.250), 30 hops max, 60 byte packets
 1  192.168.111.1 (192.168.111.1)  1.286 ms  1.252 ms  1.226 ms
 2  10.128.12.1 (10.128.12.1)  2.506 ms  2.230 ms  1.732 ms
 3  153.65.238.6 (153.65.238.6)  1.720 ms  1.712 ms  1.691 ms
...
...
18  173.192.18.193 (173.192.18.193)  81.912 ms 173.192.18.189 (173.192.18.189)  70.501 ms 173.192.18.193 (173.192.18.193)  74.536 ms
19  208.43.118.138 (208.43.118.138)  70.647 ms 208.43.118.134 (208.43.118.134)  70.931 ms 208.43.118.138 (208.43.118.138)  70.417 ms
20  208.43.102.250 (208.43.102.250)  70.325 ms  76.032 ms  85.983 ms

Routing

$ ip route
192.168.111.0/24 dev eth0  proto kernel  scope link  src 192.168.111.28
169.254.0.0/16 dev eth0  scope link  metric 1002
default via 192.168.111.1 dev eth0  proto static
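
Both VMs are in 192.168.111.0/24, so traffic between them should stay on the tenant network (over the GRE mesh) rather than going through the 192.168.111.1 router. A quick L2 sanity check would be something like the following (hypothetical, not captured above; arping may need to be installed):

$ ip neigh show 192.168.111.43          # is there a valid ARP/neighbour entry?
$ arping -c 4 -I eth0 192.168.111.43    # ARP-level reachability test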

Downloading a 10MB file from the internet

$ wget -O /dev/null http://208.43.102.250/downloads/test10.zip
--2015-05-20 02:24:16--  http://208.43.102.250/downloads/test10.zip
Connecting to 208.43.102.250:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11536384 (11M) [application/zip]
Saving to: “/dev/null”

100%[===============================================================================================================================================================>] 11,536,384  3.15M/s   in 4.5s

2015-05-20 02:24:21 (2.46 MB/s) - “/dev/null” saved [11536384/11536384]

Copying 50MB file from VM1 to VM2 (same subnet)

$ scp file.dat 192.168.111.33:~
file.dat                                                4% 2112KB 909.2KB/s - stalled -^C
file.dat                                                4% 2208KB 827.8KB/s   00:59 ETA
$ Killed by signal 2.

NOTE: I used Control+C to stop the above scp because it was stalling and taking a very long time to complete. These copies typically start out at around 1.5-2 MB/s but then dwindle as they continue, bouncing around 30-50 KB/s.

On other VMs that do not exhibit this problem, an scp of a 50MB file typically completes almost instantly.
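
To separate disk and ssh overhead from the network path itself, a raw TCP throughput test between the two VMs is probably the next step. A rough sketch, assuming iperf3 (or classic iperf) can be installed on both sides:

# On VM2 (192.168.111.33)
$ iperf3 -s

# On VM1
$ iperf3 -c 192.168.111.33 -t 30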

Background

  • As I understand our setup, we're using GRE tunnels with Open vSwitch
  • The nodes are a mix of CentOS 6.4 & 7.0 running OpenStack Icehouse
  • There is a mix of two OVS versions (1.10 and 2.x); I can check the exact versions if you think it's important (the commands I'd use are sketched below)
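
For reference, the commands I'd run on each node to confirm the OVS version and the GRE tunnel layout would be roughly:

$ ovs-vsctl --version    # OVS version on this node
$ ovs-vsctl show         # bridges, ports and GRE tunnel endpoints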

My suspicions

I originally thought it had something to do with GRO/TSO/etc., which seem to be the leading suggestions when you search for "network performance slow openstack" on Google. I experimented with changing these settings and did not see any noticeable difference.
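
For context, the offload changes I experimented with were along these lines (interface names vary per node and VM; this is a sketch rather than a verbatim log):

$ ethtool -k eth0 | grep -E 'segmentation-offload|receive-offload'   # current TSO/GSO/GRO state
$ ethtool -K eth0 gro off tso off gso off                            # temporarily disable offloads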

I'm very suspicious of the mixture of CentOS 6.4 & 7.0 as the bare-metal OSes in this installation, but I'm not sure how to proceed with debugging it. I'm tempted to set up Kilo and migrate away from this setup, but I'm hoping to buy myself some time before tackling that task.