Instances lose networking after a few minutes

asked 2014-10-04 12:04:05 -0500

erikmack

Repro

  1. 'nova boot' an instance (have repro'd on CirrOS 0.3.3, Ubuntu Trusty, RHEL 6.5 guests) - DHCP lease is granted for one day
  2. associate a floating IP and SSH in with keypair - happy!
  3. 'curl google.com' from instance, receive small markup blob - happy!
  4. (optional) log out SSH
  5. Wait 5-30 minutes, and SSH again (or interact with existing session)
    • Expect: more networking
    • Actual: a new SSH session fails with 'no route to host'; an existing session hangs
  6. log into console via 'nova get-vnc-console cirr01 novnc'
  7. '/etc/init.d/S40network restart' (on CirrOS, or equivalent). A new one-day DHCP lease is acquired.
  8. retry SSH, networking works again (for another 5-30 minutes)
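
One way to narrow this down at step 6, before restarting networking, is to check from the console whether the guest still has its default route. The following is only a sketch: has_default_route is a hypothetical helper name, and it assumes the guest has the 'ip' tool (busybox or iproute2).

```shell
# Hypothetical helper (not from the original post).
# Reads `ip -4 route show` output on stdin and succeeds (exit 0)
# only if a default route line is present.
has_default_route() {
    grep -q '^default '
}

# Usage from the guest console, before restarting networking:
#   ip -4 route show | has_default_route || echo "default route missing"
#   ip -4 addr show    # also check whether the address itself is still there
```

If the default route or the address is gone while the DHCP lease should still be valid, that points at the guest's address handling rather than at the tunnel or the floating IP.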

Setup Notes

  • Juno from RDO repo
  • RHEL 7
  • Manual setup from 'trunk' (draft Juno) docs (no installer)
  • three hosts - controller, network, compute
  • GRE tunnels
  • public/management networks on interface eno1, private on eno2 (dedicated switch)
  • All physical hosts are old Dell PowerEdge (2930 I think)
  • have run 'ethtool -K eno1 gro off; ethtool -K eno2 gro off; ethtool -K eno1 gso off; ethtool -K eno2 gso off' on network/compute hosts to remediate this issue with the bnx2 driver on PowerEdge: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1313591
  • have repro'd with and without vhost_net (use_virtio_for_bridges in nova.conf)
  • logs are clean for the period when networking is lost
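
To confirm that the GRO/GSO settings from the 'ethtool -K' commands above are still in effect (they do not persist across reboots unless configured to), the relevant lines can be filtered out of 'ethtool -k'. A small sketch; show_gro_gso is a hypothetical helper name, and eno1/eno2 are the interfaces from the setup notes.

```shell
# Hypothetical helper (not from the original post).
# Reads `ethtool -k <iface>` output on stdin and prints only the
# generic-receive-offload and generic-segmentation-offload lines.
show_gro_gso() {
    grep -E '^generic-(receive|segmentation)-offload:'
}

# Usage on the network/compute hosts (needs ethtool installed):
#   for nic in eno1 eno2; do echo "$nic"; ethtool -k "$nic" | show_gro_gso; done
```

Both lines should read 'off' on each interface if the workaround is active.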

Here's the 'ps faux' output for the instance (with vhost_net):

/usr/libexec/qemu-kvm -name instance-00000031 -S -machine pc-i440fx-rhel7.0.0,accel=kvm,usb=off -cpu Opteron_G3,+wdt,+skinit,+ibs,+osvw,+3dnowprefetch,+cr8legacy,+extapic,+cmp_legacy,+3dnow,+3dnowext,+pdpe1gb,+fxsr_opt,+mmxext,+ht,+vme -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 7559fbb4-98bd-4b72-857a-855cab7895c3 -smbios type=1,manufacturer=Fedora Project,product=OpenStack Nova,version=2014.2-0.4.b3.el7.centos,serial=cb73e19f-01e2-4db4-9729-1e70872ef3fe,uuid=7559fbb4-98bd-4b72-857a-855cab7895c3 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000031.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-kvm-pit-reinjection -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/nova/instances/7559fbb4-98bd-4b72-857a-855cab7895c3/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:6e:d6:04,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/7559fbb4-98bd-4b72-857a-855cab7895c3/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

Some ethtool dumps on the compute host:

[root@juno-2 ~]# ethtool -k eno1
Features for eno1:
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp6-segmentation: off ...

Comments

Were you able to resolve this problem?

larsks (2015-03-05 12:19:38 -0500)

We are seeing the same thing on our EL7 OpenStack VMs. Sometimes they'll go for weeks, then all of a sudden they lose the route. "sudo systemctl restart network" resolves the issue, but that's just a workaround.

FlakRat (2015-07-26 17:41:31 -0500)

1 answer

answered 2015-07-26 18:09:16 -0500

This post has a good explanation (in the final comment) of what may be going on: https://openstack.nimeyo.com/36778/openstack-operators-centos-instances-losing-network-gateway

The author suggests that the addition of a "valid_lft and preferred_lft timeout on the IPv4 address" is causing the issue.
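
To check whether this applies to a given guest, the lifetimes can be read from 'ip -4 addr show', which prints valid_lft and preferred_lft for each address ("forever" means no expiry). As far as I understand it, an address whose finite valid_lft runs out without being refreshed is removed by the kernel, which would match the symptoms. A rough sketch; addr_lifetimes is a hypothetical helper name, and it assumes an 'ip' that prints these fields.

```shell
# Hypothetical helper (not from the linked post).
# Reads `ip -4 addr show` output on stdin and prints
# "<address/prefix> <valid_lft>" for each IPv4 address.
addr_lifetimes() {
    awk '
        $1 == "inet"      { addr = $2 }        # remember the current address
        $1 == "valid_lft" { print addr, $2 }   # pair it with its valid lifetime
    '
}

# Usage inside the guest:
#   ip -4 addr show | addr_lifetimes
# Run it a few times; a value counting down toward 0sec that never
# resets would be consistent with the explanation above.
```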


Stats

Asked: 2014-10-04 12:04:05 -0500

Seen: 1,619 times

Last updated: Oct 04 '14