
OVS flows missing after Icehouse upgrade

asked 2014-07-10 13:22:17 -0500

jproulx

updated 2014-07-10 14:24:06 -0500

After upgrading from Havana to Icehouse, including the ovs-to-ml2 migration documented at http://docs.openstack.org/openstack-o... , network service to my instances is broken. Running instances lose network and new instances fail to spawn; this same behaviour persists after rebooting the hypervisor.

Systems are Ubuntu12.04 + cloud archive and use VLAN based provider networks (primarily) and GRE based tenant overlays (rarely).

Breakage happens when a compute node is upgraded. The cause seems to be a missing flow in br-eth1. On a working Havana system I see:

# ovs-ofctl dump-flows br-eth1
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=185347.841s, table=0, n_packets=113939229, n_bytes=12298686848, idle_age=0, hard_age=65534, priority=4,in_port=9,dl_vlan=1 actions=mod_vlan_vid:2113,NORMAL
 cookie=0x0, duration=185393.285s, table=0, n_packets=18, n_bytes=3384, idle_age=65534, hard_age=65534, priority=2,in_port=9 actions=drop
 cookie=0x0, duration=185394.258s, table=0, n_packets=277410868, n_bytes=1295102884039, idle_age=0, hard_age=65534, priority=1 actions=NORMAL

but on the broken Icehouse system that first flow, which translates the internal VLAN tag (1) to the external VLAN (2113), is missing, so all traffic from my test nodes dies there (last seen on phy-br-eth1) and never makes it off the hypervisor:

# ovs-ofctl dump-flows br-eth1
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=170.158s, table=0, n_packets=20, n_bytes=4068, idle_age=1, priority=2,in_port=6 actions=drop
 cookie=0x0, duration=171.088s, table=0, n_packets=142415, n_bytes=174549125, idle_age=0, priority=1 actions=NORMAL
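For anyone comparing bridges the same way, the difference between the two dumps can be checked mechanically. This is a hedged sketch of mine (not part of the original thread) that scans `ovs-ofctl dump-flows` output for the VLAN-translation action:

```python
import re

def has_vlan_translation(dump_flows_output, external_vlan):
    """Return True if any flow rewrites traffic to the given external VLAN.

    Looks for an action like mod_vlan_vid:2113 in `ovs-ofctl dump-flows` output.
    """
    pattern = re.compile(r"mod_vlan_vid:%d\b" % external_vlan)
    return any(pattern.search(line) for line in dump_flows_output.splitlines())

# Abridged sample output from the working Havana node:
havana = """ cookie=0x0, table=0, priority=4,in_port=9,dl_vlan=1 actions=mod_vlan_vid:2113,NORMAL
 cookie=0x0, table=0, priority=2,in_port=9 actions=drop
 cookie=0x0, table=0, priority=1 actions=NORMAL"""

# Abridged sample output from the broken Icehouse node:
icehouse = """ cookie=0x0, table=0, priority=2,in_port=6 actions=drop
 cookie=0x0, table=0, priority=1 actions=NORMAL"""

print(has_vlan_translation(havana, 2113))    # True
print(has_vlan_translation(icehouse, 2113))  # False
```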

There is a WARNING buried in the logs. This is from a hard reboot of a pre-existing instance following the hypervisor reboot (port id adfbfa1c-d5ee-450f-b434-05cc4639f74d). Details at http://paste.openstack.org/show/85998/ (note I've translated to standard notation above, but eth1-br is 'correct' in my environment and was not normalized in the paste). The warning line is:

WARNING neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Device adfbfa1c-d5ee-450f-b434-05cc4639f74d not defined on plugin

Not sure what to expect in the Icehouse logs, but the Havana debug logs show lots of RPC traffic like:

DEBUG neutron.openstack.common.rpc.amqp [-] received {u'_context_roles': [u'admin'], u'_context_read_deleted': u'no', u'_context_tenant_id': None, u'args': {u'segmentation_id': 2113, u'physical_network': u'trunk', u'port': {u'status': u'ACTIVE', u'binding:host_id': None, u'name': u'', u'allowed_address_pairs': [], u'admin_state_up': True, u'network_id': u'0a1d0a27-cffa-4de3-92c5-9d3fd3f2e74d', u'tenant_id': u'6f9adccbd03e4d2186756896957a14bf', u'extra_dhcp_opts': [], u'binding:vif_type': u'ovs', u'device_owner': u'network:dhcp', u'binding:capabilities': {u'port_filter': True}, u'mac_address': u'fa:16:3e:f8:df:73', u'fixed_ips': [{u'subnet_id': u'76123b94-2ae9-40c4-a4c6-03ee98d081d9', u'ip_address': u'REDACTED'}], u'id': u'9bc9680d-2ae2-4ebd-a9dc-60ceb6fb8073', u'security_groups': [], u'device_id': u'dhcpd29260d0-57dc-5465-92ba-2051a6e8e549-0a1d0a27-cffa-4de3-92c5-9d3fd3f2e74d'}, u'network_type': u'vlan'}, u'namespace': None, u'_unique_id': u'62fd8d4f19d045fa913bbb3eaef6fc5a', u'_context_is_admin': True, u'version': u'1.0', u'_context_project_id': None, u'_context_timestamp': u'2014-07-03 17:22:42.672013', u'_context_user_id': None, u'method': u'port_update'} _safe_log /usr/lib/python2.7/dist-packages/neutron/openstack/common/rpc/common.py:276

which includes things like u'segmentation_id': 2113 and u'physical_network': u'trunk'. Notably, I don't see ...


Comments

Thanks, here's the full debug log from startup: http://paste.ubuntu.com/7780336/

jproulx ( 2014-07-11 09:04:15 -0500 )

Great question. I suggest you and @darragh-oreilly update the question and the answer as you find out more.

smaffulli ( 2014-07-11 16:46:19 -0500 )

@smaffulli: will do. @jproulx: the agent is not detecting any tap devices (except tapadfbfa1c-d5). There should be more if there are instances running. I don't understand why the procedure says to restart neutron-ovs-cleanup on the compute nodes.

darragh-oreilly ( 2014-07-14 03:45:06 -0500 )

@darragh-oreilly: this is a test node, so there is only a single running instance at this point; tapadfbfa1c-d5 is the only tap interface to find. Just back from vacation, so hopefully fresh eyes will show me what I was missing.

jproulx ( 2014-07-21 09:12:57 -0500 )

2 answers


answered 2014-07-21 13:18:02 -0500

jproulx

updated 2014-07-21 13:20:33 -0500

Turns out this issue was a neutron server misconfiguration where I should have had:

nova_admin_tenant_id = SERVICE_TENANT_ID #correct

I'd actually put:

nova_admin_tenant_id = SERVICE_TENANT_NAME #wrong

This did throw an error on the server node that I thought was unrelated, since the compute node was creating the instance and tap device:

2014-07-21 11:25:53.240 3244 ERROR neutron.notifiers.nova [-] Failed to notify nova on events: [{'status': 'completed', 'tag': u'07fac3e5-72f5-4607-a13a-a52aa5a4ed84', 'name': 'network-vif-plugged', 'server_uuid': u'56951153-c974-4480-8562-17752db74ba8'}]

However, fixing that error got the tap devices on the compute nodes properly tagged and into the right flows.
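Since both a tenant name and a tenant ID are plain strings, the config parser cannot catch this mistake: a keystone tenant ID is a 32-character hex string (or dashed UUID), while a name like 'service' is not. A small sanity check one could script (my sketch, not from the thread):

```python
import re

# Matches a keystone-style 32-hex-char ID or a dashed UUID.
HEX_ID = re.compile(r"^[0-9a-f]{32}$|^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")

def looks_like_tenant_id(value):
    """Heuristic: nova_admin_tenant_id should be a keystone ID, not a name."""
    return bool(HEX_ID.match(value.lower()))

print(looks_like_tenant_id("6f9adccbd03e4d2186756896957a14bf"))  # True  (an ID)
print(looks_like_tenant_id("service"))                           # False (a name)
```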


answered 2014-07-11 07:29:41 -0500

darragh-oreilly

I think your config is OK. It seems port adfbfa1c-d5ee-450f-b434-05cc4639f74d is no longer on the plugin; that should not be a problem. The port_update RPC was changed a lot in Icehouse, which would explain the absence of the message. I would need more of the agent log, including its startup.


Comments

So it seems I've been looking in the wrong direction. Sifting through the full log, I see the port is getting tagged '4095', which according to https://ask.openstack.org/en/question... is essentially /dev/null, used when the local agent doesn't know what to do with a port.
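As the linked answer explains, 4095 is the OVS agent's dead VLAN. A hedged sketch of mine (not from the thread) for scanning `ovs-vsctl -- --columns=name,tag list Port` output for dead-tagged ports:

```python
DEAD_VLAN = 4095

def find_dead_ports(ovs_vsctl_output):
    """Scan `ovs-vsctl -- --columns=name,tag list Port` output and return
    the names of ports tagged with the agent's dead VLAN (4095)."""
    dead, name = [], None
    for line in ovs_vsctl_output.splitlines():
        if line.startswith("name"):
            name = line.split(":", 1)[1].strip().strip('"')
        elif line.startswith("tag") and name:
            tag = line.split(":", 1)[1].strip()
            if tag == str(DEAD_VLAN):
                dead.append(name)
    return dead

# Abridged sample, in the record format ovs-vsctl prints:
sample = '''name                : "tapadfbfa1c-d5"
tag                 : 4095

name                : "int-br-eth1"
tag                 : []'''

print(find_dead_ports(sample))  # ['tapadfbfa1c-d5']
```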

jproulx ( 2014-07-21 10:44:21 -0500 )


Stats

Asked: 2014-07-10 13:22:17 -0500

Seen: 504 times

Last updated: Jul 21 '14