Why is openvswitch forwarding ARP packets between external and internal bridge?

asked 2014-10-01 02:20:14 -0600

Krist

In an attempt to spread the load over multiple L3 agents on multiple hosts I inadvertently created an ARP broadcast storm on our network. The only explanation I have for this is that somehow ARP packets jumped from an external to the tunnel or integration bridge, travelled around via the GRE tunnel mesh and then jumped back. I wonder why this could happen.

The setup:

- We have Havana, on RHEL 6.5.
- All nodes (network, service and compute) are connected to the same physical layer network. We have VLANs to separate management from storage and tenant traffic.
- Tenant traffic itself is encapsulated in GRE tunnels.
- We have separate networks for tenant traffic, for external traffic and management traffic, as well as one for a service network.
- Three network nodes. They are connected to all networks. I have configured them as follows:
  - Use the tenant network interface for the GRE tunnels to the compute nodes.
  - One external OVS bridge with an interface in the external network connected to it, and one bridge with an interface in the service network: br-ext and br-serv.
  - Two L3 agents, with gateway_external_network_id set to the IDs of the external, resp. the service network, and each with its external_network_bridge parameter set to the right bridge.
  - An additional L3 agent for internal-only routers.
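To make the agent layout above easier to compare against, here is a minimal sketch in Python that renders the three L3 agent configurations as described. The file names and network UUIDs are hypothetical placeholders, and the handle_internal_only_routers values are an assumption on my part (the option is not mentioned above); only gateway_external_network_id and external_network_bridge come from the setup itself.

```python
import configparser
import io

# Placeholders: the real network UUIDs are not given in the post.
EXTERNAL_NET_ID = "<external-network-uuid>"
SERVICE_NET_ID = "<service-network-uuid>"

# One entry per L3 agent; the file names are hypothetical.
AGENTS = {
    "l3_agent_ext.ini": {
        "gateway_external_network_id": EXTERNAL_NET_ID,
        "external_network_bridge": "br-ext",
        # Assumption: only one agent should take internal-only routers.
        "handle_internal_only_routers": "False",
    },
    "l3_agent_serv.ini": {
        "gateway_external_network_id": SERVICE_NET_ID,
        "external_network_bridge": "br-serv",
        "handle_internal_only_routers": "False",
    },
    "l3_agent_internal.ini": {
        # Assumption: this is the agent for internal-only routers.
        "handle_internal_only_routers": "True",
    },
}


def render(options):
    """Render one agent's options as an INI file with a [DEFAULT] section."""
    parser = configparser.ConfigParser()
    parser["DEFAULT"] = options
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    for filename, options in AGENTS.items():
        print("# " + filename)
        print(render(options))
```

Each agent is tied to exactly one bridge, so in this layout nothing at the configuration level should connect br-ext or br-serv to the tunnel side at layer 2.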

This all went well, until suddenly everything went to hell. We saw a massive ARP broadcast storm. Something was looping.

It was only after a long time that I noticed something: the ARP packets were circulating on both the external and the service network, and were also travelling encapsulated in the GRE tunnels. I powered down all but one of the network nodes and the storm went away.

So somehow packets were being forwarded on the openvswitch from br-ext to br-tun (or br-int), and probably also to br-serv, on at least one openvswitch node, and this created a loop.
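The suspected loop can be sketched as a toy graph model in Python. This is only an illustration under assumptions: the node names net1 and net2 are hypothetical, and the layer 2 link between br-ext and br-tun on each node is the assumed leak, not something confirmed. The sketch treats every bridge and network segment as a vertex and checks whether the resulting undirected graph contains a cycle.

```python
def has_l2_loop(links):
    """Return True if the undirected graph given by `links` contains a cycle.

    Uses union-find: an edge whose endpoints are already connected
    closes a cycle.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True
        parent[root_a] = root_b
    return False


def topology(leaking_nodes):
    """Build the L2 links for network nodes that (hypothetically) leak."""
    links = []
    for node in leaking_nodes:
        # br-ext sits on the shared external segment.
        links.append(("external-net", node + ":br-ext"))
        # br-tun sits on the shared GRE tunnel mesh.
        links.append((node + ":br-tun", "gre-mesh"))
        # Assumed fault: frames are bridged between br-ext and br-tun.
        links.append((node + ":br-ext", node + ":br-tun"))
    return links


one_node = has_l2_loop(topology(["net1"]))
two_nodes = has_l2_loop(topology(["net1", "net2"]))
print("loop with one network node:", one_node)
print("loop with two network nodes:", two_nodes)
```

In this model a single leaking node gives only one path between the external segment and the tunnel mesh, so there is no loop, while two leaking nodes close a cycle. That would be consistent with the storm stopping once all but one network node were powered down.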

But what could have caused this? I thought that all traffic between br-ext and br-tun/br-int would have to pass through an L3 routing instance, so ARP packets would not be forwarded. Something is seriously going wrong. Where do I start looking?


Comments

I have experienced the same thing on Icehouse with a similar setup. Compute nodes and switches got high CPU loads. As a solution, we switched to a single network node for the moment, but that doesn't scale well. With Juno's DVR, each node becomes an L3 agent and the risk of flooding may increase.

js.mouret (2014-10-30 12:37:42 -0600)