Odd ARP behaviour in provider network.

asked 2014-09-30 04:43:16 -0500

Krist

updated 2014-09-30 04:45:16 -0500

I am having a rather odd issue with a provider network I set up. I think there is some very weird ARP issue going on, but I don't know how to properly diagnose or solve it.

Our setup is OpenStack Havana on RHEL 6.5.

I've defined a network "service", which is an external provider network of type local. I have an L3 agent configured so that this network is mapped to a particular interface on the node this L3 agent runs on. Via this network I want to make it possible for instances to connect to servers in the service network. In this service network there are two hosts (these are not OpenStack hosts, they exist entirely outside our stack). These are named cds1 and cds2.

When I create a router in OpenStack and use it to connect a tenant network to this service network, I run into problems where sometimes the instances that I start in this tenant network can see both cds1 and cds2, sometimes they only see one, and sometimes they see neither. It's always consistent: if one node doesn't see cds1, none of the others see it either. If one node sees it, the others see it too.

In order to investigate I logged in to the network node, ran ip netns exec qrouter-<id> bash, and did some tests.

Firstly, I noted that the router instance was correctly linked to each network. I could ping all openstack instances in networks connected to it without any problem. The interface connecting us to the physical service network is also properly configured and connected.

Then I did the following test:

  • Ping qrouter -> cds1: fails.
  • Ping qrouter -> cds2: fails.
  • Ping cds1 -> qrouter: first 5 to 10 pings fail, then it starts succeeding.
  • Ping qrouter -> cds1: now succeeds

So the odd thing is that once a ping from cds1 to the router succeeds, the ping from the router to cds1 succeeds too, and all OpenStack nodes that should see cds1 start seeing it again. I notice the same with cds2. I further noticed that clearing the ARP cache on the router makes pings fail again, from both sides, and that then clearing it on the server as well makes pings work again after a while. Trying again after a few minutes, pings fail again.
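To make the state of the ARP cache easier to compare between runs, the neighbour table in the router namespace can be summarised with a small script. This is only a rough sketch: it assumes the usual `ip neigh show` output format (address first, state last), and the addresses, MAC and the interface name `qg-example` in the sample are placeholders, not values from my setup.

```python
# States that mean ARP resolution is not currently working for an entry.
UNRESOLVED_STATES = {"INCOMPLETE", "FAILED"}


def parse_neigh(output):
    """Parse the text printed by `ip neigh show` into a list of dicts."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The address is the first field and the state is the last one.
        entry = {"ip": fields[0], "dev": None, "lladdr": None, "state": fields[-1]}
        if "dev" in fields[:-1]:
            entry["dev"] = fields[fields.index("dev") + 1]
        if "lladdr" in fields[:-1]:
            entry["lladdr"] = fields[fields.index("lladdr") + 1]
        entries.append(entry)
    return entries


def unresolved(entries):
    """Return the addresses whose neighbour entry has no usable MAC."""
    return [e["ip"] for e in entries if e["state"] in UNRESOLVED_STATES]


# Placeholder sample; real input would come from running
#   ip netns exec qrouter-<id> ip neigh show
# on the network node.
SAMPLE = """\
192.0.2.11 dev qg-example lladdr 52:54:00:aa:bb:01 STALE
192.0.2.12 dev qg-example  FAILED
"""

if __name__ == "__main__":
    print(unresolved(parse_neigh(SAMPLE)))  # prints ['192.0.2.12']
```

Running this before and after clearing the cache would show whether the entries for cds1 and cds2 flip between a resolved state and FAILED or INCOMPLETE.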

Then on a hunch I added static, permanent ARP entries in the qrouter namespace, tying the IP addresses of cds1 and cds2 to their correct MAC addresses. With these static entries all problems go away. I can reliably ping in both directions, and traffic gets routed correctly from the tenant networks to the service network.

This all points to some ARP weirdness. However, I'm at a loss as to how to find out what is messing things up. Any ideas?
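For reference, the static entries can be expressed roughly as follows. This is a sketch that only builds the `ip neigh replace ... nud permanent` command lines and does not run them; the IP addresses, MAC addresses and the interface name `qg-example` are placeholders I made up for illustration, and `qrouter-<id>` stands for the real namespace name.

```python
def static_arp_commands(namespace, device, hosts):
    """Return one argv list per host that pins its IP to its MAC.

    `hosts` maps IP address -> MAC address. Nothing is executed here;
    the caller decides whether to run the commands (root is required).
    """
    commands = []
    for ip, mac in sorted(hosts.items()):
        commands.append([
            "ip", "netns", "exec", namespace,
            "ip", "neigh", "replace", ip,
            "lladdr", mac, "dev", device, "nud", "permanent",
        ])
    return commands


# Placeholder values, not the real addresses of cds1 and cds2.
EXAMPLE_HOSTS = {
    "192.0.2.11": "52:54:00:aa:bb:01",  # stands in for cds1
    "192.0.2.12": "52:54:00:aa:bb:02",  # stands in for cds2
}

if __name__ == "__main__":
    for argv in static_arp_commands("qrouter-<id>", "qg-example", EXAMPLE_HOSTS):
        print(" ".join(argv))
```

Entries added this way do not survive the namespace being recreated, so this is a workaround rather than a fix.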
