Basic active/standby HA scenario issue

asked 2018-09-21 07:54:36 -0600

Ludvic

Hi! I tried a simple active/standby VM setup with keepalived in OpenStack Ocata:

                                     External machine
                                               |
                                               |
                                         Floating IP
                                               |
                                      shared fixed IP
                                           /      \
                                         /         \
                                   VM MASTER     VM BACKUP

The external machine is constantly pinging the Floating IP (which is bound to the shared IP). When I stop keepalived on the MASTER VM, keepalived on the BACKUP VM instantly takes over the shared IP on its interface, so the ping should continue seamlessly… But it doesn't. The delay is usually 20+ seconds. Note that if I ping the shared IP from a VM inside OpenStack, failover works well (and fast), so the problem is probably in the tenant neutron router.

Looking at a packet capture, it seems that the tenant router doesn't update its ARP cache when it receives the gratuitous ARPs that the newly elected MASTER VM sends to announce its MAC address. Instead, it periodically sends a unicast ARP request to the now unresponsive VM, and of course gets no reply. After 20+ seconds the tenant router finally broadcasts an ARP request to find out who owns the shared IP, and that is when connectivity with the external machine is restored.
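For reference, here is a minimal Python sketch of what a gratuitous ARP frame looks like on the wire, which may help when reading the capture. It shows one common form (a broadcast ARP request whose sender and target IP are both the shared IP); I am not claiming this is byte-for-byte what keepalived emits. The MAC and IP below are placeholders, not values from my setup.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request for `ip`."""
    src_mac = bytes.fromhex(mac.replace(":", ""))
    broadcast = b"\xff" * 6
    ip_bytes = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our MAC, EtherType 0x0806 (ARP).
    eth_header = broadcast + src_mac + struct.pack("!H", 0x0806)

    # ARP fixed part: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, opcode 1 (request).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC/IP, then target MAC (zeros here) and target IP.
    # In a gratuitous ARP the target IP equals the sender IP.
    arp_body = src_mac + ip_bytes + b"\x00" * 6 + ip_bytes

    return eth_header + arp_header + arp_body


# Placeholder values for illustration only.
frame = build_gratuitous_arp("fa:16:3e:00:00:01", "10.0.0.100")
print(len(frame), frame.hex())
```

Actually sending such a frame would need a raw socket (AF_PACKET on Linux) and root privileges, which the sketch leaves out.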

Is there any way to force the tenant router to update its ARP cache right away, when it receives the first gratuitous ARP?

Thank you for any hint that might enlighten my limited network knowledge!
