OpenStack high availability: Pacemaker or Keepalived?

asked 2015-11-09 09:59:42 -0600

Satyanarayana Patibandla


I am planning to set up high availability for the controller node, and for that I am using haproxy. There are two options available: either haproxy/pacemaker/corosync or haproxy/keepalived. Could you please let me know which is the best option for a production deployment?



@satya, personally, I have experience only with haproxy/keepalived, so I cannot address the question.

dbaxps ( 2015-11-09 11:41:16 -0600 )

However, I have a question: do you have experience with rejoining a node into a 3-node mariadb-galera-server multi-master synchronous replication cluster, after the node crashed due to hardware failure and is now ready to be brought back into the replication system? I am a former Informix DBA and have worked on similar projects in the US.

dbaxps ( 2015-11-09 11:46:30 -0600 )

@dbaxps Did only one node fail, or did the whole cluster fail and go out of sync? If only one node failed, it is as simple as starting mysql on that node; it will sync with the other two nodes. Even if there are differences, the final state of the database will be the one present on the majority.

capsali ( 2015-11-10 05:48:21 -0600 )

If all nodes in the cluster failed, then you need to recreate the cluster, starting the first node with the --wsrep-new-cluster option and then joining the other nodes to it. But I guess you already knew that!
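The recovery described above can be sketched roughly as follows, assuming a systemd-based MariaDB Galera install (paths and the safe_to_bootstrap step vary by version and distribution, so treat this as illustrative, not the author's exact procedure):

```shell
# On each node, inspect the Galera state file and find the node with the
# highest seqno -- that node has the most recent data and should bootstrap.
cat /var/lib/mysql/grastate.dat

# On that most-advanced node only, mark it bootstrappable and start a
# new cluster (galera_new_cluster is the systemd-era equivalent of
# starting mysqld with --wsrep-new-cluster).
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
galera_new_cluster

# On the remaining nodes, a normal start is enough; they will state-
# transfer (SST/IST) from the bootstrapped node and rejoin.
systemctl start mariadb
```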

capsali ( 2015-11-10 05:51:29 -0600 )

2 answers


answered 2015-11-10 05:45:45 -0600

capsali

In my understanding, haproxy/pacemaker/corosync is an active/passive type of HA, and haproxy/keepalived is active/active HA.

It really depends on the way you want to go. I personally haven't used active/passive until now. In our deployment we have a haproxy/keepalived type of HA.

There are advantages and disadvantages to both of them.

A disadvantage of the active/passive approach is that one node/service stays in passive mode (not active) until the active one malfunctions and the passive one takes over. There is no load balancing either, since only one service is active at a given time.

In the active/active scenario, all nodes/services of the same type run simultaneously. We use haproxy for load balancing and keepalived for failover of haproxy itself. This method is a little more complicated to set up, but you run at full capacity and load-balance the workload.
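The keepalived side of this setup can be sketched like so: keepalived holds a virtual IP via VRRP and moves it to the standby node if haproxy dies. The interface name, VIP, and password below are illustrative, not taken from the deployment described here:

```conf
# /etc/keepalived/keepalived.conf on the primary haproxy node (sketch)
vrrp_script chk_haproxy {
    script "pidof haproxy"      # mark this node faulty if haproxy dies
    interval 2
}

vrrp_instance VI_1 {
    state MASTER                # BACKUP on the standby node
    interface eth0
    virtual_router_id 51
    priority 101                # lower (e.g. 100) on the standby
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.168.0.10/24         # the VIP that clients and services use
    }
    track_script {
        chk_haproxy
    }
}
```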

Of course, this scenario has its disadvantages too, in that not all services work in an active/active state. For example:

- l3_agent (although I highly recommend Neutron DVR, where all traffic, be it north-south or east-west, is handled on the compute node the instance is running on, bypassing the network node, providing load balancing and eliminating a single point of failure);
- cinder-volume (only works in active/passive mode, by setting the same host parameter in cinder.conf on every node);
- mysql (in a master/master replica it is wise, and generally suggested, that services write to only one mysql host, to prevent db locking and divergence between the cluster hosts; so in a 3-node galera cluster, only one host is active while the other two are passive and used as backups in haproxy).
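The Galera pattern described above (one writable host, the other two as standbys) maps onto haproxy's `backup` keyword. A minimal sketch, with hostnames and IPs purely illustrative:

```conf
# haproxy.cfg fragment: controller1 takes all mysql traffic; the nodes
# marked "backup" only receive connections if it fails its health check.
listen galera
    bind 192.168.0.10:3306
    mode tcp
    option tcpka
    server controller1 192.168.0.11:3306 check
    server controller2 192.168.0.12:3306 check backup
    server controller3 192.168.0.13:3306 check backup
```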

Also keep in mind that all requests go through the haproxy node (be they OpenStack service requests, client API requests, or mysql db requests).

That being said, we have been happy with our active/active HA scenario for over a year now. There were a few hiccups along the way (the primary one being that the mariadb galera cluster wouldn't play nice with haproxy, because OpenStack services kept connections open too long, but that has been resolved in the meantime).
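One common mitigation for that kind of long-lived-connection hiccup (an assumption on my part, not necessarily the fix used here) is raising the idle timeouts on the mysql listener so haproxy does not cut connections that OpenStack services keep open; the values are illustrative:

```conf
# haproxy.cfg fragment: generous idle timeouts on the Galera listener
listen galera
    bind 192.168.0.10:3306
    mode tcp
    timeout client 90m
    timeout server 90m
```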



When I go through
I can see a command:

openstack-config --set /etc/neutron/neutron.conf DEFAULT l3_ha True

Doesn't it mean that I should forget about DVR on the current Liberty release?

dbaxps ( 2015-11-10 06:48:25 -0600 )

Well, it depends on what you want to achieve. With the l3_agent in HA mode the configuration stays the same, meaning all traffic passes through the network node, both external traffic (SNAT/DNAT) and inter-VM traffic as well. This can become a bottleneck from a network point of view.

capsali ( 2015-11-10 07:10:47 -0600 )

As far as I know, when I looked for l3_ha docs back in the Juno release I think, it was more of a failover feature for the l3_agent. Routing would be done through one network node, and in case that l3_agent stops, the other one takes over, so no load balancing. I don't know if this has changed!

capsali ( 2015-11-10 07:12:58 -0600 )

In DVR mode, l3 and l2 routing are moved to the compute nodes. So when we associate a FIP with an instance, DNAT is handled by the compute node the instance is running on. SNAT for instances without a FIP is still handled by the l3_agent on the network node!

capsali ( 2015-11-10 07:15:09 -0600 )
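The DVR layout described in these comments can be sketched in the same openstack-config style as the l3_ha command quoted earlier; file paths follow the stock layout of that era, and agent placement is per the comments above:

```shell
# On the controller (neutron-server): enable distributed routers by default
openstack-config --set /etc/neutron/neutron.conf DEFAULT router_distributed True

# On compute nodes (the l3 agent runs here under DVR, handling FIP DNAT)
openstack-config --set /etc/neutron/l3_agent.ini DEFAULT agent_mode dvr

# On the network node (still handles default SNAT for instances without FIPs)
openstack-config --set /etc/neutron/l3_agent.ini DEFAULT agent_mode dvr_snat
```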

answered 2015-11-10 13:11:24 -0600

Satyanarayana Patibandla

Thanks for your suggestions. I will evaluate all the suggestions mentioned above. If we use DVR, I have come to know that there are problems with the use of VLANs, IPv6, Floating IPs, high north-south traffic scenarios, and large numbers of compute nodes. Could you please let me know whether you have used DVR in production? Did you face any of the issues mentioned above?



The core issue: a 3-node HA controller cluster (HAProxy/keepalived) uses VRRP, i.e. it ultimately requires creating an HA Neutron router. The command neutron l3-agent-list-hosting-router RouterHA will return all three members of your multi-master synchronous Galera replica.

dbaxps ( 2015-11-10 13:46:51 -0600 )

The latter IS NOT compatible with DVR at the moment. See

dbaxps ( 2015-11-10 13:48:58 -0600 )

We use DVR in production, but:

1. VLAN is used for the provider network; tenant networks use vxlan (more convenient for us).

2. We do not use IPv6, but I didn't see any complaints about DVR besides the general ones.

capsali ( 2015-11-10 14:27:45 -0600 )

3. There were problems in Juno DVR with FIPs: if you disassociated a FIP from an instance on one compute node and associated it with another instance on another compute node, neutron wouldn't pick that up and rendered the FIP useless until an ovs-cleanup. That has been resolved since then.

capsali ( 2015-11-10 14:28:57 -0600 )

3.1. Another problem was with FIP namespaces that kept changing after a node reboot, but that was fixed as well.

4. We have a constant 25-50 MB/s throughput from about a dozen instances almost permanently, so I don't think high north-south traffic is a problem.

capsali ( 2015-11-10 14:31:54 -0600 )



Seen: 8,323 times

Last updated: Nov 10 '15