Upgrade single-control environment to high availability

asked 2018-03-08 09:55:35 -0600

eblock

Hi experts,

I have an existing cloud (Ocata) that grew from a demo into a production environment. Now I have to find a way to make it highly available, starting with the control node.

The plan is to leave the existing single controller up and running while I configure two new servers in HA mode with the Pike release. Two main aspects are causing headaches: database and networking. I believe the database part is tricky but manageable: stop MySQL at some point and dump the DB, then import it into the new control node(s) (maybe on shared storage) running a Galera cluster, and hope that it works.
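A rough sketch of that dump-and-import step (service names, hostnames, and paths are placeholders for your environment):

```shell
# On the old controller: stop the API services so nothing writes
# mid-dump, then take a consistent dump of all OpenStack databases.
systemctl stop apache2 nova-api neutron-server   # adjust to your services
mysqldump --all-databases --single-transaction \
    --routines --triggers > openstack-backup.sql

# Copy the dump to the first node of the new Galera cluster and
# import it there before the other cluster nodes join (they will
# then pull the data via SST).
scp openstack-backup.sql new-ctl1:/root/
ssh new-ctl1 "mysql < /root/openstack-backup.sql"
```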

But what about neutron and the self-service networks with all the virtual routers etc.? Is it even possible to recreate the neutron environment on a different node? I read the guide on how to make neutron HA if you start from scratch, but is my approach realistically possible?

I would really appreciate any insights from you guys. Is there maybe someone who has done this and could comment on my approach?

Thank you in advance for any help!


2 answers


answered 2020-07-15 02:15:49 -0600

eblock

We finally got it working, thanks again @Peter Slovak for your thoughts! I'll share the key aspects of the setup:

We created a pacemaker cluster consisting of two control nodes; that part is quite straightforward. The tricky part is the lacking documentation of OpenStack HA: I read all kinds of deployment guides and blog posts to piece together the main concepts.

  • Database: MariaDB Galera cluster with a third tiebreaker node (not part of pacemaker, running only garbd).
  • Neutron: We needed to switch from linuxbridge to openvswitch (we don't use DVR for now). The solution is to edit the neutron table networksegments and set the appropriate values for network_type, physical_network and segmentation_id to fit the actual (new) setup. For example, we changed all self-service networks from type vlan to vxlan and removed the physical_network, since that is all supposed to be handled by OVS. The provider networks got their br-provider entry (we just called it provider) and are now of type vlan instead of flat.
  • Migration: To actually migrate an instance to the new environment (luckily our storage backend is Ceph), we shut it down in the old cloud, change its host and node in the nova.instances table and then run nova reboot --hard <UUID> in the new cloud. This creates a new virsh XML configuration on a new compute node, but since the old network config is still present in nova.instance_info_caches, the compute node builds a linuxbridge device (brq...) instead of an OVS interface (qbr...). In this case it can help to simply shut down and restart the instance, but mostly that didn't work. The easiest way is to detach and re-attach the respective interface(s) of that instance; this triggers the compute node to recreate the instance's interface(s) with OVS. After restarting, the VM should be reachable on its designated IP.
  • Router: Be aware that you need to shut down (disable) routers in the old environment to prevent the old control or network node from responding to requests. Otherwise this leads to unreachable floating IPs for instances in self-service networks.
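A sketch of the networksegments change and the per-instance migration described above. Table and column names come from the Neutron/Nova schemas; UUIDs, hostnames and the segmentation ID are placeholders:

```shell
# Neutron DB: convert a self-service network from vlan to vxlan
# (physical_network must be NULL for vxlan segments).
mysql neutron -e "UPDATE networksegments \
    SET network_type='vxlan', physical_network=NULL, segmentation_id=42 \
    WHERE network_id='<NETWORK_UUID>';"

# Nova DB: point the instance at a compute node in the new cloud.
mysql nova -e "UPDATE instances \
    SET host='new-compute1', node='new-compute1' \
    WHERE uuid='<INSTANCE_UUID>';"

# In the new cloud: rebuild the domain XML on the new compute node.
nova reboot --hard <INSTANCE_UUID>

# If the instance still comes up with a linuxbridge interface,
# detach and re-attach its port so it is recreated with OVS.
nova interface-detach <INSTANCE_UUID> <PORT_UUID>
nova interface-attach --port-id <PORT_UUID> <INSTANCE_UUID>
```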

This did the trick for us. We are currently migrating instances one by one and have both environments active, so in case anything breaks we can still move the already migrated instances back to the old cloud. It's working for now and we're quite happy with it. The new environment runs OpenStack Train on openSUSE Leap 15.1.


answered 2018-03-08 17:55:17 -0600

A challenging task. I haven't done this myself, though I have experience with OpenStack HA setup "by design". Let's break it down by service type and network components, assuming you're familiar with the OpenStack HA guide and have an idea of how to configure the individual services:

  • Database - Bootstrapping a Galera cluster from a standalone node is a relatively uncomplicated task, with only a short downtime needed. You put the cluster settings in the config, stop the service, then start it with service mysql bootstrap (assuming systemd on Ubuntu 16.04) or the more versatile variant service mysql start --wsrep-new-cluster. Then you start the other nodes, wait for SST (initial data transfer) and then restart the first node, just so that it runs in "normal" mode and not in "bootstrap" mode, vaguely said. Have a look at this writeup, for example. WARNING: As always, make a backup before taking any action. Bootstrapping against an empty node by mistake WILL wipe out your database.
  • Message queue - This one also requires some downtime, but you face almost no danger since OpenStack doesn't keep any long-lived persistent data in there. Just configure RabbitMQ according to the HA guide and make a cluster out of the controllers' instances.
  • Corosync - Virtual IP - The Virtual IP will be a new element in your cluster, so no breaking change here. Again, the HA Guide has you covered. It's basically an IP address that's always up and present on whichever controller is working.
  • HAProxy and all the "listening" services with it - Although it is possible to just run on the Virtual IP, I strongly recommend putting a traffic balancer between the Virtual IP and the other services. The HA Guide's HAProxy config is a very good start, but you'll have to reconfigure all the services that listen on all interfaces by default to listen only on the controller's management IP address. This is because HAProxy will be listening on the Virtual IP address, which would be impossible if the port were already taken. You'll need to search through each project's config reference; look for the keywords "bind" or "listen" in general (e.g. for Nova, it's osapi_compute_listen and metadata_listen for its two APIs).
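The bootstrap and listen-address steps above might look roughly like this. All IPs, hostnames, and file paths are placeholders, assuming the Ubuntu 16.04 layout mentioned earlier:

```shell
# --- Galera bootstrap on the first (formerly standalone) node ---
# BACK UP FIRST: bootstrapping against an empty node wipes the data.
mysqldump --all-databases --single-transaction > /root/pre-ha-backup.sql
service mysql stop
service mysql start --wsrep-new-cluster   # first node only
# On the remaining controllers, a plain start triggers SST:
#   service mysql start

# --- RabbitMQ cluster across the controllers ---
# Run on each joining node, after copying the first node's
# /var/lib/rabbitmq/.erlang.cookie to it:
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@controller1
rabbitmqctl start_app
# Mirror all queues across the cluster, as in the HA guide:
rabbitmqctl set_policy ha-all '^' '{"ha-mode":"all"}'

# --- Bind services to the management IP, not 0.0.0.0 ---
# e.g. in /etc/nova/nova.conf on each controller, so HAProxy can
# take the same ports on the Virtual IP:
#   [DEFAULT]
#   osapi_compute_listen = 192.168.0.11
#   metadata_listen = 192.168.0.11
```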

And then there's Neutron. This change is basically independent of your effort to make management highly available, so I advise you not to do it in the same run; concentrate on one thing and do it well.

The same goes for the OpenStack upgrade by the way - my personal approach would be: upgrade first, then do the management HA setup, then fiddle with Neutron.

In general, introducing a new network scenario to Neutron doesn't work without downtime (planned or not, sadly). I advise you to go through:



Thanks for your answer! Just to clarify: I have 2 new servers that I want to set up in HA mode, and the existing control node will be replaced by these two. We use linuxbridge in our environment, so this is the way to go.

eblock ( 2018-03-09 01:41:37 -0600 )

We already have existing self-service networks, although most VMs run in provider networks, so downtime is not a deal breaker. But just to clarify: after configuring neutron according to the HA guide, will I have to recreate all existing virtual routers? What about existing ports etc.?

eblock ( 2018-03-09 01:44:20 -0600 )

Oh, okay. However take into account that with Linux Bridge + VRRP, all the north-south traffic (be it from a fixed IP or a floating IP) passes through a network node which adds latency. It also introduces an additional point of failure for your instances' traffic, despite being HA.

Peter Slovak ( 2018-03-09 03:45:33 -0600 )

OVS + DVR, on the other hand, directs N-S traffic with a floating IP directly to/from the compute nodes. The SPOF there used to be SNAT traffic from a fixed IP, which still passes via the network nodes, but the VRRP enhancement solves this. Note that a network node and a controller can be one physical node.

Peter Slovak ( 2018-03-09 03:48:10 -0600 )

I'm not 100% sure about the router recreation, but according to this you have to shut the router down when adding VRRP. I would at least migrate the router to another node to make sure the L3 agent recreates it from scratch.

Peter Slovak ( 2018-03-09 03:51:57 -0600 )
