2

How to bring up second network node

asked 2014-10-09 08:30:22 -0500

lukas.pustina

We followed the OpenStack High Availability Guide. For all stateless services we use HAProxy. The only services that need special attention are the Neutron DHCP agent, the Neutron L3 agent, and the Neutron metadata agent, for which we are trying to use active/passive failover. Unfortunately, this does not work at all; our observations are listed below. Hopefully you guys have some insight to explain them.

Our Setup

We use Icehouse with two controllers also running the network node software stack with OVS. The primary network node is called control01 and the secondary control02. All nodes use Ubuntu 14.04 and ovs-vsctl shows a mesh connecting all network and compute nodes.

Our Observations

1. Stopping Neutron Agents does not work

If we try to stop the Neutron agents on control01, the dnsmasq processes remain, as do all DHCP and router namespaces. We check this using ip netns.

After stopping the agents, the network connectivity of running VMs is still okay.
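The namespace check from observation 1 can be sketched in a few lines of Python. This is only an illustration, assuming the usual Neutron naming where DHCP namespaces start with qdhcp- and router namespaces start with qrouter-; the function name classify_namespaces and the sample names in the usage note are made up for this example.

```python
def classify_namespaces(ip_netns_output):
    """Group namespace names from `ip netns` output by Neutron prefix.

    Assumption: Neutron names its namespaces qdhcp-<network-id> and
    qrouter-<router-id>. Anything else is reported as "other".
    """
    groups = {"dhcp": [], "router": [], "other": []}
    for line in ip_netns_output.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Newer iproute2 versions append "(id: N)" after the name,
        # so only the first field is treated as the namespace name.
        name = fields[0]
        if name.startswith("qdhcp-"):
            groups["dhcp"].append(name)
        elif name.startswith("qrouter-"):
            groups["router"].append(name)
        else:
            groups["other"].append(name)
    return groups
```

Feeding it the text printed by ip netns on each controller, before and after stopping the agents, makes it easy to compare which qdhcp- and qrouter- namespaces are left behind on control01 and which ones never appear on control02.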

2. Starting Neutron Agents on secondary does not work

If we try to start the Neutron agents on control02, only the DHCP namespaces are created, in contrast to control01. Since the configuration of both nodes is identical, we cannot understand why this happens.

I'm grateful for any insight.


Comments

When you take the L3 agent down on control01 and bring it up on control02, can you verify that Neutron sees the L3 agent running:

neutron agent-list

There are ways to do this; I just don't remember whether we have a special migration script or whether it is automatic.

mpetason ( 2014-10-09 10:50:08 -0500 )

2 answers

3

answered 2014-10-14 13:32:46 -0500

SamYaple

updated 2014-10-14 13:35:12 -0500

Hello. So let me shed some light on this for you!

Neutron has no built-in mechanism for HA routers in Icehouse and below (Juno has experimental VRRP support). When you schedule a router, it schedules to an l3_agent. You can run multiple l3_agents, but a router can only associate with a single l3_agent.

Again, there is _no_ built-in mechanism to transfer routers from one l3_agent to another. This means when an l3_agent is down, the routers are down until that agent comes back up. Period. To solve this problem, there are 3rd party scripts designed to move these routers at the database level from one l3_agent to a separate l3_agent. AT&T has one I would recommend; it works very well.

So the procedure would look like this.

network01 and network02 both have active, _different_ l3_agents.
EVENT: network01 l3_agent goes down
TRIGGERS: failover script moves routers from the failed agent to a different agent
EVENT: network01 l3_agent comes back up
(Trigger or don't; it depends on your configuration)
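The decision step of such a failover script can be sketched in Python as below. This is not the third-party script mentioned above, just a minimal illustration of the idea. It assumes agent records shaped like the ones Neutron reports (dicts with id, agent_type and alive keys), and the names plan_router_moves and routers_by_agent are hypothetical. It only computes which router should move where; it does not talk to Neutron.

```python
from itertools import cycle

# Assumption: this is the agent_type string Neutron reports for L3 agents.
L3_AGENT_TYPE = "L3 agent"


def plan_router_moves(agents, routers_by_agent):
    """Plan moves of routers from dead L3 agents to live ones.

    agents: list of dicts with "id", "agent_type" and "alive" keys.
    routers_by_agent: dict mapping an agent id to the router ids it hosts.
    Returns a list of (router_id, from_agent_id, to_agent_id) tuples.
    """
    l3_agents = [a for a in agents if a.get("agent_type") == L3_AGENT_TYPE]
    alive = [a["id"] for a in l3_agents if a.get("alive")]
    dead = [a["id"] for a in l3_agents if not a.get("alive")]
    if not alive:
        return []  # nowhere to move the routers to

    # Spread the orphaned routers over the live agents round-robin.
    targets = cycle(alive)
    moves = []
    for agent_id in dead:
        for router_id in routers_by_agent.get(agent_id, []):
            moves.append((router_id, agent_id, next(targets)))
    return moves
```

Each planned move would then be applied by removing the router from the dead agent and adding it to the live one, for example with neutron l3-agent-router-remove followed by neutron l3-agent-router-add. Whether routers are moved back once the original agent returns is the configuration choice noted above.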

Also, you can and should run multiple dhcp_agents for the same network. They can be natively HA due to the nature of DHCP and the fact that leases are stored in a database. They both race to return the same address, so there is no conflict.

Do you have any additional questions on this confusing subject?


Comments

Hi Sam, first of all thanks a lot for making this clear. Finally, a definitive statement. It makes sense and explains the behavior I observe. I found the script on GitHub for reference.

lukas.pustina ( 2014-10-14 13:44:55 -0500 )

Anytime! Let me know if you have any additional questions about this. It is a very confusing subject when you don't have someone with a flashlight showing you the way. It's not so difficult with some help.

SamYaple ( 2014-10-15 09:29:35 -0500 )
0

answered 2014-12-01 08:28:54 -0500

First of all, you need 3 controller nodes in order to have HA. (With Pacemaker this is sort of a must! You need 3 servers to have quorum; if you have 2 controllers and one node fails, Pacemaker does not fail over!)
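The quorum arithmetic behind this can be written out as a short Python sketch. It assumes plain majority voting with one vote per node; what a cluster actually does when quorum is lost depends on how it is configured, so treat the numbers as an illustration only. The function names are made up for this example.

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can fail while the remainder still has a majority."""
    return nodes - quorum_size(nodes)


# Compare a 2-node and a 3-node cluster.
for n in (2, 3):
    print(n, "nodes: quorum", quorum_size(n),
          "tolerated failures", tolerated_failures(n))
```

With 2 nodes the majority is 2, so losing either node loses quorum; with 3 nodes the majority is still 2, so one node can fail.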

As for stopping the Neutron agents: if ip netns still shows you some output, it's probably because the ovs-cleanup script didn't run. (I think there is a bug on that.)



2 followers

Stats

Asked: 2014-10-09 08:30:22 -0500

Seen: 1,067 times

Last updated: Oct 14 '14