bonding vlan haproxy juno problem!

asked 2014-11-26 06:01:37 -0600

Hi, I'm having trouble setting this up!

I have 7 servers: 2 controllers, 3 compute nodes and 2 storage nodes, each with 6 Ethernet ports.

On each node I bonded all 6 ports in 802.3ad mode (layer2+3, miimon 100, lacp_rate 1). The ports are connected to an HP 2530-48G switch, with a trunk for every node set up as LACP manual.

On top of the bonding I have set up VLANs as follows:

  1. Controller nodes have 3 VLANs: bond0.11 (management network), bond0.12 (tunnel network) and bond0.21 (external network)
  2. Compute nodes: the same as above
  3. Storage nodes: one VLAN, bond0.11, for the management network.

I tagged all VLANs on the switch.

I am using Neutron with DVR. For tunneling tenant networks I use VXLAN.

On the storage servers I have set up HAProxy with keepalived, with a virtual IP on bond0.11.

All OpenStack services connect to HAProxy, which in turn load-balances the requests to the designated hosts.

I have a galera mariadb cluster on both controllers.

The problem that I am facing is the following:

After a fresh reboot everything works well, but after a few minutes the system starts misbehaving. The dashboard is almost unusable; I'm getting errors such as "error retrieving information", "error retrieving image list", "error retrieving volume list", etc.

If I log on to, let's say, controller1, source the admin credentials and issue a command like glance image-list (or any other command), sometimes it works and other times I get an HTTP 500 status. The errors don't follow a pattern: sometimes I get an error, other times I don't.

I am thinking of changing the VLAN and bonding MTU from 1500 to 1522 to accommodate the VLAN header.
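For reference, here is a rough sketch of the byte arithmetic behind that idea, using the standard header sizes for Ethernet, 802.1Q, IPv4, UDP and VXLAN. The function names are made up for illustration. As far as I understand it, 1522 is the size of a full VLAN-tagged frame for an MTU of 1500, not an MTU value in itself, and the VXLAN encapsulation on the tunnel network costs a further 50 bytes per packet.

```python
# Header sizes in bytes (standard values; function names are illustrative only).
ETH_HEADER = 14    # destination MAC + source MAC + EtherType
ETH_FCS = 4        # frame check sequence
VLAN_TAG = 4       # 802.1Q tag
IPV4_HEADER = 20   # without IP options
UDP_HEADER = 8
VXLAN_HEADER = 8


def max_frame_size(mtu, vlan_tagged):
    """Largest Ethernet frame on the wire for a given layer-3 MTU."""
    return mtu + ETH_HEADER + ETH_FCS + (VLAN_TAG if vlan_tagged else 0)


def vxlan_overhead():
    """Bytes VXLAN over IPv4 adds in front of the tenant's IP packet."""
    # outer IPv4 + outer UDP + VXLAN header + inner Ethernet header
    return IPV4_HEADER + UDP_HEADER + VXLAN_HEADER + ETH_HEADER


def tenant_mtu(underlay_mtu):
    """Largest tenant MTU that fits in one underlay packet without fragmentation."""
    return underlay_mtu - vxlan_overhead()


if __name__ == "__main__":
    print(max_frame_size(1500, vlan_tagged=True))  # 1522
    print(vxlan_overhead())                        # 50
    print(tenant_mtu(1500))                        # 1450
```

If this arithmetic applies here, the VLAN tag by itself should not require raising the MTU, but with a 1500-byte MTU on bond0.12 the instances would need an MTU of about 1450 (or the tunnel network a larger one) to avoid fragmentation.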

If anyone has an answer to this, please respond, because I have run out of ideas.

PS: I tried different types of bonding, with the same outcome!

PS2: On bond0 I can see dropped RX packets, more of them on the compute nodes.
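To compare the drop counters between nodes and between the bond and its slaves, a small script along these lines could be run on each host. It is only a sketch: it reads the kernel's per-interface rx_dropped counter from sysfs, and the interface names in the example (bond0, eth0 to eth5) are assumptions about how the ports are named.

```python
import os


def read_rx_dropped(iface, sysfs_root="/sys/class/net"):
    """Return the rx_dropped counter of one interface, or None if it cannot be read."""
    path = os.path.join(sysfs_root, iface, "statistics", "rx_dropped")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def rx_dropped_report(ifaces, sysfs_root="/sys/class/net"):
    """Map each interface name to its rx_dropped counter."""
    return {iface: read_rx_dropped(iface, sysfs_root) for iface in ifaces}


if __name__ == "__main__":
    # Interface names are assumptions; adjust them to the actual host.
    names = ["bond0"] + ["eth%d" % i for i in range(6)]
    for name, dropped in rx_dropped_report(names).items():
        print(name, dropped)
```

Running it twice a few minutes apart shows whether the counters are still growing and whether the drops come from one particular slave.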


Update: after looking deeper into the logs, I managed to narrow it down to MySQL. I am getting the "MySQL server has gone away" error. I increased wait_timeout from 600 seconds (the value I had before) to 28800 seconds. The question is: why do connections linger for more than 10 minutes? After 10 minutes MySQL cuts the connection, and then I get an error on the dashboard and from the CLI!

My second question is: why do OpenStack services keep trying to reuse the same MySQL connection, and why does it take a couple of retries to connect back to the database after the connection has been cut?
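One possible explanation for both questions is connection pooling: the services keep their database connections open in a pool and only replace a connection once it is older than a recycle time. If that recycle time is longer than MySQL's wait_timeout, the server closes idle connections that the pool still considers valid, and each such dead connection produces one "gone away" error when it is next used. The toy model below illustrates this; it is a simplified simulation, not the actual pool code, and the function name and the numbers are illustrative assumptions.

```python
def checkout(conn_age, idle_seconds, pool_recycle, wait_timeout):
    """Simulate taking one pooled connection out of the pool.

    conn_age      seconds since the connection was opened
    idle_seconds  seconds since the connection was last used
    pool_recycle  age after which the pool replaces a connection (None = never)
    wait_timeout  idle time after which the MySQL server closes a connection
    """
    if pool_recycle is not None and conn_age >= pool_recycle:
        return "recycled"    # pool opens a fresh connection, no error
    if idle_seconds >= wait_timeout:
        return "gone away"   # server already closed it, the query fails
    return "ok"


if __name__ == "__main__":
    # Recycle time (assumed 3600 s) longer than wait_timeout (600 s):
    print(checkout(900, 700, pool_recycle=3600, wait_timeout=600))  # gone away
    # Recycle time (assumed 300 s) shorter than wait_timeout (600 s):
    print(checkout(900, 700, pool_recycle=300, wait_timeout=600))   # recycled
```

If I remember correctly, the recycle time for the OpenStack services is the idle_timeout option in the [database] section of each service's configuration file, with a default of 3600 seconds; this is worth checking against the installed version. Setting it below wait_timeout, rather than raising wait_timeout to 8 hours, would then avoid both the errors and the pile-up of open connections. Several dead connections sitting in the same pool would also explain why it takes a couple of retries before a command succeeds.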

I will see what happens after the 8 hours have passed. The problem with this setup is that open connections to the database will pile up over those 8 hours!

It is strange, because in a normal OpenStack setup I haven't had this problem, only when using active/active HA.

I am using a MariaDB 5.5 Galera cluster. Services connect directly to the database and not through HAProxy, because I was having more connection trouble when MariaDB was set up behind HAProxy!

answered 2014-12-02 07:11:10 -0600

Is the time synchronized on all nodes?

Yes. The 2 controllers have NTP set to the same internet server, and all other nodes have NTP set to the two controllers.

