HAProxy and Galera on shared hosts: cannot connect over VIP
Summary: I have a highly available database cluster using Galera and HAProxy on top of Corosync/Pacemaker. I can connect to the database using a node's actual IP address, but not using the cluster's virtual IP (VIP).
The long and full explanation, with relevant configuration files
First off, there are some similar problems to be found on the internet, although none with the exact configuration I have. The most similar existing question/answer is this one: https://ask.openstack.org/en/question/25868/ha-not-able-to-connect-with-virtualip/
There are some subtle differences:
My configuration: 3 servers running as controllers, with all the OpenStack services running on bare metal. That includes HAProxy, Corosync, and Pacemaker. I.e. the database hosts are also the HAProxy hosts.
(We want high availability and no split-brain risk, but have only 5 machines available.)
I'm following the default installation guide at https://docs.openstack.org/ha-guide/, installing the current stable version of OpenStack on 5 machines running Debian 9.
We have a VLAN-capable switch, so additional networks beyond the two NICs available to each machine can be set up as VLANs.
The machines have a network set up for HAProxy: 10.0.44.0/24. The IP 10.0.44.250 is configured as the virtual IP (VIP) address. I can connect from any of the controllers (10.0.44.1, 10.0.44.2, 10.0.44.5) to 10.0.44.250 and verify that it is currently held by the first machine. I can SSH to it as well, modify a file, and check that the change succeeded. I have a working, running Galera cluster. I can connect with, say:
mysql -h 10.0.44.1 -D keystone -u keystone -p -P 3306
This works from all machines. (I have already implemented part of the Keystone configuration from the HA guide.) I can connect, view my empty keystone database, and run operations on it. These are applied on all cluster nodes.
However, when I try the same thing against the VIP:
mysql -h 10.0.44.250 -D keystone -u keystone -p -P 3306
I get this error:
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 0 "Internal error/check (Not system error)"
This is apparently a generic 'I could not connect' error. It normally reports the reason as a system error number, but in my case that number is 0, i.e. 'sorry, we don't know why'.
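Since the error is raised while reading the initial communication packet, it helps to tell apart "could not connect at all" from "connected, but the peer closed before sending the server greeting". The following is a minimal sketch of such a probe, not a definitive diagnostic. It assumes bash (for its /dev/tcp redirection) and the coreutils `timeout` command are available; the function names `classify_probe` and `probe_greeting` and the 5-second limit are my own choices for illustration.

```shell
#!/usr/bin/env bash
# Sketch: distinguish "TCP connect failed" from "connected, but the peer
# closed the connection before sending any MySQL/MariaDB greeting bytes".
# Function names and the 5 s timeout are illustrative assumptions.

# Map (exit status of the probe, bytes read) to a verdict.
classify_probe() {
    probe_rc=$1
    bytes_read=$2
    if [ "$probe_rc" -eq 124 ]; then
        # 124 is the exit status `timeout` uses when the time limit is hit.
        echo "timed out waiting for data"
    elif [ "$probe_rc" -ne 0 ]; then
        echo "connect failed"
    elif [ "$bytes_read" -gt 0 ]; then
        echo "greeting received"
    else
        echo "connected, closed without greeting"
    fi
}

# Open a TCP connection through bash's /dev/tcp and try to read one byte.
# A MySQL/MariaDB server speaks first, so a healthy endpoint sends data
# without the client having to write anything.
probe_greeting() {
    host=$1
    port=$2
    bytes=$(timeout 5 bash -c '
        exec 3<>"/dev/tcp/$0/$1" || exit 2
        head -c 1 <&3 | wc -c | tr -d " "
    ' "$host" "$port" 2>/dev/null)
    rc=$?
    classify_probe "$rc" "${bytes:-0}"
}

# Example invocations (not run here):
#   probe_greeting 10.0.44.1   3306
#   probe_greeting 10.0.44.250 3306
```

If the VIP behaves the way the mysql client error suggests, the probe against 10.0.44.250 would report a connection that closes without a greeting, while the probe against a node's own address would report a greeting. That would point at whatever answers on the VIP rather than at basic network reachability.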
Some additional checks from the shell, using telnet against a node's own address and against the VIP:
root@st01:/etc/mysql/mariadb.conf.d# telnet 10.0.44.1 3306
Trying 10.0.44.1...
Connected to 10.0.44.1.
Escape character is '^]'.
5.5.5-10.1.26-MariaDB-0+deb9u1<sB.]l'xJ-?▒GfL)&#}m**1Xmysql_native_password
^C Connection closed by foreign host.
root@st01:/etc/mysql/mariadb.conf.d# telnet 10.0.44.250 3306
Trying 10.0.44.250...
Connected to 10.0.44.250.
Escape character is '^]'.
Connection closed by foreign host.
root@st01:/etc/mysql/mariadb.conf.d# ip route get 10 ...