Database access in High Availability mode

asked 2018-07-26 20:11:30 -0600

codylab

updated 2018-07-26 22:06:15 -0600

I have been trying to set up a three-controller HA prototype on the Queens release. So far the following components appear to be working:

  1. Galera cluster with MariaDB
  2. RabbitMQ with HA queues
  3. HAProxy
  4. Pacemaker resource management for the VIP, HAProxy, and the OpenStack Keystone API

The installation is mainly based on the official OpenStack HA guide. Some changes were made for CentOS 7, as follows:

  1. For the Galera cluster, set wsrep_provider="/usr/lib64/galera/libgalera_smm.so" (the file path given in the guide does not work on CentOS 7).
  2. For the Pacemaker cluster resource, there is no openstack-keystone.service on RHEL/CentOS. Since Keystone runs under Apache via WSGI, a workaround is to use: pcs resource create openstack-keystone systemd:httpd --clone interleave=true
  3. The following two settings in /etc/keystone/keystone.conf from the guide appear to be deprecated and do not work in Queens:

    [catalog]
    driver = keystone.catalog.backends.sql.Catalog

    [identity]
    driver = keystone.identity.backends.sql.Identity

    So I used the default value "sql" instead.

Problem:

Although the Galera database cluster appears to be working (i.e. sync works), I get the following error when running an OpenStack CLI command that needs the database (e.g. openstack project list):

Unable to establish connection to http://192.168.10.10:35357/v3/auth/tokens: ('Connection aborted.', BadStatusLine("''",))

Note: 192.168.10.10 is the cluster VIP, and the port is open and reachable.

The log from /var/log/keystone/keystone.log shows the following error:

2018-07-26 20:29:28.905 27063 WARNING oslo_db.sqlalchemy.engines [req-1ad06ad5-92a6-4a4c-9488-d805ba6f1688 - - - - -] SQL connection failed. 10 attempts left.: DBConnectionError: (_mysql_exceptions.OperationalError) (2013, 'Lost connection to MySQL server at \'reading initial communication packet\', system error: 0 "Internal error/check (Not system error)"') (Background on this error at: http://sqlalche.me/e/e3q8)

### 9 more attempts... logs omitted... ###

2018-07-26 20:31:09.016 27063 ERROR keystone.common.wsgi [req-1ad06ad5-92a6-4a4c-9488-d805ba6f1688 - - - - -] (_mysql_exceptions.OperationalError) (2013, 'Lost connection to MySQL server at \'reading initial communication packet\', system error: 0 "Internal error/check (Not system error)"') (Background on this error at: http://sqlalche.me/e/e3q8): DBConnectionError: (_mysql_exceptions.OperationalError) (2013, 'Lost connection to MySQL server at \'reading initial communication packet\', system error: 0 "Internal error/check (Not system error)"') (Background on this error at: http://sqlalche.me/e/e3q8)

I can verify that the database is running and can be accessed via the mysql command line. Any changes are successfully synchronized to the other nodes by Galera. I have already disabled the firewall and SELinux for the sake of the experiment. What could be wrong in this case? Where should I start debugging?


Comments

Just to make sure: is the database accessible via the CLI on the VIP? I remember having some trouble with that, too. What does your HAProxy config for Galera look like?
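A minimal sketch of that VIP check, for cases where no mysql client is at hand. Error 2013 at 'reading initial communication packet' generally means the TCP connection was accepted but the server greeting never arrived, which is typically what a proxy with no healthy backend produces. The function names parse_greeting and probe, and the port 3306, are assumptions for illustration; 192.168.10.10 is the VIP from the question.

```python
import socket


def parse_greeting(data: bytes):
    """Classify the first bytes a MySQL/MariaDB endpoint sends after connect.

    Returns a (status, detail) tuple. This is a simplified reading of the
    wire format: a 4-byte packet header, then a payload whose first byte is
    the protocol version (or 0xFF for an error packet).
    """
    if not data:
        # Peer closed the connection without sending anything.
        return ("closed", None)
    if len(data) < 5:
        return ("short", None)
    payload = data[4:]  # skip 3-byte length + 1-byte sequence id
    if payload[0] == 0xFF:
        # Error packet: 2-byte error code, then message text.
        return ("error", payload[3:].decode("utf-8", "replace"))
    end = payload.find(b"\x00", 1)
    if end == -1:
        return ("short", None)
    # Server version is a NUL-terminated string after the protocol byte.
    return ("greeting", payload[1:end].decode("ascii", "replace"))


def probe(host: str, port: int = 3306, timeout: float = 5.0):
    """Connect to host:port and report what arrives before any login.

    A single recv() is enough here because the version string sits at the
    very start of the greeting.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(4096)
    except ConnectionResetError:
        return ("closed", None)
    except socket.timeout:
        return ("timeout", None)
    return parse_greeting(data)
```

Calling probe("192.168.10.10") and then probe() against each controller's own address would show whether the greeting is lost at the proxy: a "closed" result on the VIP combined with a "greeting" result on the nodes points at the HAProxy section rather than at Galera.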

eblock (2018-07-27 02:02:46 -0600)

@eblock Thank you for shedding light on this! CLI access through the VIP returned the same error. After changing haproxy.cfg it is now working! I can't post the full HAProxy Galera section because of the comment length limit, but I removed the "port 9200" part from all three server lines.

codylab (2018-07-27 08:29:11 -0600)

As of the Queens release, can we use active/active for the database now, or do we still have to stick with active/passive to avoid the deadlock issue?

codylab (2018-07-27 08:36:58 -0600)

Great, I'm glad it works! I also had to modify the Galera section of haproxy.cfg; the guide didn't work well for me.

eblock (2018-07-30 02:32:42 -0600)

I configured Galera in active/active mode and ran some quick tests (Pike release) in my lab HA cluster, and it worked very well. Of course, the deadlock issues would only occur under high load, with many instances, networks, etc.

eblock (2018-07-30 02:34:04 -0600)