Galera: Recover from a complete environment failure

asked 2020-02-28 07:49:37 -0500

CKi

updated 2020-02-28 16:16:16 -0500

Hi All,

Two weeks ago one of our switches failed, and as a result all three nodes of our cluster went offline. They are back online now, but logging in on the OpenStack website no longer works: the page takes a long time to load and then returns 504 Gateway Time-Out. According to the logs, the problem is with Keystone, and further troubleshooting led me to Galera: the MariaDB service is not running on any of our nodes. According to the docs, we have a "complete environment failure", because cat /var/lib/mysql/grastate.dat shows seqno: -1 on all nodes.

How can I recover from that?
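The check described above (seqno -1 in grastate.dat on every node) can be sketched in Python as follows. This is a minimal illustration, not a recovery tool: the sample content mirrors the usual layout of the file, the UUID is a placeholder, and the function names are hypothetical.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# GALERA saved state' comment header.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def has_no_saved_position(state):
    """True when seqno is -1 (or absent), i.e. no usable position was saved."""
    return int(state.get("seqno", "-1")) == -1


# Placeholder content; on a real node, read /var/lib/mysql/grastate.dat instead.
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 0
"""

state = parse_grastate(SAMPLE)
print(state["seqno"], has_no_saved_position(state))
```

If this reports no saved position on all nodes, the last committed position has to be recovered from the InnoDB data on each node before a bootstrap node can be chosen.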

You'll have to bootstrap the Galera cluster again. How is it managed? If it's a Pacemaker environment, it should recover by itself once you clean up the failed resources. If you have to do it manually, I would first try galera_recover (or similar). To do that you'll probably have to edit the grastate.dat file...

eblock ( 2020-02-28 12:57:14 -0500 )

The mysql log should usually mention something like that (I don't have a cluster at hand right now).

eblock ( 2020-02-28 12:58:34 -0500 )
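For the manual route, a rough sketch of how the bootstrap node could be chosen. It assumes each node was started once in recovery mode (for example with mysqld --wsrep-recover, or the galera_recovery wrapper shipped with MariaDB), which writes a "WSREP: Recovered position: <uuid>:<seqno>" line to the error log. The node names, UUID, and log lines below are made up for illustration, and the function names are hypothetical.

```python
import re

# Matches the line written to the error log after a --wsrep-recover run.
RECOVERED = re.compile(r"WSREP: Recovered position:\s*([0-9a-fA-F-]+):(-?\d+)")


def recovered_position(log_text):
    """Return (uuid, seqno) from the last 'Recovered position' line, or None."""
    matches = RECOVERED.findall(log_text)
    if not matches:
        return None
    uuid, seqno = matches[-1]
    return uuid, int(seqno)


def pick_bootstrap_node(positions):
    """Given {node: (uuid, seqno)}, return the node with the highest seqno."""
    uuids = {uuid for uuid, _ in positions.values()}
    if len(uuids) != 1:
        # Different UUIDs mean the nodes do not share one cluster history.
        raise ValueError("nodes report different cluster UUIDs: %s" % sorted(uuids))
    return max(positions, key=lambda node: positions[node][1])


# Made-up log excerpts, one per node.
UUID = "37bb872a-ad73-11e6-819f-f3b71d9c5ada"
logs = {
    "node1": "[Note] WSREP: Recovered position: %s:1205\n" % UUID,
    "node2": "[Note] WSREP: Recovered position: %s:1207\n" % UUID,
    "node3": "[Note] WSREP: Recovered position: %s:1199\n" % UUID,
}

positions = {node: recovered_position(text) for node, text in logs.items()}
print(pick_bootstrap_node(positions))
```

The node with the highest recovered seqno holds the most recent data. That is the node where safe_to_bootstrap would be set to 1 in grastate.dat before bootstrapping (on systemd-based MariaDB installs, typically with galera_new_cluster); the other nodes are then started normally and rejoin.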

This already happened here; we used the kolla-ansible playbook to recover the cluster. Even if you don't have Kolla, you can follow the playbook:

chalans ( 2020-02-28 15:49:21 -0500 )

@eblock Thanks for the suggestion. mysql just returns ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111 "Connection refused"). /etc/mysql/my.cnf mentions a log named /var/log/mysql_logs/galera_server_error.log, but this file is empty.

CKi ( 2020-02-28 16:20:18 -0500 )

Of course mysql returns an error, since Galera is not running. But depending on your config, you should find something in one of the logs. In my environment Galera writes to /var/log/mysql/mysqld.log.

eblock ( 2020-03-02 02:54:37 -0500 )
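To see quickly which of the log locations mentioned in this thread actually contain anything, something like the following could be run on each node. It is only a sketch: the two paths are the ones named above, other setups may log elsewhere, and the function name is hypothetical.

```python
from pathlib import Path


def describe_logs(paths):
    """Map each path to 'missing', 'empty', or its size in bytes."""
    report = {}
    for path in map(Path, paths):
        if not path.is_file():
            report[str(path)] = "missing"
        elif path.stat().st_size == 0:
            report[str(path)] = "empty"
        else:
            report[str(path)] = "%d bytes" % path.stat().st_size
    return report


# Log locations mentioned in this thread; adjust to the local configuration.
CANDIDATES = [
    "/var/log/mysql_logs/galera_server_error.log",
    "/var/log/mysql/mysqld.log",
]

for name, status in describe_logs(CANDIDATES).items():
    print(name, "->", status)
```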