Ask Your Question

Revision history [back]

click to hide/show revision 1
initial version

openstack kolla - HA databases unable to start

I've installed an 8 node cloud with openstack kolla ansible, with dual controllers.

It all works fine, until I stop and start it. (Which is related to my earlier question on stopping a cloud)

On boot, the mariadb/galera components dont seem to be able to start/sync.

I started the first controller and maria db log says: (from the previous shutdown)

2018-06-08 17:26:00 140662590556928 [Note] InnoDB: Online DDL : Completed
2018-06-08 20:12:08 140664414825216 [Note] /usr/libexec/mysqld: Normal shutdown

2018-06-08 20:12:08 140664414825216 [Note] WSREP: Stop replication
2018-06-08 20:12:08 140664414825216 [Note] WSREP: Closing send monitor...
2018-06-08 20:12:08 140664414825216 [Note] WSREP: Closed send monitor.
2018-06-08 20:12:11 140664414825216 [Note] WSREP: gcomm: terminating thread
2018-06-08 20:12:11 140664414825216 [Note] WSREP: gcomm: joining thread
2018-06-08 20:12:11 140664414825216 [Note] WSREP: gcomm: closing backend
2018-06-08 20:12:15 140664414825216 [Note] WSREP: (958d6a22, 'tcp://172.16.101.100:4567') turning message relay requesting on, nonlive peers: tcp://172.16.101.109:4567
2018-06-08 20:12:16 140664414825216 [Note] WSREP: (958d6a22, 'tcp://172.16.101.100:4567') reconnecting to 8361901d (tcp://172.16.101.109:4567), attempt 0
180611 08:44:29 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/
180611 08:44:29 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql//wsrep_recovery.zFLp3L' --pid-file='/var/lib/mysql//ctrl1.lab1.stack-recover.pid'
2018-06-11  8:44:29 139883731843264 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 184 ...
180611 08:44:51 mysqld_safe WSREP: Recovered position 6f86c600-6b10-11e8-97f6-9b647609adad:43931
2018-06-11  8:44:52 140457847679168 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 220 ...
2018-06-11  8:44:52 140457847679168 [Note] WSREP: Read nil XID from storage engines, skipping position init
2018-06-11  8:44:52 140457847679168 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2018-06-11  8:44:52 140457847679168 [Note] WSREP: wsrep_load(): Galera 3.16(r5c765eb) by Codership Oy <info@codership.com> loaded successfully.
2018-06-11  8:44:52 140457847679168 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2018-06-11  8:44:52 140457847679168 [Note] WSREP: Found saved state: 6f86c600-6b10-11e8-97f6-9b647609adad:-1
2018-06-11  8:44:52 140457847679168 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.16.101.100; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://172.16.101.100:4567; gmcast.segment = 0; gmcast.version = 0; ist.recv_addr = 172.16.1
2018-06-11  8:44:52 140457752131328 [Note] WSREP: Service thread queue flushed.
2018-06-11  8:44:52 140457847679168 [Note] WSREP: Assign initial position for certification: 43931, protocol version: -1
2018-06-11  8:44:52 140457847679168 [Note] WSREP: wsrep_sst_grab()
2018-06-11  8:44:52 140457847679168 [Note] WSREP: Start replication
2018-06-11  8:44:52 140457847679168 [Note] WSREP: Setting initial position to 6f86c600-6b10-11e8-97f6-9b647609adad:43931
2018-06-11  8:44:52 140457847679168 [Note] WSREP: protonet asio version 0
2018-06-11  8:44:52 140457847679168 [Note] WSREP: Using CRC-32C for message checksums.
2018-06-11  8:44:52 140457847679168 [Note] WSREP: backend: asio
2018-06-11  8:44:52 140457847679168 [Note] WSREP: gcomm thread scheduling priority set to other:0
2018-06-11  8:44:52 140457847679168 [Note] WSREP: restore pc from disk successfully
2018-06-11  8:44:52 140457847679168 [Note] WSREP: GMCast version 0
2018-06-11  8:44:52 140457847679168 [Note] WSREP: (958d6a22, 'tcp://172.16.101.100:4567') listening at tcp://172.16.101.100:4567
2018-06-11  8:44:52 140457847679168 [Note] WSREP: (958d6a22, 'tcp://172.16.101.100:4567') multicast: , ttl: 1
2018-06-11  8:44:52 140457847679168 [Note] WSREP: EVS version 0
2018-06-11  8:44:52 140457847679168 [Note] WSREP: gcomm: connecting to group 'openstack', peer '172.16.101.100:4567,172.16.101.109:4567'
2018-06-11  8:44:55 140457847679168 [Warning] WSREP: no nodes coming from prim view, prim not possible
2018-06-11  8:44:55 140457847679168 [Note] WSREP: view(view_id(NON_PRIM,958d6a22,6) memb {
        958d6a22,0
} joined {
} left {
} partitioned {
})
2018-06-11  8:44:55 140457847679168 [Note] WSREP: gcomm: connected
2018-06-11  8:44:55 140457847679168 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2018-06-11  8:44:55 140457847679168 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2018-06-11  8:44:55 140457847679168 [Note] WSREP: Opened channel 'openstack'
2018-06-11  8:44:55 140457847679168 [Note] WSREP: Waiting for SST to complete.
2018-06-11  8:44:55 140457501775616 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-06-11  8:44:55 140457501775616 [Note] WSREP: Flow-control interval: [16, 16]
2018-06-11  8:44:55 140457501775616 [Note] WSREP: Received NON-PRIMARY.
2018-06-11  8:44:55 140457846037248 [Note] WSREP: New cluster view: global state: 6f86c600-6b10-11e8-97f6-9b647609adad:43931, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1
2018-06-11  8:44:55 140457510168320 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.53274S), skipping check
2018-06-11  8:46:55 140457510168320 [Note] WSREP: (958d6a22, 'tcp://172.16.101.100:4567') turning message relay requesting on, nonlive peers:
2018-06-11  8:46:56 140457510168320 [Note] WSREP: declaring 3b0c00da at tcp://172.16.101.109:4567 stable
2018-06-11  8:46:56 140457510168320 [Warning] WSREP: no nodes coming from prim view, prim not possible
2018-06-11  8:46:56 140457510168320 [Note] WSREP: view(view_id(NON_PRIM,3b0c00da,7) memb {
        3b0c00da,0
        958d6a22,0
} joined {
} left {
} partitioned {
})
2018-06-11  8:46:56 140457501775616 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
2018-06-11  8:46:56 140457501775616 [Note] WSREP: Flow-control interval: [23, 23]
2018-06-11  8:46:56 140457501775616 [Note] WSREP: Received NON-PRIMARY.
2018-06-11  8:46:56 140457846037248 [Note] WSREP: New cluster view: global state: 6f86c600-6b10-11e8-97f6-9b647609adad:43931, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version -1

And then I boot the second controller, and the complaints start...

2018-06-08 20:12:11 140403685165824 [Note] /usr/libexec/mysqld: Normal shutdown

2018-06-08 20:12:11 140403685165824 [Note] WSREP: Stop replication
2018-06-08 20:12:11 140403685165824 [Note] WSREP: Closing send monitor...
2018-06-08 20:12:11 140403685165824 [Note] WSREP: Closed send monitor.
2018-06-08 20:12:11 140403685165824 [Note] WSREP: gcomm: terminating thread
2018-06-08 20:12:11 140403685165824 [Note] WSREP: gcomm: joining thread
2018-06-08 20:12:11 140403685165824 [Note] WSREP: gcomm: closing backend
2018-06-08 20:12:12 140403685165824 [Warning] WSREP: last inactive check more than PT1.5S ago (PT1.50901S), skipping check
2018-06-08 20:12:12 140403685165824 [Note] WSREP: forgetting 958d6a22 (tcp://172.16.101.100:4567)
2018-06-08 20:12:12 140403685165824 [Note] WSREP: Node 8361901d state prim
2018-06-08 20:12:12 140403685165824 [Note] WSREP: view(view_id(PRIM,8361901d,5) memb {
        8361901d,0
} joined {
} left {
} partitioned {
        958d6a22,0
})
2018-06-08 20:12:12 140403685165824 [Note] WSREP: save pc into disk
2018-06-08 20:12:12 140403685165824 [Note] WSREP: forgetting 958d6a22 (tcp://172.16.101.100:4567)
2018-06-08 20:12:12 140403685165824 [Note] WSREP: view((empty))
2018-06-08 20:12:12 140409816389376 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2018-06-08 20:12:12 140403685165824 [Note] WSREP: gcomm: closed
2018-06-08 20:12:12 140409816389376 [Warning] WSREP: 0x55fe6cde9a40 down context(s) not set
2018-06-08 20:12:12 140409816389376 [Warning] WSREP: Failed to send state UUID: -107 (Transport endpoint is not connected)
2018-06-08 20:12:12 140409816389376 [Note] WSREP: Received self-leave message.
2018-06-08 20:12:12 140409816389376 [Note] WSREP: Flow-control interval: [0, 0]
2018-06-08 20:12:12 140409816389376 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2018-06-08 20:12:13 140409816389376 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 43931)
2018-06-08 20:12:13 140409816389376 [Note] WSREP: RECV thread exiting 0: Success
2018-06-08 20:12:13 140403685165824 [Note] WSREP: recv_thread() joined.
2018-06-08 20:12:13 140403685165824 [Note] WSREP: Closing replication queue.
2018-06-08 20:12:13 140403685165824 [Note] WSREP: Closing slave action queue.
2018-06-08 20:12:13 140410156686080 [Note] WSREP: New cluster view: global state: 6f86c600-6b10-11e8-97f6-9b647609adad:43931, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3
2018-06-08 20:12:13 140410156686080 [Note] WSREP: applier thread exiting (code:0)
2018-06-08 20:12:13 140409393486592 [Note] WSREP: applier thread exiting (code:6)
2018-06-08 20:12:13 140403686087424 [Note] WSREP: applier thread exiting (code:6)
2018-06-08 20:12:13 140403685780224 [Note] WSREP: applier thread exiting (code:6)
2018-06-08 20:12:15 140410156993280 [Note] WSREP: rollbacker thread exiting
2018-06-08 20:12:15 140403685165824 [Note] Event Scheduler: Purging the queue. 0 events
2018-06-08 20:12:15 140403685165824 [Note] WSREP: dtor state: CLOSED
2018-06-08 20:12:15 140403685165824 [Note] WSREP: mon: entered 43931 oooe fraction 0 oool fraction 0
2018-06-08 20:12:15 140403685165824 [Note] WSREP: mon: entered 43931 oooe fraction 0.0228313 oool fraction 0
2018-06-08 20:12:15 140403685165824 [Note] WSREP: mon: entered 44512 oooe fraction 0 oool fraction 2.24659e-05
2018-06-08 20:12:15 140403685165824 [Note] WSREP: cert index usage at exit 0
2018-06-08 20:12:15 140403685165824 [Note] WSREP: cert trx map usage at exit 115
2018-06-08 20:12:15 140403685165824 [Note] WSREP: deps set usage at exit 0
2018-06-08 20:12:15 140403685165824 [Note] WSREP: avg deps dist 23.6184
2018-06-08 20:12:15 140403685165824 [Note] WSREP: avg cert interval 0.0276115
2018-06-08 20:12:15 140403685165824 [Note] WSREP: cert index size 125
2018-06-08 20:12:15 140409858352896 [Note] WSREP: Service thread queue flushed.
2018-06-08 20:12:15 140403685165824 [Note] WSREP: wsdb trx map usage 0 conn query map usage 0
2018-06-08 20:12:15 140403685165824 [Note] WSREP: MemPool(LocalTrxHandle): hit ratio: 0, misses: 0, in use: 0, in pool: 0
2018-06-08 20:12:15 140403685165824 [Note] WSREP: MemPool(SlaveTrxHandle): hit ratio: 0.99695, misses: 134, in use: 0, in pool: 134
2018-06-08 20:12:15 140403685165824 [Note] WSREP: Shifting CLOSED -> DESTROYED (TO: 43931)
2018-06-08 20:12:15 140403685165824 [Note] WSREP: Flushing memory map to disk...
2018-06-08 20:12:15 140403663230720 [Note] InnoDB: FTS optimize thread exiting.
2018-06-08 20:12:15 140403685165824 [Note] InnoDB: Starting shutdown...
2018-06-08 20:12:16 140403685165824 [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool
2018-06-08 20:12:18 140403685165824 [Note] InnoDB: Shutdown completed; log sequence number 62731194
2018-06-08 20:12:18 140403685165824 [Note] /usr/libexec/mysqld: Shutdown complete

180608 20:12:18 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended
180611 08:46:44 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/
180611 08:46:45 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql//wsrep_recovery.wUzxIp' --pid-file='/var/lib/mysql//ctrl2.lab1.stack-recover.pid'
2018-06-11  8:46:45 139745544124608 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 182 ...
180611 08:46:54 mysqld_safe WSREP: Recovered position 6f86c600-6b10-11e8-97f6-9b647609adad:43931
2018-06-11  8:46:54 140699541412032 [Note] /usr/libexec/mysqld (mysqld 10.1.20-MariaDB) starting as process 217 ...
2018-06-11  8:46:55 140699541412032 [Note] WSREP: Read nil XID from storage engines, skipping position init
2018-06-11  8:46:55 140699541412032 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2018-06-11  8:46:55 140699541412032 [Note] WSREP: wsrep_load(): Galera 3.16(r5c765eb) by Codership Oy <info@codership.com> loaded successfully.
2018-06-11  8:46:55 140699541412032 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2018-06-11  8:46:55 140699541412032 [Note] WSREP: Found saved state: 6f86c600-6b10-11e8-97f6-9b647609adad:43931
2018-06-11  8:46:55 140699541412032 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.16.101.109; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://172.16.101.109:4567; gmcast.segment = 0; gmcast.version = 0; ist.recv_addr = 172.16.1
2018-06-11  8:46:55 140699236296448 [Note] WSREP: Service thread queue flushed.
2018-06-11  8:46:55 140699541412032 [Note] WSREP: Assign initial position for certification: 43931, protocol version: -1
2018-06-11  8:46:55 140699541412032 [Note] WSREP: wsrep_sst_grab()
2018-06-11  8:46:55 140699541412032 [Note] WSREP: Start replication
2018-06-11  8:46:55 140699541412032 [Note] WSREP: Setting initial position to 6f86c600-6b10-11e8-97f6-9b647609adad:43931
2018-06-11  8:46:55 140699541412032 [Note] WSREP: protonet asio version 0
2018-06-11  8:46:55 140699541412032 [Note] WSREP: Using CRC-32C for message checksums.
2018-06-11  8:46:55 140699541412032 [Note] WSREP: backend: asio
2018-06-11  8:46:55 140699541412032 [Note] WSREP: gcomm thread scheduling priority set to other:0
2018-06-11  8:46:55 140699541412032 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2018-06-11  8:46:55 140699541412032 [Note] WSREP: restore pc from disk failed
2018-06-11  8:46:55 140699541412032 [Note] WSREP: GMCast version 0
2018-06-11  8:46:55 140699541412032 [Note] WSREP: (3b0c00da, 'tcp://172.16.101.109:4567') listening at tcp://172.16.101.109:4567
2018-06-11  8:46:55 140699541412032 [Note] WSREP: (3b0c00da, 'tcp://172.16.101.109:4567') multicast: , ttl: 1
2018-06-11  8:46:55 140699541412032 [Note] WSREP: EVS version 0
2018-06-11  8:46:55 140699541412032 [Note] WSREP: gcomm: connecting to group 'openstack', peer '172.16.101.100:4567,172.16.101.109:4567'
2018-06-11  8:46:55 140699541412032 [Note] WSREP: (3b0c00da, 'tcp://172.16.101.109:4567') turning message relay requesting on, nonlive peers:
2018-06-11  8:46:56 140699541412032 [Note] WSREP: declaring 958d6a22 at tcp://172.16.101.100:4567 stable
2018-06-11  8:46:56 140699541412032 [Warning] WSREP: no nodes coming from prim view, prim not possible
2018-06-11  8:46:56 140699541412032 [Note] WSREP: view(view_id(NON_PRIM,3b0c00da,7) memb {
        3b0c00da,0
        958d6a22,0
} joined {
} left {
} partitioned {
})
2018-06-11  8:46:59 140699541412032 [Note] WSREP: (3b0c00da, 'tcp://172.16.101.109:4567') turning message relay requesting off
2018-06-11  8:47:26 140699541412032 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():162
2018-06-11  8:47:26 140699541412032 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2018-06-11  8:47:26 140699541412032 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1380: Failed to open channel 'openstack' at 'gcomm://172.16.101.100:4567,172.16.101.109:4567': -110 (Connection timed out)
2018-06-11  8:47:26 140699541412032 [ERROR] WSREP: gcs connect failed: Connection timed out
2018-06-11  8:47:26 140699541412032 [ERROR] WSREP: wsrep::connect(gcomm://172.16.101.100:4567,172.16.101.109:4567) failed: 7
2018-06-11  8:47:26 140699541412032 [ERROR] Aborting

180611 08:47:27 mysqld_safe mysqld from pid file /var/lib/mysql/mariadb.pid ended
180611 08:47:29 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql/
180611 08:47:29 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql//wsrep_recovery.GVbOW6' --pid-file='/var/lib/mysql//ctrl2.lab1.stack-recover.pid'

Now I could start suspecting that controller2 on .109 did not terminate properly, but then again I cant find any clear info on in which order I SHOULD terminate a kolla ansible cloud...

The simple question though, is how I get the the mariadbs happy again, since the 2nd controller is just looping, trying to restart, complaining on "WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)"

Help? :'(

TQ