
Galera breaks then all the controllers go down

asked 2017-03-01 00:21:37 -0600 by JD_Marks

These controller nodes have not been rebooted or stopped. We install the cluster and get it running, but after a few weeks of use Galera appears to get out of sync and we end up in the state shown in the pcs status output below. The entire stack stops working. Can anyone give us an idea of how to get to the bottom of this? We have seen this on multiple clusters, on both Liberty and Mitaka.

[root@overcloud-controller-1 log]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Wed Mar  1 06:01:15 2017          Last change: Wed Mar  1 05:50:30 2017 by hacluster via crmd on    overcloud-controller-0

2 nodes and 87 resources configured

Online: [ overcloud-controller-1 ]
OFFLINE: [ overcloud-controller-0 ]

Full list of resources:

 ip-172.16.0.10 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-172.18.0.10 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-10.1.32.80  (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 Master/Slave Set: galera-master [galera]
     Slaves: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 ip-172.16.0.11 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-core-clone [openstack-core]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 ip-172.22.0.22 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 ip-172.19.0.10 (ocf::heartbeat:IPaddr2):       Started overcloud-controller-1
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-1 ]
     Stopped: [ overcloud-controller-0 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Stopped
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-sahara-api-clone [openstack-sahara-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: delay-clone [delay]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: neutron-server-clone [neutron-server]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: httpd-clone [httpd]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 ]

Failed Actions:
* openstack-heat-engine_start_0 on overcloud-controller-1 'not running' (7): call=303, status=complete, exitreason='none',
    last-rc-change='Mon Feb  6 23:19:13 2017', queued=0ms, exec ...
(output truncated)
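
For what it's worth, these are the generic Galera status checks we can run on the surviving controller when it gets into this state (standard MariaDB/Galera commands; the grastate.dat path assumes the stock TripleO MariaDB layout, and the mysql queries only work while mysqld is still up):

[root@overcloud-controller-1 ~]# cat /var/lib/mysql/grastate.dat                                  # last known cluster UUID and seqno
[root@overcloud-controller-1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"       # Primary vs non-Primary component
[root@overcloud-controller-1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"         # how many nodes Galera can see
[root@overcloud-controller-1 ~]# mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"  # Synced / Donor / Joining etc.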

Comments

You don't have quorum. I expect pcs resource cleanup to rely on a proper quorum setup (3, 5, ... controllers in the cluster).
This is just a guess, based on pcs resource cleanup fixing similar issues on my own 3-node PCS/Corosync controller clusters.

dbaxps ( 2017-03-01 08:35:22 -0600 )

1 answer


answered 2017-03-01 09:23:38 -0600

dbaxps

updated 2017-03-02 15:23:13 -0600

UPDATE 03/01/17 19:41 MSK
Quoting http://stackoverflow.com/questions/23...

The main reason that three servers is the recommended minimum is to increase the chance that a quorum will exist in the event of a network problem. If a cluster has two nodes (or more generally any even number of nodes) a single network link failure could cause the cluster to pause since it could create two partitions with half of the nodes, neither having a quorum. An odd number of nodes means that a single network link failure cannot cause the cluster to pause since there will always be quorum. If there is more than one network link failure, however, things can get more complicated but only a partition with a quorum will function normally.
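Whether the surviving node actually holds quorum, and how the votes line up, can be checked on that node with corosync's own tool (standard corosync 2.x utility; the exact output fields vary by version):

[root@overcloud-controller-1 ~]# corosync-quorumtool -s    # shows expected votes, total votes and the Quorate flag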

See also Galera PCS Cluster Recover

END UPDATE

Just a guess regarding your issue. From your report above:

[root@overcloud-controller-1 log]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-1 (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Wed Mar  1 06:01:15 2017          Last change: Wed Mar  1 05:50:30 2017 by hacluster via crmd on    overcloud-controller-0

However, when quorum enforcement is enabled and one of the two nodes fails, the remaining node cannot establish the majority of quorum votes necessary to run services, and so it is unable to take over any resources.
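
If adding a third controller is really not an option, the usual two-node workarounds are either corosync's two-node mode or relaxing Pacemaker's quorum policy. Both are illustrative sketches only, with the well-known split-brain caveats, and three controllers remains the proper fix:

# corosync.conf on both nodes: declare a deliberate two-node cluster (votequorum two_node mode)
quorum {
    provider: corosync_votequorum
    two_node: 1
}

# or: tell Pacemaker to keep running resources even when quorum is lost (split-brain risk)
[root@overcloud-controller-1 ~]# pcs property set no-quorum-policy=ignore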

Comments

What I am more concerned about is why this keeps happening. We are currently in test and are starting commercial deployment. This can't happen in production. Thanks for the answer. I will give the recovery a shot.

JD_Marks ( 2017-03-04 09:36:55 -0600 )

I did my very first TripleO (Newton) deployment with 3-node PCS/Corosync HA controllers and, several times over a span of a few days, saw pcs status report galera_monitor failures. I just ran pcs resource cleanup, i.e. restarted all resources (see the sketch below).
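
For reference, that was nothing fancier than the standard pcs commands (the galera resource name is taken from pcs status; adjust to your cluster):

pcs resource cleanup           # reset failcounts and re-probe every resource
pcs resource cleanup galera    # or target just the galera master/slave resource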

dbaxps ( 2017-03-04 11:49:44 -0600 )
