Some instances failed to create or delete

asked 2019-06-06 14:30:00 -0500

Newedge245

Hi experts. I'm facing a very interesting issue. I can't provide logs for now. I'm using Queens. I have noticed that a few instances hang in the build state; they are part of stacks that cannot be deleted because of that. I also cannot delete other stacks because some instances hang in the deleting state. I can delete those instances by setting them to the error state and then deleting them.
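For readers who want to script the workaround just mentioned (force the instance into the error state, then delete it), here is a minimal Python sketch. It assumes the `openstack` command-line client is installed and that credentials are already loaded in the environment; the function names are made up for illustration, and by default nothing is executed, the commands are only printed.

```python
import subprocess


def reset_and_delete_commands(server_id):
    """Build the two CLI calls: set the server to error state, then delete it."""
    return [
        ["openstack", "server", "set", "--state", "error", server_id],
        ["openstack", "server", "delete", server_id],
    ]


def reset_and_delete(server_id, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in reset_and_delete_commands(server_id):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails
            subprocess.run(cmd, check=True)
```

Calling `reset_and_delete("<server-id>")` prints both commands so they can be reviewed before running them with `dry_run=False`.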

Most stacks are created and deleted successfully. There are deadlocks in the MariaDB logs, and also API DBError messages in the nova-api logs. I suspect that nova-api requests a port from neutron, neutron gets a deadlock from MariaDB, MariaDB rolls back, and nova returns an error to the user.

The setup includes 80 compute nodes. Maybe it's related to scale?

Maybe you have a better theory?

Comments

If something as fundamental as the database doesn't work, it's understandable you have problems. Before anything else, you need to fix the DB problem.

Installations of that size normally have DB HA e.g. Percona or Galera. This also spreads the load over several DB servers. What is your setup?

Bernd Bausch (2019-06-06 19:13:11 -0500)

Thanks Bernd. It's a 3-node MariaDB cluster with HAProxy in front. But HAProxy is not spreading the load; it points to one node and moves to the others only if that node is down. My next step is to install Prometheus in order to monitor the database activity.

Newedge245 (2019-06-06 23:26:52 -0500)

OpenStack services lend themselves to active/active setups (with a few exceptions, such as cinder-volume, which needs to remain active/passive afaik). Galera (or Percona) as well. Good luck measuring the database.

Bernd Bausch (2019-06-07 05:49:56 -0500)

The database is working great. The problem looks like deadlocks in the API layer. For example, neutron retries an operation 10 times with a sleep of 0.1 seconds between attempts; changing the retry count to 1000 solves the issue. It's happening in nova-conductor as well. My plan is to upgrade, since the code changed in recent versions.
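The retry behaviour described in this comment can be sketched roughly as below. This is a hypothetical illustration of a fixed-interval retry loop, not the actual neutron or nova code: `DeadlockError` and `retry_on_deadlock` are names invented here, and the defaults (10 retries, 0.1 s sleep) are simply the numbers quoted above.

```python
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the deadlock error raised by the DB layer."""


def retry_on_deadlock(operation, max_retries=10, interval=0.1, sleep=time.sleep):
    """Run operation(); on DeadlockError, sleep and retry up to max_retries times.

    The operation is attempted once, plus at most max_retries more times.
    The last DeadlockError is re-raised when the retries are exhausted.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except DeadlockError:
            attempt += 1
            if attempt > max_retries:
                raise
            sleep(interval)
```

With these defaults the loop gives up after about one second of waiting, so a burst of contention that lasts longer than that surfaces as an API error. Raising `max_retries` to 1000 stretches the window to roughly 100 seconds, which would hide the deadlocks rather than remove them.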

Newedge245 (2019-06-15 03:20:03 -0500)

Check your logs (especially nova) for "Too many open files" messages; I just had that in a medium-sized cloud. I'm not sure why those messages appeared, but as a quick temporary solution, restarting the nova-compute service on the respective compute nodes resolved it (for now).
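A quick way to run that check is to count the matching lines in a log. The following is a minimal sketch; the helper name `count_matches` is made up here, and the log location depends on how the services were deployed.

```python
def count_matches(lines, needle="Too many open files"):
    """Count the lines that contain the given message."""
    return sum(1 for line in lines if needle in line)


def count_matches_in_file(path, needle="Too many open files"):
    """Same check against a log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as log:
        return count_matches(log, needle)
```

For example, `count_matches_in_file("/var/log/nova/nova-compute.log")` would scan a nova-compute log, assuming that path matches your installation.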

eblock (2019-06-20 09:29:55 -0500)