Sahara cluster fails intermittently

asked 2014-11-11 21:14:18 -0500

belle
Icehouse
Sahara 0.7.1
Anti-affinity enabled in nova by adding DifferentHostFilter to scheduler_default_filters in nova.conf

Issue:
Clusters fail to start intermittently. When running multiple clusters, we see at least one or two of them fail. Sometimes it also fails with just one cluster.

I noticed that two of the worker instances sometimes end up running on the same compute node, even though the anti-affinity option is checked in Sahara, and that is when the cluster fails to start. I got the following error from sahara.log. After checking the nova-compute logs, I found that worker-008, the instance that failed in the error below, landed on node6 along with worker-004, so the cluster failed to start.

2014-11-11 01:18:53.846 8848 ERROR sahara.context [-] Thread 'configure-instance-sccloud01-1415668439-worker-008' fails with exception: 'error: [Errno 113] No route to host'
2014-11-11 01:18:53.846 8848 TRACE sahara.context Traceback (most recent call last):
2014-11-11 01:18:53.846 8848 TRACE sahara.context   File "/opt/stack/venvs/sahara/local/lib/python2.7/site-packages/sahara/context.py", line 128, in _wrapper
2014-11-11 01:18:53.846 8848 TRACE sahara.context     raise SubprocessException(result['exception'])
2014-11-11 01:18:53.846 8848 TRACE sahara.context SubprocessException: error: [Errno 113] No route to host
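
For anyone trying to confirm the same symptom, here is a minimal sketch of the check I did by hand against the nova-compute logs: group instances by compute host and report any host running more than one worker. The helper name find_colocated and the placements dictionary are hypothetical. Only the worker-004 / worker-008 / node6 entries come from my logs; the other entry is made up for illustration, and the mapping would have to be filled in from your own deployment.

```python
from collections import defaultdict


def find_colocated(placements):
    """Return {host: [instance, ...]} for hosts running more than one instance.

    `placements` maps instance name -> compute host name.
    """
    by_host = defaultdict(list)
    for instance, host in placements.items():
        by_host[host].append(instance)
    # Keep only hosts where anti-affinity was violated (two or more instances).
    return {
        host: sorted(names)
        for host, names in by_host.items()
        if len(names) > 1
    }


# worker-004 and worker-008 on node6 are taken from the logs above.
# worker-001 on node2 is a hypothetical entry added for contrast.
placements = {
    "worker-004": "node6",
    "worker-008": "node6",
    "worker-001": "node2",
}

violations = find_colocated(placements)
for host, names in sorted(violations.items()):
    print("%s runs %d workers: %s" % (host, len(names), ", ".join(names)))
```

An empty result would mean every worker landed on its own host, so a failure in that case would point away from placement.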

Does anyone know whether this is a bug in the scheduler, or am I missing something in the configuration? Any help would be greatly appreciated.

Thanks!
