Sahara cluster fails intermittently
Icehouse
Sahara 0.7.1
Anti-affinity enabled in nova by adding DifferentHostFilter to scheduler_default_filters in nova.conf
Issue:
Clusters fail to start intermittently. When launching multiple clusters, at least one or two of them fail, and sometimes the launch fails even with just one cluster.
I noticed that sometimes two of the worker instances end up running on the same compute node even though the anti-affinity option is checked in Sahara, and that is when the cluster fails to start. I got the following error in sahara.log. After checking the nova-compute logs, I found that worker-008, the instance that failed according to the error below, landed on node6 along with worker-004.
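To make the check above repeatable, here is a minimal sketch of how co-located workers could be spotted once the instance-to-host mapping has been collected by hand (for example from the nova-compute logs). The `find_colocated` helper and the `placements` dictionary are hypothetical names for illustration; only the worker-004/worker-008 pair on node6 comes from my logs, the other entry is made up.

```python
from collections import defaultdict


def find_colocated(placements):
    """Return {host: [instances]} for every host running more than one instance."""
    by_host = defaultdict(list)
    for instance, host in placements.items():
        by_host[host].append(instance)
    # Keep only hosts that violate anti-affinity (two or more instances).
    return {host: sorted(names) for host, names in by_host.items() if len(names) > 1}


# Hypothetical placement data gathered manually.
placements = {
    "worker-004": "node6",   # from the nova-compute logs
    "worker-008": "node6",   # from the nova-compute logs
    "worker-001": "node1",   # assumed, for illustration only
}

colocated = find_colocated(placements)
print(colocated)
```

Any host that shows up in the result is running more than one worker, which is the situation I see whenever a cluster fails.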
2014-11-11 01:18:53.846 8848 ERROR sahara.context [-] Thread 'configure-instance-sccloud01-1415668439-worker-008' fails with exception: 'error: [Errno 113] No route to host'
2014-11-11 01:18:53.846 8848 TRACE sahara.context Traceback (most recent call last):
2014-11-11 01:18:53.846 8848 TRACE sahara.context File "/opt/stack/venvs/sahara/local/lib/python2.7/site-packages/sahara/context.py", line 128, in _wrapper
2014-11-11 01:18:53.846 8848 TRACE sahara.context raise SubprocessException(result['exception'])
2014-11-11 01:18:53.846 8848 TRACE sahara.context SubprocessException: error: [Errno 113] No route to host
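For anyone who wants to cross-reference failures against compute hosts, this is a rough sketch of pulling the failing thread name and error text out of a sahara.log ERROR line. The regular expression and the `parse_failure` helper are my own assumptions based only on the line format shown above, not something provided by Sahara.

```python
import re

# Pattern assumed from the single ERROR line shown above; other log formats may differ.
ERROR_RE = re.compile(
    r"ERROR sahara\.context \[-\] Thread '(?P<thread>[^']+)' "
    r"fails with exception: '(?P<error>.*)'$"
)


def parse_failure(line):
    """Return (thread_name, error_text) for a matching line, else None."""
    match = ERROR_RE.search(line.rstrip())
    if match is None:
        return None
    return match.group("thread"), match.group("error")


sample = (
    "2014-11-11 01:18:53.846 8848 ERROR sahara.context [-] Thread "
    "'configure-instance-sccloud01-1415668439-worker-008' fails with exception: "
    "'error: [Errno 113] No route to host'"
)

result = parse_failure(sample)
print(result)
```

The thread name ends with the instance name (worker-008 here), which can then be looked up in the nova-compute logs to find the host it landed on.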
Does anyone know whether this is a bug in the scheduler, or am I missing something in the configuration? Any help would be greatly appreciated.
Thanks!