
Juno Sahara Problem Instantiating a 2-node Spark Cluster

asked 2014-12-31 03:01:12 -0500

Nastooh

updated 2015-01-04 12:33:51 -0500

Trying to instantiate a 2-node cluster based on an Ubuntu 13.10 image containing Spark 1.0.0. Both the controller and worker nodes are instantiated properly; however, a timeout appears to occur during cluster startup:

… 
2014-12-31 08:19:30.225 5880 DEBUG sahara.utils.ssh_remote [-] [t-controller-001] _execute_command took 300.0 seconds to complete _log_command /home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/utils/ssh_remote.py:622
2014-12-31 08:19:33.057 5880 DEBUG sahara.openstack.common.periodic_task [-] Running periodic task SaharaPeriodicTasks.update_job_statuses run_periodic_tasks /home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/openstack/common/periodic_task.py:199
2014-12-31 08:19:33.057 5880 DEBUG sahara.service.periodic [-] Updating job statuses update_job_statuses /home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/service/periodic.py:75
2014-12-31 08:19:33.064 5880 DEBUG sahara.openstack.common.loopingcall [-] Dynamic looping call <bound method SaharaPeriodicTasks.run_periodic_tasks of <sahara.service.periodic.SaharaPeriodicTasks object at 0x7f4c32259c50>> sleeping for 45.00 seconds _inner /home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/openstack/common/loopingcall.py:132
2014-12-31 08:19:35.310 5880 ERROR sahara.service.ops [-] Error during operating cluster 't' (reason: Operation timed out after 300 second(s)
Error ID: c9629593-5382-42fa-bc4e-8636c7585163)
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops Traceback (most recent call last):
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops   File "/home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/service/ops.py", line 141, in wrapper
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops     f(cluster_id, *args, **kwds)
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops   File "/home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/service/ops.py", line 235, in _provision_cluster
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops     plugin.start_cluster(cluster)
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops   File "/home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/plugins/spark/plugin.py", line 123, in start_cluster
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops     run.start_spark_master(r, self._spark_home(cluster))
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops   File "/home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/plugins/spark/run_scripts.py", line 52, in start_spark_master
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops     "sbin/start-all.sh"))
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops   File "/home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/utils/ssh_remote.py", line 574, in execute_command
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops     get_stderr, raise_when_error)
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops   File "/home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/utils/ssh_remote.py", line 643, in _run_s
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops     return self._run_with_log(func, timeout, *args, **kwargs)
2014-12-31 08:19:35.310 5880 TRACE sahara.service.ops   File "/home/navesta/OpenStack/sahara-venv/local/lib/python2.7/site-packages/sahara/utils/ssh_remote.py", line 517, in _run_with_log
2014-12-31 08 ...
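For context, the traceback shows the timeout happening while Sahara runs Spark's sbin/start-all.sh on the master over SSH (sahara/plugins/spark/run_scripts.py, start_spark_master). A rough way to reproduce that step by hand is sketched below; the login user ubuntu, the Spark home /opt/spark, and <worker-ip> are assumptions that depend on the image and plugin settings, not values taken from this deployment:

# On the controller instance (user, path, and address below are assumptions)
ssh -v ubuntu@<worker-ip> 'echo ok'       # check whether the SSH session completes or hangs
cd /opt/spark && sudo ./sbin/start-all.sh # the script Sahara invokes when starting the master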

1 answer


answered 2015-01-03 14:44:10 -0500

Nastooh

The problem was due to the controller node being unable to SSH to its worker nodes. This, in turn, was caused by the MTU configured on the VMs, which was 1700. Changing it to 1500 resolved the problem.
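For anyone hitting the same symptom, here is a rough sketch of how the MTU problem can be confirmed and worked around on a running instance; the interface name eth0 and <worker-ip> are placeholders/assumptions, not values from this deployment:

# On the controller VM (interface name and address are assumptions)
ip link show eth0                  # reports the current MTU (1700 in this case)
ping -M do -s 1672 <worker-ip>     # 1700-byte don't-fragment ping; drops here suggest oversized packets are being discarded
ping -M do -s 1472 <worker-ip>     # 1500-byte ping; should get through if 1500 fits the underlying network
sudo ip link set dev eth0 mtu 1500 # temporary fix on the live instance

To make this stick for newly provisioned instances, the MTU advertised by Neutron's DHCP agent usually has to be lowered as well, for example with a dnsmasq option such as dhcp-option-force=26,1500 (DHCP option 26 is the interface MTU) referenced from the dnsmasq_config_file setting in dhcp_agent.ini; the exact knob depends on the Neutron setup in use.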


Comments

Could you share the detailed analysis of how you determined it was an MTU issue between the VMs? Thanks for your contribution!

9lives ( 2015-01-03 18:33:56 -0500 )

See Edit 1 - not enough room.

Nastooh ( 2015-01-04 12:31:30 -0500 )
