
Error with more than one cinder-volume

asked 2013-05-23 08:00:21 -0500 by rmm0811

I downloaded the Grizzly release of OpenStack and deployed it on two nodes (node1 and dev202). node1 runs cinder-api, cinder-scheduler, cinder-volume, keystone, mysql, and qpid; dev202 runs only cinder-volume.

rpc_backend = cinder.openstack.common.rpc.impl_qpid
qpid_durable_queues = True
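For context, these options go in cinder.conf on both nodes. A minimal fragment is sketched below; only the two rpc options come from my setup above, and qpid_hostname is an assumed extra line (qpidd runs on node1 in my deployment):

```ini
# cinder.conf (fragment) -- assumed layout; only rpc_backend and
# qpid_durable_queues are taken from the question itself.
[DEFAULT]
rpc_backend = cinder.openstack.common.rpc.impl_qpid
qpid_durable_queues = True
qpid_hostname = node1   ; assumption: qpidd runs on node1
```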

Watching the queue status with qpid-tool (list queue active):

252353  07:50:56  -  248315.cinder-scheduler
252354  07:50:56  -  248315.cinder-scheduler.node1
252355  07:50:56  -  248315.cinder-scheduler_fanout_e0ef7071e1b744769df5f06dae595550
252479  07:52:27  -  248315.cinder-volume
252480  07:52:27  -  248315.cinder-volume.node1
252481  07:52:27  -  248315.cinder-volume_fanout_df149604027d49fabd8853f3acb7e997
252549  07:52:49  -  248315.cinder-volume.dev202
252550  07:52:49  -  248315.cinder-volume_fanout_4bab111c0de74d8285b12ba4990d9ec9

Then I stopped the cinder-volume service on node1 (with kill). list queue active:

252353  07:50:56  -  248315.cinder-scheduler
252354  07:50:56  -  248315.cinder-scheduler.node1
252355  07:50:56  -  248315.cinder-scheduler_fanout_e0ef7071e1b744769df5f06dae595550

All of the cinder-volume queues were deleted. After this operation, a newly created volume stays stuck in the "creating" status, even though cinder-volume on dev202 is still active; it just receives no queue messages. However, cinder-volume on dev202 still reports its status to cinder-scheduler.

How can I solve this problem? I think the primary cause is the queue-delete message.


18 answers


answered 2013-06-05 03:01:23 -0500 by rmm0811

I think this is a bug in the qpid Python client, and I have reported it to the Qpid project: https://issues.apache.org/jira/browse/QPID-4903


answered 2013-06-04 08:32:38 -0500 by rmm0811

I have reported the bug: https://bugs.launchpad.net/cinder/+bug/1187298


answered 2013-06-03 02:16:03 -0500 by rmm0811

Thanks for your reply. I am sure the clocks are in sync; I use ntpd to synchronize the UTC time.

+---------------------+---------------------+------------+---------+----+-----------------+------------------+------------------+--------------+----------+-------------------+
| created_at          | updated_at          | deleted_at | deleted | id | host            | binary           | topic            | report_count | disabled | availability_zone |
+---------------------+---------------------+------------+---------+----+-----------------+------------------+------------------+--------------+----------+-------------------+
| 2013-06-03 02:07:21 | 2013-06-03 02:14:33 | NULL       |       0 |  1 | node1           | cinder-scheduler | cinder-scheduler |           43 |        0 | test:dev188       |
| 2013-06-03 02:07:24 | 2013-06-03 02:14:34 | NULL       |       0 |  2 | node1           | cinder-volume    | cinder-volume    |           25 |        0 | test:dev188       |
| 2013-06-03 02:07:39 | 2013-06-03 02:14:28 | NULL       |       0 |  3 | dev202@driver_2 | cinder-volume    | cinder-volume    |           40 |        0 | test:dev202       |
| 2013-06-03 02:07:39 | 2013-06-03 02:14:28 | NULL       |       0 |  4 | dev202@driver_3 | cinder-volume    | cinder-volume    |           40 |        0 | test:dev202       |
| 2013-06-03 02:07:39 | 2013-06-03 02:14:28 | NULL       |       0 |  5 | dev202@driver_1 | cinder-volume    | cinder-volume    |           40 |        0 | test:dev202       |
+---------------------+---------------------+------------+---------+----+-----------------+------------------+------------------+--------------+----------+-------------------+
5 rows in set (0.00 sec)

I think my description of the cause of the problem in #14 is right.


answered 2013-05-31 11:18:47 -0500

Hi renminmin,

Can you check the UTC time in your environment? Maybe the gap between the UTC time on your scheduler node and on your volume node is larger than the default value of 60 seconds. Try raising the service_down_time property in your cinder.conf to 180 or more. Good luck!
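For reference, the liveness test behind service_down_time is essentially a timestamp comparison; the sketch below is the idea only, with an invented helper name, not Cinder's actual code (the real check lives in cinder/utils.py):

```python
from datetime import datetime, timedelta

# Sketch only: a service counts as "up" if its last heartbeat is within
# service_down_time seconds of the scheduler node's current UTC time.
# If the nodes' clocks drift by more than that, a live service looks dead.
DEFAULT_SERVICE_DOWN_TIME = 60  # seconds

def service_is_up(last_report, now, down_time=DEFAULT_SERVICE_DOWN_TIME):
    return (now - last_report) < timedelta(seconds=down_time)

now = datetime(2013, 6, 3, 2, 14, 33)
fresh = service_is_up(datetime(2013, 6, 3, 2, 14, 28), now)    # 5 s gap
stale = service_is_up(datetime(2013, 6, 3, 2, 12, 0), now)     # 153 s gap
relaxed = service_is_up(datetime(2013, 6, 3, 2, 12, 0), now,
                        down_time=180)                         # tolerated
```

With down_time raised to 180, the same 153-second gap no longer marks the service as down, which is the effect the suggestion above is after.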


answered 2013-05-31 02:47:12 -0500 by rmm0811

Adding try ... except at line 386 only fixes the exception raised when stopping cinder-scheduler (or nova-compute, which has a similar implementation). However, all cinder-volume queues are removed when one of several cinder-volume services stops; that is a separate problem.

I used the pdb module to trace how the two different services (cinder-scheduler and cinder-volume) stop. Let me describe the two different stop implementations.

When cinder-scheduler catches the stop signal it calls _launcher.stop() (cinder/service.py line 612). _launcher.stop() kills all the service threads that run service.start and service.wait. After the threads are killed, I found that connection.session.receivers is [], which means all consumers were released. I'm not sure whether the connection was closed or not. I also found that the kill() method of the Service class is never called.

cinder-volume launches two processes: the service runs in a child process (service.py line 227) while the parent process watches the child's status. When the parent catches the stop signal, it forwards it to the child; the child catches the signal and calls service.stop (service.py line 239).

Again using pdb to trace the stop steps, I found that connection.session.receivers is not []; it contains three receivers (cinder-volume, cinder-volume.node1, cinder-volume_fanout). qpid removes the receivers of the session, then MessageCancel and QueueDelete are sent to qpidd. I think the QueueDelete tells qpidd to delete all the cinder-volume queues.
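The teardown sequence traced above can be illustrated with a toy simulation. The Broker and Session classes below are invented for illustration, not the qpid API: a client that issues an explicit QueueDelete for every receiver on close() removes the shared topic queue even while another node still consumes from it, which matches the observed symptom.

```python
# Toy simulation (invented classes, not the qpid API) of the symptom:
# close() sends MessageCancel then QueueDelete for every receiver,
# deleting queues that other consumers still depend on.

class Broker:
    def __init__(self):
        self.queues = {}                  # queue name -> consumer count

    def subscribe(self, queue):
        self.queues[queue] = self.queues.get(queue, 0) + 1

    def message_cancel(self, queue):      # detach one consumer
        self.queues[queue] -= 1

    def queue_delete(self, queue):        # unconditional delete (the bug)
        self.queues.pop(queue, None)

class Session:
    def __init__(self, broker):
        self.broker = broker
        self.receivers = []

    def receiver(self, queue):
        self.broker.subscribe(queue)
        self.receivers.append(queue)

    def close(self):
        # Mirrors the traced stop path: cancel + delete for every
        # receiver, regardless of other attached consumers.
        for q in self.receivers:
            self.broker.message_cancel(q)
            self.broker.queue_delete(q)
        self.receivers = []

broker = Broker()
node1 = Session(broker)
for q in ('cinder-volume', 'cinder-volume.node1', 'cinder-volume_fanout'):
    node1.receiver(q)
dev202 = Session(broker)
dev202.receiver('cinder-volume')          # dev202 shares the topic queue
node1.close()
gone = 'cinder-volume' not in broker.queues   # deleted despite dev202
```

In this model, stopping node1's session removes the shared 'cinder-volume' topic queue even though dev202 still holds a receiver on it, so dev202 stops getting messages.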


answered 2013-05-30 14:32:20 -0500 by rmm0811

I am not sure whether setting session.receivers and session.senders directly to an empty list would solve the problem.


answered 2013-05-30 14:03:57 -0500 by rmm0811

Hi Michael, thanks for your reply. I think it is a bug of qpid as the RPC backend.

The other services (nova-compute, cinder-scheduler, etc.) use an eventlet thread to run the service, and they stop the service with the thread's kill() method. The last step, rpc.cleanup(), does nothing, because the consuming connection ran in the thread that was just killed. I think that step is unnecessary anyway: all the queues are auto-delete, so they are removed once all their receivers disappear.

However, cinder-volume runs the service in a separate process, so stopping the service has to close the connection, and the receivers (consumers) of the connection's session are closed when connection.close() is called. Closing a receiver sends MessageCancel and QueueDelete messages to the broker (the qpid server), so all the cinder-volume queues are removed.
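For contrast, the auto-delete behaviour that the thread-based services rely on would only drop a queue when its last consumer detaches. Again a toy simulation with an invented class, not the qpid API:

```python
# Toy simulation of proper auto-delete semantics: the broker removes a
# queue only when its consumer count reaches zero.

class AutoDeleteBroker:
    def __init__(self):
        self.consumers = {}               # queue name -> consumer count

    def attach(self, queue):
        self.consumers[queue] = self.consumers.get(queue, 0) + 1

    def detach(self, queue):
        self.consumers[queue] -= 1
        if self.consumers[queue] == 0:    # auto-delete fires only here
            del self.consumers[queue]

broker = AutoDeleteBroker()
broker.attach('cinder-volume')            # receiver on node1
broker.attach('cinder-volume')            # receiver on dev202
broker.detach('cinder-volume')            # node1 stops
survives = 'cinder-volume' in broker.consumers   # dev202 keeps the queue
broker.detach('cinder-volume')            # dev202 stops too
deleted = 'cinder-volume' not in broker.consumers
```

Under these semantics, stopping node1 would leave the shared topic queue alive for dev202; the explicit QueueDelete sent by the client defeats this.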

I think that is the reason for the problem that confused me.

But I don't know how to solve it.


answered 2013-05-29 14:33:26 -0500 by hubcap

The ProcessLauncher's job is to spin up one or more child processes and control them. If you were running multi-backend with 4 backends, the process launcher would spawn 4 processes, and when you exited cinder-volume it would close them all. The process launcher code does not do anything with the queues, though; that is in the Service code.


answered 2013-05-29 02:49:14 -0500 by rmm0811

There are differences between /usr/bin/cinder-volume and the others (cinder-scheduler, nova-compute, nova-scheduler, etc.). /usr/bin/cinder-volume starts the service like this:

launcher = service.ProcessLauncher()
server = service.Service.create(binary='cinder-volume')
launcher.launch_server(server)
launcher.wait()

The others do:

server = service.Service.create(binary='nova-compute', topic=CONF.compute_topic, db_allowed=False)
service.serve(server)
service.wait()

Then I changed /usr/bin/cinder-volume to use the service.wait() method, the same as the others, and the problem that confused me disappeared. When the cinder-volume service is stopped, the critical log message appears instead.

The difference between the two methods is whether a child process is forked or not.

Is that the reason for the problem that confused me?

Could anyone help me?
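The fork-based launch path above can be sketched as follows. This is a POSIX-only sketch with invented function names, not the real ProcessLauncher: the parent forks a worker, then forwards SIGTERM to the child so the child's stop path (which in the real service closes the qpid connection and its receivers) gets a chance to run.

```python
import os
import signal
import time

# POSIX-only sketch (invented names, not ProcessLauncher itself) of the
# fork-and-forward pattern: parent watches the child and relays signals.

def run_worker():
    def on_term(signum, frame):
        # In the real service this is where Service.stop() runs and the
        # qpid connection -- with its receivers -- gets closed.
        os._exit(0)
    signal.signal(signal.SIGTERM, on_term)
    while True:                            # the service's main loop
        time.sleep(0.05)

def launch():
    pid = os.fork()
    if pid == 0:                           # child: run the service loop
        run_worker()
    def forward(signum, frame):            # parent: forward the signal
        os.kill(pid, signal.SIGTERM)
    signal.signal(signal.SIGTERM, forward)
    return pid

def demo():
    pid = launch()
    time.sleep(0.3)                        # let the child install its handler
    os.kill(os.getpid(), signal.SIGTERM)   # simulate `kill <parent-pid>`
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

if __name__ == '__main__':
    print(demo())
```

The thread-based services skip the fork entirely, which is why their stop path never reaches an explicit connection.close() with live receivers attached.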


answered 2013-05-28 07:30:23 -0500 by rmm0811

I encountered this problem, but services other than cinder-volume never show it. Then I found that those other services' logs print some critical info: an error at self.connection.close(). So, as an experiment, I deleted the self.connection.close() call (which really should not be removed) and watched the qpid queue information: the problem that confused me with multiple cinder-volumes disappeared. In other words, not all cinder-volume queues are removed, just node1's cinder-volume.node1 and cinder-volume_fanout_bdfd1086647d4bb68859efebf01d77f7.

I think the problem may be a bug.



Stats

Asked: 2013-05-23 08:00:21 -0500

Seen: 567 times

Last updated: Jun 05 '13