How to reduce downtime for VM live migrations?

asked 2018-12-21 01:20:15 -0500

voidking

updated 2018-12-21 04:22:52 -0500

I set up an OpenStack system with four nodes using kolla-ansible: one controller node, one network node, and two compute nodes. The two compute nodes share NFS storage. I created an instance named ubuntu0 from an Ubuntu 16 cloud image. Then I live-migrate the instance between the two compute nodes. During the migration, I ping the instance:

sudo ping 10.0.2.159 -i 0.01 >> ping.log

When the migration finishes, I calculate the downtime from the total time and the packet loss. Theoretically, the downtime should be less than 1 second. However, my downtime is always more than 4 seconds. This result confuses me. I hope someone with experience in this area can give me some advice. Thank you very much!
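For reference, the calculation described above can be sketched in Python roughly as follows. This is only a minimal sketch: it assumes ping.log ends with the usual iputils summary line ("N packets transmitted, M received, ...") and that the send interval is the 0.01 s given by -i. The function name downtime_from_summary is made up for this illustration.

```python
import re

# Assumed format of the iputils ping summary line, for example:
# "10000 packets transmitted, 9600 received, 4% packet loss, time 100123ms"
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def downtime_from_summary(log_text: str, interval_s: float = 0.01) -> float:
    """Estimate downtime as (lost packets) x (send interval)."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        raise ValueError("no ping summary line found")
    transmitted = int(match.group(1))
    received = int(match.group(2))
    return (transmitted - received) * interval_s
```

With a 10 ms interval, 400 lost packets correspond to roughly 4 seconds, and the estimate cannot be finer than one interval.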

@Bernd Bausch Thanks for your answer. I thought about your point carefully, and it sounds reasonable. However, others have done the same experiment, for example in the article https://blog.zhaw.ch/icclab/an-analys... . If the migration requires message queue and API communication, how did they get a downtime of less than 1 s? They calculated the downtime with ping, and I used the same method. Maybe you are right, but I don't know how to eliminate the API communication time. Any advice? To answer your questions: I reviewed ping.log and found that the lost icmp_seq values are concentrated, i.e. the packets were lost in one continuous period. When the instance is not migrating, the packet loss is zero.
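One way to check whether the lost icmp_seq values are concentrated is to scan ping.log for jumps in the sequence number. The sketch below is an illustration only: it assumes each reply line contains "icmp_seq=N" as printed by iputils ping, that replies arrive in order, and that the sequence number wraps at 65536. The name find_gaps is hypothetical.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def find_gaps(lines, interval_s=0.01, seq_modulus=65536):
    """Return (last_seen_seq, next_seen_seq, missing_count, seconds) per gap.

    Assumes replies are logged in order; a reordered reply would show up
    as a bogus, very large gap.
    """
    gaps = []
    prev = None
    for line in lines:
        match = SEQ_RE.search(line)
        if match is None:
            continue  # skip header, summary, and error lines
        seq = int(match.group(1))
        if prev is not None:
            # The modulo handles the wrap-around of the sequence counter.
            missing = (seq - prev) % seq_modulus - 1
            if missing > 0:
                gaps.append((prev, seq, missing, missing * interval_s))
        prev = seq
    return gaps
```

Calling find_gaps(open("ping.log")) returns one tuple per gap. A single large gap means the loss happened in one continuous period, while many small gaps would point to loss spread over the whole migration.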


Comments

My guess is that it takes some time to recreate network connections, which requires message queue and API communication. Do you have an idea where the time is lost? I.e. are the lost packets concentrated or evenly spread out?

BTW, how many packets do you lose when the instance is not migrating?

Bernd Bausch (2018-12-21 02:07:42 -0500)

Thanks for your answer. Comments have a length limit, so I replied in the answer section.

voidking (2018-12-21 04:08:40 -0500)

Sorry for deleting your answer by mistake, while attempting to move the text to the question. Foolish me.

You mention sub-second test results at https://blog.zhaw.ch/icclab/an-analys..., and the fact that packet loss is clustered in one place.

Bernd Bausch (2018-12-21 04:14:01 -0500)

I can’t say much except that your cloud is different from the test cloud in Zürich. How to improve API and message queue performance? Network capacity, and don’t forget that API and MQ servers are processes that compete for resources on the controllers.

Bernd Bausch (2018-12-21 04:17:07 -0500)

Network (and storage) reconnection is done on the compute nodes, so their performance counts as well. To get to the bottom of this, however, you would have to add measurement points to the code. Since the packet loss seems to happen in a single phase of the process, that should be quite feasible.

Bernd Bausch (2018-12-21 04:21:15 -0500)
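As an illustration of what a "measurement point" could look like, here is a generic Python timing helper. It is a sketch only and not part of Nova or any OpenStack API: the helper name timed, the logger name, and the phase label are all hypothetical, and where such a wrapper would need to go in the migration code is not covered in this thread.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen only for this illustration.
LOG = logging.getLogger("migration_timing")


@contextmanager
def timed(phase, results=None):
    """Log how long the wrapped block takes, using a monotonic clock.

    If a dict is passed as `results`, the elapsed seconds are also stored
    there under the phase name.
    """
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        LOG.info("phase %s took %.3f s", phase, elapsed)
        if results is not None:
            results[phase] = elapsed
```

It would be used by wrapping a suspected phase, for example "with timed('network_reconnect'): ...", where the phase label is again just an example. Comparing the logged durations against the gap seen in ping.log could then show which phase accounts for the lost seconds.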