
soft lockup on Newton compute nodes

asked 2017-10-18 22:27:05 -0500

jamesopst

updated 2017-11-10 09:02:02 -0500

hi all,

please help us out with an issue we are seeing on multiple compute nodes running Newton (Ubuntu 16.04.3, kernel 4.4.0). After about 1 hour of running our VOIP test application, the instances become non-responsive and can't be pinged, and neither can the compute nodes. These messages appear on the compute node console screens.

[image: compute node console screenshot]

The first compute node this was seen on was running 2 instances; the second was running only 1 instance. They were using only a portion of the total 40 vCPUs available, and the load was moderate. After a cold boot of these nodes all is well again, until we run our application for about 1 hour.

please let us know what you think, thanks!

not a lot is shown in DEBUG logging of Nova and Neutron on the compute node

these logs are here: logs.zip

===== UPDATE 10/23 ======

We have been trying different things to get better debug output. We disabled rate-limiting in order to get better info in /var/log/messages. For some reason (maybe unrelated) we didn't get the soft lockup during this test, but this time we got openvswitch, br_netfilter, etc. in the call trace in /var/log/messages.
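To compare the two hypervisors quickly, something like the following Python sketch could count the interesting lines in each log. The marker strings and function names are my own illustrative choices (the module names are the ones mentioned above), and it only does plain substring matching, so treat it as a rough triage aid rather than a real trace parser.

```python
from collections import Counter

# Assumption: substrings worth counting. "soft lockup" and "Call Trace" are the
# usual kernel markers; the module names are the ones seen in our traces.
MARKERS = ("soft lockup", "Call Trace", "openvswitch", "br_netfilter")


def count_markers(lines, markers=MARKERS):
    """Count how many lines contain each marker substring."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def scan_file(path, markers=MARKERS):
    """Run count_markers over a log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return count_markers(handle, markers)
```

Running `scan_file("/var/log/messages")` on each node gives a side-by-side count, which makes it easier to see whether one node logged call traces without any soft lockup.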

Please advise in any way! thx!!

basically we are running various types of SIP/RTP test traffic between 2 instances (on different compute nodes). This time instead of one hypervisor getting the errors both hypervisors did, but neither got the soft lockup.

===== UPDATE 11/10 ======

based on some advice from a member of the mailing list we've been looking into kernel and driver versions of our compute nodes

We also have plain non openstack "KVM on Ubuntu" servers for testing.

I looked at driver and kernel differences between these Ubuntu 16 w/ KVM systems and our openstack compute nodes. I found Ubuntu 16 w/ KVM was at kernel version 4.4.0-87 and that the openstack compute nodes were at 4.4.0-93. So I upgraded the Ubuntu 16 w/ KVM to 4.4.0-93 and was able to reproduce this problem (but only on the exact HP hardware that is our openstack compute nodes, and not on other hardware). Next I updated these Ubuntu 16 w/ KVM to 4.4.0-98 and the problem no longer occurred!

I need to upgrade a few openstack compute nodes from 4.4.0-93 to 4.4.0-98 and test.
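For checking which nodes still need the upgrade, here is a minimal Python sketch. It assumes Ubuntu-style release strings such as 4.4.0-93-generic, and the function names and the 4.4.0-98 threshold are illustrative, taken only from the test results above. The comparison is only meaningful within the 4.4.0 series; a newer kernel series comparing higher says nothing about whether it carries the same fix.

```python
import platform
import re

# Assumption: 4.4.0-98 is the first build where the problem stopped in our tests.
FIRST_GOOD = (4, 4, 0, 98)


def parse_kernel_release(release):
    """Turn a release string like '4.4.0-93-generic' into (4, 4, 0, 93)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def is_at_or_after(release, first_good=FIRST_GOOD):
    """True if the release is the same as or newer than first_good."""
    return parse_kernel_release(release) >= first_good


if __name__ == "__main__":
    running = platform.release()
    try:
        verdict = "ok" if is_at_or_after(running) else "needs upgrade"
    except ValueError:
        verdict = "not an Ubuntu-style release string"
    print(running, verdict)
```

Run on each compute node, this prints the running kernel release and whether it is at or past the build that worked for us.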

Does anyone think this kernel change could break openstack?

In the kernel change log I found a fix for a specific HP server in 4.4.0-98 (not the same as our server but somewhat similar)

thanks!

Log snippets from UPDATE 10/23 are below. Full logs are here: node-68.txt node-90.txt

**node-68**

2017-10-20T18:35:10.759230+00:00 node-68 rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="3431" x-info="http://www.rsyslog.com"] exiting on signal 15.
2017-10-20T18:35:10.790611+00:00 node-68 rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="23851" x-info="http://www ...

Comments

changed question with some added debug from a test we ran today

jamesopst (2017-10-23 21:21:32 -0500)

Hi, what image is used for the instances hosting your VOIP application? What flavor is used? Have you created a new customized flavor and used it to launch the instances?

Praveen N (2017-10-25 05:57:37 -0500)

Please share the output of sysctl --all from your compute node. --Regards Praveen

Praveen N (2017-10-25 06:00:35 -0500)

Sorry Praveen, I never noticed your response! We are using a CentOS 7.4 cloud image for our VOIP instances. The flavor is custom: 8 vCPU with 12GB RAM and 60GB vHD. I posted progress in the original question. Does anyone think upgrading a compute node from kernel 4.4.0-93 to 4.4.0-98 would break openstack? tx

jamesopst (2017-11-10 08:57:37 -0500)

1 answer


answered 2017-11-20 13:29:27 -0500

jamesopst

Upgrading a compute node from kernel 4.4.0-93 to 4.4.0-98 fixed the soft lockup issue. This kernel change does not seem to have broken anything in openstack.
