High si/sys values via top in instances

Looking for some help to figure out what's going on here. I'm in the process of creating a third-party CI system for our project. Initially I'm trying to set up 6 manually created Jenkins slaves, built with diskimage-builder and configured with Puppet, to run gate jobs; I will scale from there and eventually move to nodepool.

I don't think this is specific to devstack-gate; I suspect any system activity that stresses the instance would do the same. So we can just think of the Jenkins slaves as instances on the compute nodes that see heavy usage.

My setup is as follows:

  1. Physical servers (2): Intel, 1 socket, 12 cores (hyperthreaded, so 24 are seen by the hypervisor), 128 GB RAM.
  2. OpenStack Liberty installed as a 3-node setup: 1 controller, 1 compute/network node (96 GB RAM), and a 2nd compute node (96 GB RAM), as per the Liberty installation guide.
  3. The OpenStack controller and compute node guests were created by hand using libvirt on the respective physical servers, using a provider network with linuxbridge.
  4. The backing store for the Jenkins slaves/OpenStack Liberty is the local file system. Jenkins slaves are configured using Puppet and images are built using diskimage-builder; this is the standard third-party setup described in the CI documentation.
  5. Jenkins slaves are 4 vCPU and 8 GB of RAM, 3 per compute node. CPU/memory are not over-committed, and I have verified that KVM acceleration is being used.
  6. All VM definitions use virtio for network and disk, and virtio-pci is installed. All VMs use host-passthrough as the CPU model in the libvirt XML describing them (a quick sanity check for this is sketched right after this list).
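
For reference, this is roughly how I sanity-check those last two points on a compute node (the domain name jenkins-slave-1 is a placeholder, and kvm-ok comes from Ubuntu's cpu-checker package):

    # Placeholder domain name; substitute the real guest name from "virsh list".
    virsh dumpxml jenkins-slave-1 | grep -E "cpu mode|bus='virtio'|model type='virtio'"
    # Expected lines look roughly like:
    #   <cpu mode='host-passthrough'>
    #   <target dev='vda' bus='virtio'/>
    #   <model type='virtio'/>
    # Confirm KVM hardware acceleration is available and the module is loaded:
    kvm-ok
    lsmod | grep kvm_intel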

Trying to keep it simple as I learn the ropes...

All systems are using kernel 3.19.0-56-generic #62~14.04.1-Ubuntu SMP on Ubuntu 14.04.4 LTS (I've seen the same thing on earlier kernels and earlier 14.04 versions).

My issue is as follows:

If I create a single Jenkins slave on a single compute node, the basic setup time to run devstack-gate (we'll ignore tempest, but a similar thing happens there) is roughly 20 minutes, sometimes less. As I scale the number of Jenkins slaves on the compute node up to 3, the setup time increases dramatically on each instance. The last run I did had it at nearly an hour on each (all 3 running concurrently). Clearly something is wrong, as neither CPU nor RAM is over-committed on either of the compute nodes.

What I'm finding is that the CPUs are getting overwhelmed as I scale up the Jenkins slaves. Top shows the sys/si percentages eating up the majority of the CPU; collectively they sometimes take 70-80% of the CPU time. This drops to what's shown below when the system becomes idle.

When the systems are idle (after one run), this is a typical view of top: mongodb is using 9.3% of the CPU, sys is at 9.8%, and si is at 5.2% of the available CPU (Irix mode off). The compute node and the physical server do not show this sort of load; they are typically at 1-2% for sys and 0 for si when the slaves are idle, though that grows a bit while the slaves are running the devstack-gate script.

top - 19:39:43 up 1 day, 39 min,  1 user,  load average: 0.65, 1.03, 1.59
Tasks: 145 total,   1 running, 144 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  9.8 sy,  0.0 ni, 77.9 id,  0.3 wa,  0.0 hi,  5.2 si,  6.5 st
KiB Mem:   8175872 total,  2620708 used,  5555164 free,   211212 buffers
KiB Swap:        0 total,        0 used,        0 free.  1665764 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1402 mongodb   20   0  382064  48232  10912 S  9.3  0.6 162:25.72 mongod
18436 rabbitmq  20   0 2172776  54528   4072 S  4.2  0.7  20:41.26 beam.smp
20059 root      10 -10   20944    420     48 S  2.9  0.0  26:54.20 monitor
20069 root      10 -10   21452    432     48 S  2.6  0.0  25:45.48 monitor
28786 mysql     20   0 2375444 110308  11216 S  2.0  1.3  15:43.30 mysqld
 3731 jenkins   20   0 4113288 114320  21160 S  1.9  1.4  31:01.35 java
    3 root      20   0       0      0      0 S  1.3  0.0  10:29.24 ksoftirqd/0
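
For comparison, I grab the same summary on a slave, its compute node, and the physical server with something like the following (the hostnames are placeholders):

    # Hostnames below are placeholders for a slave, its compute node, and the physical server.
    for h in jenkins-slave-1 compute1 physical1; do
        echo "== $h =="
        ssh "$h" 'top -bn1 | head -n 5'
    done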

When the devstack-gate script is running, this is typical. Again, the compute node was at 0.6 for sy and 0.0 for si when I copied this, and similarly for the physical server.

top - 19:45:02 up 1 day, 44 min,  1 user,  load average: 14.67, 12.20, 11.20
Tasks: 217 total,   5 running, 212 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.9 us, 43.5 sy,  0.0 ni,  5.2 id,  0.0 wa,  0.0 hi, 32.0 si,  0.4 st
KiB Mem:   8175872 total,  4970836 used,  3205036 free,   217968 buffers
KiB Swap:        0 total,        0 used,        0 free.  1604240 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  687 jenkins   20   0   78420  21544   3296 R  45.7  0.3   4:17.87 ansible
  676 jenkins   20   0   78556  25556   7116 S  40.1  0.3   4:19.29 ansible
 1368 mongodb   20   0  382064  48508  10896 S  32.2  0.6 207:31.76 mongod
 5060 root      10 -10   20944    420     48 S  14.1  0.0  12:04.99 monitor

Digging deeper with the various perf-related tools (I used vmstat, looked at /proc/interrupts and mpstat, and found nothing in the logs), the best clue I can find is that when idle, mongo is doing this (captured with strace), which is driving up the sy number. I have yet to figure out what may be driving the si number.

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    1.697270        5813       292           select
------ ----------- ----------- --------- --------- ----------------
100.00    1.697270                   292           total

and when the job is running, ansible is doing this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.72    4.536098        2925      1551           select
  0.28    0.012786           8      1551           poll
------ ----------- ----------- --------- --------- ----------------
100.00    4.548884                  3102           total
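
For reference, the summaries above come from strace's -c summary mode; the following is roughly how they were collected, plus a couple of extra checks I'm trying for the si side (the pgrep lookup is just illustrative, and the /proc/softirqs watch is in addition to /proc/interrupts):

    # Attach to a busy process and print a syscall summary (Ctrl-C after ~30 seconds).
    sudo strace -c -p "$(pgrep -o mongod)"
    # Watch which softirq classes are climbing inside the guest.
    watch -n1 -d cat /proc/softirqs
    # Per-CPU breakdown of %soft and %sys, five samples at two-second intervals.
    mpstat -P ALL 2 5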

I'm at a loss as to how to figure this out, since this looks like a basic scaling issue. Any suggestions on what to check or what to look at? Has anyone seen this before? It appears to be something in the definition of the Jenkins slave VM, as the compute node and physical server never seem to be overloaded. Either I'm missing something basic here, or there is a bug somewhere.

I moved to the vivid kernel based on information I found describing something similar with mongo (which wouldn't explain ansible); a fix was picked up in 3.19.0.45.

Bob H