Network connection leak in Keystone?

asked 2015-03-20 14:06:50 -0500

SamTheMan

I'm trying to track down something that looks like a bug, but I'm not entirely sure.

I noticed yesterday that many calls to the Keystone API were failing with 401 errors, but retrying them would sometimes succeed. I looked in the Keystone logs and found this error repeated many times:

TRACE keystone.common.wsgi OSError: [Errno 24] Too many open files

Using netstat, I confirmed that Keystone had thousands of open network sockets and couldn't create more. Restarting Keystone cleared the open connections and everything returned to normal. Since then, however, I've been watching the number of open connections held by the Keystone processes -- it keeps climbing and never goes down. The memory footprint of the Keystone processes is growing too.
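For what it's worth, here's roughly the pipeline I'm using to count connections per remote port. The sample data below stands in for real `netstat -tn` output, so the addresses and counts are just illustrative:

```shell
# Count established connections per remote port. Column 5 of netstat -tn
# output is the remote address; the field after the last ':' is the port.
# netstat_sample stands in for real output of: netstat -tn | grep ESTABLISHED
netstat_sample='tcp 0 0 10.0.0.5:52001 10.0.0.9:11211 ESTABLISHED
tcp 0 0 10.0.0.5:35357 10.0.0.7:48122 ESTABLISHED
tcp 0 0 10.0.0.5:52002 10.0.0.9:11211 ESTABLISHED'

echo "$netstat_sample" | awk '{print $5}' | awk -F: '{print $NF}' |
  sort | uniq -c | sort -rn
```

With real netstat output this shows the memcached connections (port 11211) and the inbound admin-API connections at a glance.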

In our setup, most of the API services run on a single host, plus three compute nodes for VMs (CentOS 7, not Devstack). We use Jenkins with the jClouds-plugin to spin up build nodes and delete them after they've been idle for a while. The leaked connections seem to be triggered by the jClouds-plugin's sequence for terminating an instance: it detaches the floating IP, deletes the floating IP (not exactly sure what this means), then deletes the server. I wrote my own Java client using jClouds to reproduce this sequence, and it also leaks connections, though not in the quantities the Jenkins plugin does. When I use Jenkins to create 10 instances, wait for them to come online, then delete them all, Keystone leaks anywhere from 8 to 15 connections (a different amount each time). Creating and deleting instances through the Horizon web interface doesn't do this, even when I add and remove floating IPs to mimic the same sequence.
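To reproduce the teardown outside Jenkins, I believe the plugin's sequence maps onto the nova CLI roughly as below. The server name and IP are placeholders, and the script defaults to a dry run; set DRY_RUN=0 (with OS_* credentials in the environment) to run it against a real cloud:

```shell
# Sketch of the jClouds-plugin teardown sequence via the nova CLI.
# SERVER and FIP are placeholder values; set DRY_RUN=0 to actually run it.
SERVER=build-node-01
FIP=192.0.2.10
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run nova remove-floating-ip "$SERVER" "$FIP"  # 1. detach the floating IP
run nova floating-ip-delete "$FIP"            # 2. release it back to the pool
run nova delete "$SERVER"                     # 3. delete the server
```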

The leaked connections are between OpenStack services, not between Jenkins and OpenStack, so restarting Jenkins does not close them. Looking at the sources and destinations, about half of the connections are outbound from Keystone to our memcached server. The other half are inbound to Keystone on port 35357 and appear to come from other OpenStack services (glance-registry, glance-api, nova-api and cinder-api). To mitigate the issue, I've increased the maximum number of open file descriptors Python can have from 1024 to 10240.
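In case it's relevant, I made that limit persistent with a systemd drop-in (assuming the unit is named openstack-keystone.service, as it is on our CentOS 7 install):

```ini
# /etc/systemd/system/openstack-keystone.service.d/limits.conf
[Service]
LimitNOFILE=10240
```

followed by `systemctl daemon-reload && systemctl restart openstack-keystone`.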

So here's my question: does each service maintain a connection pool for Keystone requests, and are those pools collectively creating this many connections? If so, is raising the file-descriptor limit the right long-term fix, or is this a bug?
