
Why does my cluster see so many errors?

I've converted an object cache from using local disk (8 drives, RAID 10) to Swift (6 servers, 4 drives per server). I'm getting a significant number of Swift errors, to the point where several times a day my clients fail to read an object: they get part of the file, but then Swift hits a 10s ChunkReadTimeout and the transfer is halted.

Looking at the server metrics in OpenTSDB, these errors always coincide with a burst of very high iostat.disk.msec_total across a bunch of disks at once. So my assumption is that something Swift is doing is hammering the disks, but I don't understand why it's putting such a heavy load on them.
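
For context, this is roughly how I'm pulling that metric out of OpenTSDB. A minimal sketch against the 2.x HTTP /api/query endpoint; the hostname is a placeholder for my setup:

    import requests

    # Query OpenTSDB's HTTP API for the per-disk I/O-time counter.
    # "opentsdb.example:4242" is a placeholder hostname.
    query = {
        "start": "1h-ago",
        "queries": [{
            "aggregator": "max",                  # show the worst disk at each point
            "metric": "iostat.disk.msec_total",
            "rate": True,                         # msec_total is a counter, so plot its rate
            "tags": {"host": "*"},                # "*" groups the series by host
        }],
    }
    resp = requests.post("http://opentsdb.example:4242/api/query", json=query)
    resp.raise_for_status()
    for series in resp.json():
        print(series["metric"], series.get("tags"), len(series["dps"]), "points")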

One thing that might make this an unusual Swift use case is that the cache is supposed to be LRU. To approximate that, every time an entry is read from the cache and is due to expire soonish, we push its delete-at time further into the future (see the sketch below). This happens about 15 times per minute.
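
Concretely, the bump is just an object POST that rewrites the X-Delete-At header. A minimal sketch using python-swiftclient; the function name, URL, token, and TTL are placeholders:

    import time
    from swiftclient import client

    def touch_expiry(url, token, container, name, ttl=86400):
        """Push an object's expiry `ttl` seconds into the future by
        rewriting its X-Delete-At header via a plain object POST."""
        delete_at = str(int(time.time()) + ttl)
        client.post_object(url, token, container, name,
                           headers={'X-Delete-At': delete_at})

One caveat I'm aware of: an object POST replaces all of the object's user metadata, so anything else stored there has to be re-sent with the same request. That's fine for us, since we don't keep any other metadata on these objects.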

This Swift cluster contains a single active container (about 4 TB, 500,000 objects), plus another 2 TB / 150,000 objects in a second container that was used for testing and is now just sitting there.

Combining the logs from all the Swift servers, I see:

- DELETE: 20/second (container, object)
- HEAD: 4/second (container, object, account)
- GET: 7/second (container, object, account)
- PUT: 5/second (container, object, account)
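
Those rates come from a rough tally over the merged logs. A minimal sketch of the counting, assuming the request verb appears space-delimited somewhere in each log line and that I pipe in a slice covering a known time window:

    import sys
    from collections import Counter

    VERBS = ("GET", "PUT", "POST", "DELETE", "HEAD")

    counts = Counter()
    for line in sys.stdin:
        for verb in VERBS:
            # Crude match: the verb shows up space-delimited in the log line.
            if f" {verb} " in line:
                counts[verb] += 1
                break

    WINDOW = 60.0  # seconds of logs piped in (assumption: a one-minute slice)
    for verb, n in counts.most_common():
        print(f"{verb}: {n / WINDOW:.1f}/second")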

Why are there so many DELETEs? Is the load I'm putting on the cluster simply more than Swift can handle?