Why does my cluster see so many errors?

I've converted an object cache from using local disk (8 drives, RAID10) to swift (6 servers, 4 drives per server). I'm getting a significant number of swift errors, to the point where several times a day my clients fail to read an object. They'll get part of the file, but then swift hits a 10s ChunkReadTimeout and the transfer is halted.

Looking at the server metrics in OpenTSDB, these errors always occur when a bunch of disks all show very high iostat.disk.msec_total. So my assumption is that something swift is doing is hammering the disks, but I don't understand why swift puts such a heavy load on them.

One possibly unusual aspect of my swift use case is that this cache is supposed to be LRU. To make that happen, every time there's a read from the cache and the entry will expire soonish, we update the delete-at time. This happens about 15 times per minute.

This swift cluster only contains a single active container (about 4TB, 500,000 objects) with another 2TB, 150,000 objects in another container that was used for testing and is now just sitting there.

Combining the logs from all the swift servers, I see:

- DELETE: 20/second (container, object)
- HEAD: 4/second (container, object, account)
- GET: 7/second (container, object, account)
- PUT: 5/second (container, object, account)

Why are there so many deletes? Is the load I'm putting on the cluster just more than swift can handle?



FWIW, "zero_byte_files_per_second" controls the zero-byte-file audit rate.


I am using POST to update the delete-at time. I do have object_post_as_copy set to false. (At least it's there in the config file; I didn't see a way to confirm that it's actually taking effect.) From your reaction it sounds like a weird thing to do. What is the usual solution to a problem like this? Store delete-at in some other db, and write my own object expirer?

I have read the general service tuning section, but it doesn't give much guidance. I'm running the object-server with 8 workers. Presumably that's enough to avoid problems caused by one object server being temporarily hung. The CPU on each server is only 50% used. Interestingly, it's the container server and replicator that are consistently at the top of top's output. Fairly often I see a container-replicator using 100% of one core. Is there any downside to having too many workers? I'm thinking of just giving each server 16 workers, with a concurrency of 1, and letting Linux sort it out.

Does the object expirer log a delete also when an old copy of metadata is deleted? If so, that would explain the large number of deletes, and those files should be tiny so deleting them shouldn't add significant load to anything.


Oh wow, are you using a POST to update the delete-at time? Did you change object_post_as_copy to false?

http://docs.openstack.org/developer/swift/deployment_guide.html#proxy-server-configuration
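For reference, that setting lives in the proxy server config; a minimal fragment, assuming the stock [app:proxy-server] section:

```ini
[app:proxy-server]
# fast-POST: update object metadata in place instead of
# copying the whole object on every POST
object_post_as_copy = false
```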

http://docs.openstack.org/developer/swift/deployment_guide.html#general-service-tuning

I'm sure the expirer logs its deletes, and there's some amplification of deletes when the .expiring_objects container entries are updated. More info:

http://docs.openstack.org/developer/swift/overview_expiring_objects.html?highlight=expire


I've not heard of anyone attempting to run Swift as an LRU using the object-expirer feature.

I'm not sure how well this will work, and I don't want to discourage you!

Using fast-POST is definitely needed to even attempt something like this, but it's not the default behavior anymore, so I wanted to make sure you'd gotten at least that far. You can confirm it's working by uploading a really big object (2-5GB) and making a POST: if the response comes back in less than 5 seconds, you've validated that fast-POST is correctly configured.

The .expiring_objects containers are being used like a queue. Every POST that updates X-Delete-At deletes the old container entry for the delete marker and adds a new one. From your description it sounds like this is happening for every request, which probably explains why your container servers are so busy: there are probably a lot of records in those DBs. The expirer, I think, makes some unneeded backend DELETE object requests when cleaning out the delete markers; I believe there was a bug report about that once, but I'm not sure.
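To make the queue mechanics concrete, here's a small sketch of the naming scheme (the round-down-by-divisor arithmetic and the marker format follow what Swift's proxy does; exact details can vary between versions, and the account/container/object names below are made up):

```python
def expiring_container(delete_at, divisor=86400):
    """Delete markers are grouped into .expiring_objects containers
    named by rounding X-Delete-At down to a multiple of the divisor."""
    return str(delete_at // divisor * divisor)

def delete_marker(delete_at, account, container, obj):
    # The marker's object name encodes both the expiry time and the
    # target object, so the expirer knows what to delete and when.
    return '%s-%s/%s/%s' % (delete_at, account, container, obj)

# Bumping X-Delete-At on a read means deleting the old marker row and
# inserting a new one:
old = delete_marker(1400000000, 'AUTH_test', 'cache', 'obj1')
new = delete_marker(1400003600, 'AUTH_test', 'cache', 'obj1')

# With the default divisor (86400 s = 1 day), both markers land in the
# same container, so one container DB absorbs every update:
assert expiring_container(1400000000) == expiring_container(1400003600)

# A smaller divisor (e.g. 3600) spreads markers across more containers,
# and therefore across more DBs:
assert expiring_container(1400000000, 3600) != expiring_container(1400003600, 3600)
```

That last point is why turning the divisor down can help: the same marker churn gets sharded over many more container DBs.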

More workers is probably OK; less background-process concurrency might lower the load but may cost you in the consistency window. On the object servers specifically you can also try threadpools.

But the expiring objects containers are probably causing the biggest load; I'd imagine there's quite a bit of contention in those DBs. Look for "Lock" or "Timeout" in the container server and replicator logs. Are the container servers on SSDs? Are they on the same servers as the objects?

I like to get a single proxy process going without auth in the pipeline so I can poke at the system account directly with 'curl -I http://localhost:9000/v1/.expiring_objects'. A few HEAD requests could be very informative. You can also find the DBs with 'swift-get-nodes /etc/swift/account.ring.gz .expiring_objects', then run 'sqlite3 -line <HASH>.db "select * from account_stat"' to see the stats (uncommitted updates in the .pending files are a bit more trouble to inspect).
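If you'd rather poke at those DBs from a script, a minimal sketch using Python's sqlite3 module (account_stat is a real table in Swift account DBs, but its columns vary between versions, so this just returns whatever is there; the path in the usage note is hypothetical):

```python
import sqlite3

def account_stat(db_path):
    """Read the account_stat row out of a Swift account DB file,
    e.g. one located with swift-get-nodes."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # access columns by name
    try:
        row = conn.execute('SELECT * FROM account_stat').fetchone()
        return dict(row) if row else None
    finally:
        conn.close()
```

Usage would look something like account_stat('/srv/node/sda/accounts/.../HASH.db'), and the object_count there tells you roughly how many delete markers are queued up.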

One idea I had was you may want to turn your "expiring_objects_container_divisor" down to something crazy like 3600 or even 600!?


I've confirmed that a POST on a 2GB file takes about 0.25s (using the swift CLI client, so including time to authenticate). So it looks like fast-POST is enabled.

I don't update delete-at with every request (anymore). There is less than one of these per second (to the proxy server). For reference, the proxy servers see a combined 0.5 PUTs/second, 0.5 POSTs/second, 2 HEADs/second, 4 GETs/second. Which really doesn't seem that high.

Increasing the number of workers has significantly increased disk busyness, and with it the number of errors swift reports.

I did discover we were running an object expirer on each of the 6 servers (without configuring them to each expire their own part). I turned off 5 of them when I increased the workers, but the net result is still more disk busyness and errors. Just now I turned off the last object expirer, just to see if that improves things. (So far it doesn't look like it, but let's give it some time.)

All data is using the same storage, etc. I'm not really against distributing things smarter, but I'm still baffled that it would be required. This exact functionality had no problems with an 8-disk RAID10. Why does my 24-disk swift cluster have so much trouble keeping up?

I'll dig into some more of your suggestions today. I really appreciate all the help you're giving me.


I ran some more experiments. Specifically:

1. Running without any object expirers: no difference.
2. Disabling the POST updates in my application: no difference.

It looks like the PUT/GET/HEAD load is just too high, but the numbers are so low that I have a hard time believing that. Average object size is 26MB. Is that causing my trouble? Some objects (1 in 1000 or so) are 2GB in size.

Is there any place where I can find a list of swift cluster configurations along with the workload they support?


Turn off the container-replicators too, or in fact all the background consistency processes, whittle down to the bone and turn stuff back on one at a time. I think the replication checks on those expiring object containers may be the culprit.

I've never seen a "this hardware gets you these numbers" data set published anywhere. Last summit there was a lot of sessions that talked about HOW to benchmark swift (ssbench, and cosbench from intel, and some other benchmark utility from HP/Mark Seger). I'm hoping this year we'll see more people reporting some numbers out of their configurations so the community can begin to establish some baseline expectations. The swift repo ships with swift-bench, so that's fairly ubiquitous - you may try to get a simple scenario together that exhibits the ChunkReadTimeouts you're seeing.


OK, I spent last week doing some experiments. Turning off all background consistency processes was a good suggestion.

There are two processes with enough impact to cause periodic failures:

1. object-auditor. Even after throttling it back a lot (1MB/s, 1 file/s) it still causes about one problem every 18 hours. Looking at the IO graph, it still periodically hits the disks hard. I suspect this is the ZBF audit, which doesn't appear to be affected by the files_per_second configuration.
2. object-replicator. With everything except the object-auditor running, I see about one problem every 2 hours. I don't think there's an option to throttle it.
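For reference, the auditor throttles being discussed live under [object-auditor] in object-server.conf; the values below are just the aggressive ones tried in this thread, not recommendations:

```ini
[object-auditor]
# regular audit pass, heavily throttled (1 file/s, ~1MB/s)
files_per_second = 1
bytes_per_second = 1048576
# the ZBF (zero-byte-file) pass has its own rate knob
zero_byte_files_per_second = 1
```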

Both these processes seem to hit each disk in succession, i.e. they'll hit sda for a while, then sdb, then sdc, and so on. Their impact would be a lot less if they switched disks more often, though their cache hits on the file metadata would also drop. On the other hand, overall audit/replicate throughput might be higher if they spread their load over all the disks, because presumably they're spending most of their time waiting on IO.


I guess I'm not surprised that the auditors cause an undesirable amount of IO load without tuning. But in a healthy cluster I wouldn't expect the replicators to be doing enough work to pin the object servers into a timeout.

What io scheduler are you using (i.e. /sys/block/*/queue/scheduler)? I'd recommend probably... deadline?

There are lots of ways to do more IO shaping: isolating the sqlite/container workload to separate drives/devices/servers, more RAM in the storage nodes, tuning vfs_cache_pressure or the IO scheduler, and in the end maybe we need to add a tunable for replication (something based on job/suffix in the main replicate loop?).


Thanks for the zero_byte_files_per_second pointer. I missed that. Setting it to 1 (which effectively turns it off), I still get 4 client-visible errors per day, where a GET fails.

We were just using the cfq scheduler. I switched it to the deadline scheduler. I'll report back what happens.

I'm still a bit concerned that so much tuning is required for what should be a light load on the overall cluster.
