# Why does my cluster see so many errors?

I've converted an object cache from using local disk (8 drives, RAID10) to swift (6 servers, 4 drives per server). I'm getting a significant number of swift errors, to the point where several times a day my clients fail to read an object. They'll get part of the file, but then swift hits a 10s ChunkReadTimeout and the transfer is halted.

Looking at the server metrics in OpenTSDB, these errors always occur when a bunch of disks all show very high iostat.disk.msec_total. So my assumption is that something is happening on swift that's hammering the disks. But I don't understand why swift is putting such a heavy load on them.

One possibly unusual aspect of our swift use case is that this cache is supposed to be LRU. To make that happen, every time there's a read from the cache of an entry that will expire soonish, we update the entry's delete-at time. This happens about 15 times per minute.
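For context, the delete-at refresh looks roughly like this; X-Delete-At takes a Unix timestamp, and the proxy URL, token variable, container, and object name below are all made up for illustration:

```
# on a cache hit, POST a new X-Delete-At so the entry survives another week
# (endpoint, container, and object names are illustrative)
curl -X POST \
  -H "X-Auth-Token: $TOKEN" \
  -H "X-Delete-At: $(date -d '+7 days' +%s)" \
  "http://swift-proxy:8080/v1/AUTH_software/rpmcache/some-object.data"
```

Each such POST re-queues the object under a new entry in the hidden .expiring_objects account, which is presumably part of why we see so much expirer traffic.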

This swift cluster only contains a single active container (about 4TB, 500,000 objects) with another 2TB, 150,000 objects in another container that was used for testing and is now just sitting there.

Combining the logs from all the swift servers, I see:

- DELETE: 20/second (container, object)
- HEAD: 4/second (container, object, account)
- GET: 7/second (container, object, account)
- PUT: 5/second (container, object, account)

Why are there so many deletes? Is the load I'm putting on the cluster just more than swift can handle?



My client sees these errors as the connection being closed before all data is received.

I don't understand what we would find out by adjusting node_timeout.

There are all kinds of errors. :-P I've pasted some recent ones below. The logs I pasted are from all servers; immediately after the timestamp, each line shows the hostname of the server it came from.

I've doubled the number of object workers from 8 to 16. I'll see if that reduces errors noticeably.

Here are some errors:

```
swift105: Oct 17 14:24:32 swift105 object-server ERROR container update failed with 172.17.10.74:6001/d4 (saving for async update later): Timeout (3s) (txn: txe1754390285a49c38cbd128aecaa5fe7)
swift105: Oct 17 14:24:32 swift105 object-server ERROR container update failed with 172.17.10.74:6001/d4 (saving for async update later): Timeout (3s) (txn: txf6c8946ffe464ee082891185f8d9d189)
swift105: Oct 17 14:25:03 swift105 object-server ERROR container update failed with 172.17.10.72:6001/d3 (saving for async update later): ConnectionTimeout (0.5s) (txn: txa90447e0ab4044ea95a9c0d03a0c53aa)
swift105: Oct 17 14:25:40 swift105 object-server ERROR container update failed with 172.17.10.73:6001/d4 (saving for async update later): Timeout (3s) (txn: tx10e193ec5c9c44a08a1918580b789d57)
swift105: Oct 17 14:25:40 swift105 object-server ERROR container update failed with 172.17.10.73:6001/d4 (saving for async update later): Timeout (3s) (txn: tx2695ba920505408eb91dd23faaeee349)
swift105: Oct 17 14:25:40 swift105 object-server ERROR container update failed with 172.17.10.73:6001/d4 (saving for async update later): Timeout (3s) (txn: tx51fdefef4dc744d99461d579e37a8d43)
swift105: Oct 17 14:25:40 swift105 object-server ERROR container update failed with 172.17.10.73:6001/d4 (saving for async update later): Timeout (3s) (txn: txd50e07b5f449486a884c24bf7852cf26)
swift105: Oct 17 14:25:40 swift105 object-server ERROR container update failed with 172.17.10.73:6001/d4 (saving for async update later): Timeout (3s) (txn: txed7ab47336ef4900b3bd4eb98598917f)
swift105: Oct 17 14:25:40 swift105 object-server ERROR container update failed with 172.17.10.73:6001/d4 (saving for async update later): Timeout (3s) (txn: txfed5461a2d644faa9b4c3aac5a3ddce5)
swift105: Oct 17 14:26:35 swift105 object-server ERROR container update failed with 172.17.10.74:6001/d2 (saving for async update later): ConnectionTimeout (0.5s) (txn: tx3731beb2dedf46eabe748bae78e30c85)
swift105: Oct 17 14:26:35 swift105 swift ERROR with Object server 172.17.10.73:6000/d2 re: Trying to DELETE /.expiring_objects/1381536000/1381541992-AUTH_software/rpmcache/afe96635c9f7f8372525a970511a64ba5eb0702f14a4664dc240045c7308a260.data: ConnectionTimeout (0.5s) (txn: txa7a32e4bafd44d83b23451a14c05bbd1)
swift105: Oct 17 14:26:35 swift105 swift ERROR with Object server 172.17.10.75:6000/d4 re: Trying to DELETE /.expiring_objects/1381536000/1381541992-AUTH_software/rpmcache/afe96635c9f7f8372525a970511a64ba5eb0702f14a4664dc240045c7308a260.data: ConnectionTimeout (0.5s) (txn: txa7a32e4bafd44d83b23451a14c05bbd1)
swift105: Oct 17 14:32:53 swift105 object-server ERROR container update failed with 172.17.10.72:6001/d2 (saving for async update later): ConnectionTimeout (0.5s) (txn: tx3a05ad050e124af4a56fa74b56b1e5df)
swift105: Oct 17 14:32:53 swift105 object-server ERROR container update failed with 172.17.10.72:6001/d2 (saving for async update later): ConnectionTimeout (0.5s) (txn: tx8383669ad0ba44e3a155b1664b03840c)
swift105: Oct 17 ...
```


Hi Tim,

For the two error logs that you posted, the first appears to be a 10MB file, of which 6,553,600 bytes (100 64KB chunks) had been transferred when the proxy server declared the transfer finished (logging it as a success), because the object server did not respond with the 101st chunk. The client should see that the connection was closed prematurely with missing data, or might time out waiting for more data if connection keep-alive was used.
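The chunk arithmetic behind that reading of the logs is a quick check; the 64 KiB chunk size is the proxy's object_chunk_size default:

```python
# the proxy streams object GETs in 64 KiB chunks (object_chunk_size);
# the proxy log showed 6,553,600 of 10,270,720 bytes delivered
chunk_size = 64 * 1024
bytes_delivered = 6_553_600

print(bytes_delivered // chunk_size)  # → 100 complete chunks sent
print(bytes_delivered % chunk_size)   # → 0, i.e. no partial chunk
```

The transfer stopped exactly on a chunk boundary, consistent with the proxy timing out waiting for the next full chunk rather than the connection dying mid-chunk.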

The second of the two error logs is similar to the first, except that it was for a 4GB file, which had only 1 chunk transferred before the object server timed out.

Raising your node_timeout value in the proxy server configuration might help us determine the scope of the problems on the object server.
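For reference, node_timeout lives in the proxy app section of proxy-server.conf; a sketch is below (the value shown is illustrative, not a recommendation, and if I recall correctly the proxy default is 10s):

```
[app:proxy-server]
use = egg:swift#proxy
# illustrative: give slow object servers more time to respond
# before the proxy gives up on a backend request
node_timeout = 30
```

If the errors disappear at a higher timeout, the object servers are slow but alive; if they persist, something is blocking them outright.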

You have another error in the big log you posted, which seems to be reporting that the proxy server cannot connect to an object server. Are there other errors on the system that might correlate with the object servers? Are these logs all taken from one system, or from all your storage nodes?

I would suggest raising the number of object workers, doubling or tripling it, to see if that helps; even if it doesn't, the result should tell us more about the behavior we are dealing with.


The specific error that I'm talking about is when my client application does a GET, but the connection gets closed before all bytes are transferred. This looks as follows in the swift logs:

```
Oct 14 02:51:13 swift104 object-server 172.17.10.73 - - [14/Oct/2013:09:51:13 +0000] "GET /d4/201683/AUTH_software/buildcache/76b67f599086944d4ee4eb14b9391b9d2023aa9f.data" 200 10270720 "-" "txe41b50d1e237489a862b1e7bd59daf8c" "-" 0.0005
Oct 14 02:51:30 swift102 proxy-server 172.22.34.14 172.22.34.14 14/Oct/2013/09/51/30 GET /v1/AUTH_software/buildcache/76b67f599086944d4ee4eb14b9391b9d2023aa9f.data HTTP/1.0 200 - - software%2CAUTH_tk62d08be9f72543fba543d5dbe673c1e7 - 6553600 - txe41b50d1e237489a862b1e7bd59daf8c - 16.5518 -
Oct 14 02:51:30 swift102 proxy-server ERROR with Object server 172.17.10.75:6000/d4 re: Trying to read during GET: ChunkReadTimeout (10s) (txn: txe41b50d1e237489a862b1e7bd59daf8c) (client_ip: 172.22.34.14)
```

```
Oct 14 06:11:22 swift102 object-server 172.17.10.74 - - [14/Oct/2013:13:11:22 +0000] "GET /d4/206078/AUTH_software/buildcache/0bb1b59f6f177ceace00c0bcdf56eca221b06cb5.data" 200 4783308800 "-" "tx1d86b18d378f4b94a2caadaccb70fadd" "-" 0.0584
Oct 14 06:11:32 swift103 proxy-server 172.22.34.14 172.22.34.14 14/Oct/2013/13/11/32 GET /v1/AUTH_software/buildcache/0bb1b59f6f177ceace00c0bcdf56eca221b06cb5.data HTTP/1.0 200 - - software%2CAUTH_tk62d08be9f72543fba543d5dbe673c1e7 - 65536 - tx1d86b18d378f4b94a2caadaccb70fadd - 10.5687 -
Oct 14 06:11:32 swift103 proxy-server ERROR with Object server 172.17.10.73:6000/d4 re: Trying to read during GET: ChunkReadTimeout (10s) (txn: tx1d86b18d378f4b94a2caadaccb70fadd) (client_ip: 172.22.34.14)
```

I don't remember playing with ThreadPools. Given that my problem is disk I/O bandwidth, I don't think it makes a difference how those accesses get in the queue (except that it would help to not audit each disk in turn). We're running swift 1.4.0, with the following configuration (on swift100):

```
# account-server.conf
[DEFAULT]
workers = 4
log_statsd_host = 172.17.23.35
log_statsd_port = 8125
log_statsd_default_sample_rate = 0.5
[pipeline:main]
pipeline = recon account-server
[app:account-server]
use = egg:swift#account
[account-replicator]
[account-auditor]
[account-reaper]
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift

# container-server.conf
[DEFAULT]
workers = 8
[pipeline:main]
pipeline = recon container-server
[app:container-server]
use = egg:swift#container
[container-replicator]
[container-updater]
[container-auditor]
[container-sync]
[filter:recon]
use = egg:swift#recon
recon_cache_path = /var/cache/swift

# object-expirer.conf
[DEFAULT]
swift_dir = /etc/swift
user = swift
log_statsd_host = 172.17.23.35
log_statsd_port = 8125
log_statsd_default_sample_rate = 1
[object-expirer]
interval = 300
auto_create_account_prefix = .
report_interval = 300
concurrency = 1
processes = 6
process = 0
[pipeline:main]
pipeline = catch_errors cache proxy-server
[app:proxy-server]
use = egg:swift#proxy
[filter:cache]
use = egg:swift#memcache
[filter:catch_errors]
use = egg ...
```


Please pardon all these questions if you have already answered them and I just couldn't dig the answers out of the history well enough.

Is it the same error as what you first reported? 10s ChunkReadTimeout? Can you post them?

Can I assume that each node in this swift configuration is running proxy, account, container, and object servers? I think the object server worker counts are at 8, is that still the case? Did you try to set max_clients lower, raising the number of workers? Did you try ThreadPools (and for that matter, what version of Swift are you running)?

Can you post the output of "xfs_info" on a disk? We have not seen much benefit, if any, with inode sizes of 1,024. Additionally, for large disks, the default allocation group count that XFS uses can be kind of small (by default 4 allocation groups per 1 TB, so I'm hoping it is at 16 or higher for your disks). And depending on how big the partitions on a disk grow, you might benefit from a larger naming (directory block) size, 8192 or higher.
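For reference, the relevant knobs can be checked and set like this; the device and mount point names, and the values shown, are illustrative rather than recommendations:

```
# inspect an existing filesystem: look for isize=, agcount=, and the
# directory block size on the "naming" line
sudo xfs_info /srv/node/d1

# example format with 16 allocation groups and 8 KiB directory blocks
# (destroys data on the device; device name is made up)
sudo mkfs.xfs -i size=1024 -d agcount=16 -n size=8192 /dev/sdb1
```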

Are you using 1GbE or 10GbE networking? And do you use a single NIC for all traffic, frontend and backend, or do you separate your client traffic from your inter-swift-node traffic?

I don't think memcache is a factor here, but I'm curious where your memcache servers are located.

Finally, can you post a one minute capture of just your proxy logs, scrubbed?


So I put some more load on the cluster, and am back to one failure every two hours.

What's the general recommendation on retries? These failures happen infrequently enough that I bet I can just retry and it'll work. The extra delay is not a big problem for my application. Is this a good idea?
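Retrying on failure can be sketched as a small wrapper with exponential backoff; the fetch callable and parameter values below are hypothetical, not from my actual client:

```python
import time


def get_with_retries(fetch, attempts=3, backoff=1.0):
    """Call fetch() until it succeeds, retrying on any exception.

    fetch is any zero-argument callable (e.g. a swift GET wrapped in a
    closure). The delay doubles between attempts; in a real client you
    would catch only the specific timeout/IO errors your library raises.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_exc = exc
            if attempt < attempts - 1:
                time.sleep(backoff * (2 ** attempt))
    raise last_exc
```

Since the failures are rare and transient, a couple of retries with a short backoff would usually succeed on a different replica without adding noticeable load.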


Right. We went with the dumb approach, and run every swift process (including the proxy) on every server. The only exception is the object-expirer, which we only run on a single server. Each of 6 servers has a single E31270 (3.4GHz, 8 cores including hyperthreading), 16GB RAM, 4 x 4TB disk. The servers never swap. We're using isize=1024 with a patched Linux kernel 3.4.43 x86_64. (I believe all our patches are network related and shouldn't affect disk I/O.)


Thanks for providing so much feedback; this sort of open, down-in-the-weeds tuning detail from a variety of deployments is exactly what's needed to arrive at true "best practices" rather than a collection of cargo-cult information silos.

The storage nodes are combined account/container/object yes? What is your part power and how much RAM and how many devices do you have in your storage nodes? Do you see any signs of swap usage when the load is high? You might try cranking up vfs_cache_pressure.

When you created the filesystem, what did you specify for the inode size (you can check "isize" with sudo xfs_info /dev/sdX)?

http://docs.openstack.org/developer/swift/deployment_guide.html?highlight=inode#filesystem-considerations

Thanks again for sharing your experience.


OK, the deadline scheduler makes a big improvement. With that I don't see any errors, as long as I keep files_per_second and zero_byte_files_per_second at 1. At 10 I'm seeing one client-visible error every 10 hours. I'm experimenting with 5.


Thanks for the zero_byte_files_per_second pointer. I missed that. Setting it to 1 (which is effectively turning it off), I still get 4 client-visible errors per day, where a GET fails.
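For anyone following along, these throttles live in the [object-auditor] section of object-server.conf; a sketch with the values discussed here:

```
[object-auditor]
# throttle the auditor's scans so they don't saturate the disks
files_per_second = 1
zero_byte_files_per_second = 1
```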

We were just using the cfq scheduler. I switched it to the deadline scheduler. I'll report back what happens.

I'm still a bit concerned that so much tuning is required for what should be a light load on the overall cluster.


I guess I'm not surprised that the auditors cause an undesirable amount of I/O load without tuning. But in a healthy cluster I wouldn't expect the replicators to be doing enough work to push the object servers into a timeout?

What io scheduler are you using (i.e. /sys/block/*/queue/scheduler)? I'd recommend probably... deadline?
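To check and switch the scheduler at runtime, something like the following; the device name is illustrative, and the change does not persist across reboots:

```
# the active scheduler for each disk appears in square brackets
cat /sys/block/sd*/queue/scheduler

# switch one device to deadline for this boot
echo deadline | sudo tee /sys/block/sdb/queue/scheduler
```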

There are lots of ways to do more I/O shaping: maybe isolating the sqlite/container workload to separate drives/devices/servers, more RAM in the storage nodes, tuning vfs_cache_pressure or the I/O scheduler, and in the end maybe we need to add some tunable for replication (something based on job/suffix in the main replicate loop?).
