swift-container-replicator timeouts - uploaded too many objects

asked 2016-05-18 04:46:14 -0500

updated 2016-06-01 04:09:49 -0500

Hi!

I've been running performance benchmarks on a small Swift cluster (1 proxy, 3 storage nodes). In one of them I uploaded 6.4M objects into a single container. Afterwards I learned that you're not supposed to do that (because container sharding is not yet implemented?) and that you should stay below roughly 1M objects per container. I have since deleted the container and all the objects in it, but the errors in my swift-container-replicator log files won't go away, so the situation is apparently not fully repaired. I would like the errors to disappear before I continue benchmarking, because I'm not sure my future benchmarks will be representative otherwise.
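
For context, staying near the ~1M-objects-per-container guideline usually means having the client spread objects over several containers. The snippet below is only a minimal sketch of that idea, not what my benchmark actually did: the function name `container_for`, the `bench` prefix, and the count of 8 containers are assumptions made up for illustration (6.4M objects over 8 containers would be about 0.8M each).

```python
import hashlib


def container_for(object_name: str, base: str = "bench", num_containers: int = 8) -> str:
    """Map an object name to one of `num_containers` containers.

    Hypothetical helper: hashing the name gives a stable, roughly even
    spread, so the same object always lands in the same container.
    """
    digest = hashlib.md5(object_name.encode("utf-8")).hexdigest()
    index = int(digest, 16) % num_containers
    return f"{base}-{index:03d}"


if __name__ == "__main__":
    # Show where a few example object names would be placed.
    for name in ("object-000001", "object-000002", "object-000003"):
        print(name, "->", container_for(name))
```

The client would then PUT each object into `container_for(name)` instead of a single fixed container, and use the same function to find it again on GET or DELETE.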

An example of the error that keeps popping up is included below (the log output from one container replication run). Every replication run produces a few errors, but they are never tied to the same devices. Restarting the container-replicator processes does not help.

What further 'container cleanup' can I do, short of reformatting all drives and deploying new rings? (I'd like to avoid that.)

And another question related to this: during the benchmarks I noticed that while writing the 6.4M objects (about 3.7TB in total, in about 3 hours), write performance decreased only slightly over time (possibly due to the container DB being abused). When deleting the objects (at a slightly lower pace, with about half the threads of the write benchmark), however, performance started to decrease quickly, with huge variations in response time. Can anybody with deep knowledge of Swift's internals explain why this happened?

Thanks a lot in advance!

Kind regards, Pieter van Wijngaarden

Update June 1st 2016: the errors (after lasting for several days) seem to have disappeared automagically. While this is good news, I would still appreciate it a lot if one of the Swift core developers (or somebody who knows Swift inside out) could tell me what kind of behavior I was observing :).

May 18 11:31:04 test-cloud01 container-replicator: Beginning replication run
May 18 11:31:20 test-cloud01 container-replicator: ERROR reading HTTP response from {'index': 2, 'replication_port': 6001, 'weight': 100.0, 'zone': 1, 'ip': '192.168.40.4', 'region': 1, 'id': 22, 'replication_ip': '192.168.40.4', 'meta': u'', 'device': 'sdb', 'port': 6001}: Timeout (10s)
May 18 11:31:20 test-cloud01 container-replicator: ERROR reading HTTP response from {'index': 2, 'replication_port': 6001, 'weight': 100.0, 'zone': 1, 'ip': '192.168.40.4', 'region': 1, 'id': 32, 'replication_ip': '192.168.40.4', 'meta': u'', 'device': 'sdl', 'port': 6001}: Timeout (10s)
May 18 11:31:20 test-cloud01 container-replicator: ERROR reading HTTP response from {'index': 2, 'replication_port': 6001, 'weight': 100.0, 'zone': 1, 'ip': '192.168.40.3', 'region': 1, 'id': 14, 'replication_ip': '192.168.40.3', 'meta': u'', 'device': 'sde', 'port': 6001}: Timeout (10s)
May 18 11:31:35 test-cloud01 container-replicator: ERROR reading HTTP response from {'index': 0, 'replication_port': 6001, 'weight': 100.0, 'zone': 1, 'ip': '192.168.40.3', 'region ...
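
To check whether the timeouts really move between devices from run to run, the log lines can be tallied per (ip, device) pair. This is a rough sketch that assumes the line format shown above; the function name `count_timeouts` is made up for illustration, and lines that are truncated or lack a device field are simply skipped.

```python
import re
from collections import Counter

# The leading quote keeps these from matching 'replication_ip'.
IP_RE = re.compile(r"'ip': '([^']+)'")
DEVICE_RE = re.compile(r"'device': '([^']+)'")


def count_timeouts(lines):
    """Count replicator timeout errors per (ip, device) pair."""
    counts = Counter()
    for line in lines:
        # Only consider complete timeout error lines.
        if "ERROR reading HTTP response" not in line or "Timeout" not in line:
            continue
        ip = IP_RE.search(line)
        device = DEVICE_RE.search(line)
        if ip and device:
            counts[(ip.group(1), device.group(1))] += 1
    return counts
```

Passing an open log file (for example `count_timeouts(open(path))`, with `path` pointing at wherever your syslog writes the replicator output) returns a `Counter`, and `most_common()` on it shows whether any single device dominates.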