Cluster performance issues with mass deletion

asked 2018-12-27 03:23:33 -0500

Business Requirement

At present, our business needs to upload and delete 10,000 objects per minute, where each object is about 700 KB in size. We currently have 2 servers and 36 HDD devices, and each server has a gigabit network card.

Strategy

We calculate as follows:

IO throughput: 20,000 * 700 KB / 60 s ≈ 233 MB/s

IO operations: 20,000 / 60 s ≈ 333 IOPS

A normal HDD can provide about 100 IOPS or 100 MB/s of sequential reads and writes. To allow for redundancy and other overhead, we provisioned considerably more disks than that and use a single-replica strategy. Each of the 2 servers mounts 18 disks: one is the system disk, three store account and container data, and 14 store object data.
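To make the arithmetic explicit, here is a minimal sketch of our sizing calculation; the 100 IOPS / 100 MB/s per-disk figures are our own assumptions rather than measured values:

    // CapacityEstimate.java - a minimal sketch of our sizing arithmetic.
    // The per-disk figures (100 IOPS, 100 MB/s) are assumptions, not measurements.
    public class CapacityEstimate {
        public static void main(String[] args) {
            int opsPerMinute = 20_000;        // 10,000 uploads + 10,000 deletes
            double objectSizeMB = 0.7;        // each object is about 700 KB
            int objectDisks = 2 * 14;         // 2 servers x 14 object-data disks

            double requiredMBps = opsPerMinute * objectSizeMB / 60.0;  // ~233 MB/s
            double requiredIops = opsPerMinute / 60.0;                 // ~333 IOPS
            double diskMBps = objectDisks * 100.0;  // assumed 100 MB/s per HDD
            double diskIops = objectDisks * 100.0;  // assumed 100 IOPS per HDD

            System.out.printf("required:    %.0f MB/s, %.0f IOPS%n", requiredMBps, requiredIops);
            System.out.printf("provisioned: %.0f MB/s, %.0f IOPS (object disks only)%n",
                    diskMBps, diskIops);
        }
    }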

Config

We use the default configuration for all of the configuration files under /etc/swift.
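For reference, these are the kinds of settings in proxy-server.conf and object-server.conf that we have considered tuning instead of keeping the defaults; the values below are only illustrative guesses, not what we currently run:

    # proxy-server.conf (illustrative values, not our current settings)
    [DEFAULT]
    workers = 8

    [app:proxy-server]
    conn_timeout = 0.5
    node_timeout = 10

    # object-server.conf (illustrative values, not our current settings)
    [DEFAULT]
    workers = 8

    [object-updater]
    interval = 300
    concurrency = 4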

Problem

We prepared a test Java program that uses multiple threads to send HTTP requests to the cluster; it runs once per minute and uploads and deletes 10,000 objects in each cycle. As benchmark data, we pre-loaded 230 million objects into the cluster after building it. We then started the program and observed the results.
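To make the test setup concrete, the program does roughly the following (this is a stripped-down sketch; the storage URL, token, container name, and thread count are placeholders, and object selection and error handling are simplified):

    // SwiftLoadTest.java - a stripped-down sketch of one per-minute test cycle.
    // STORAGE_URL, TOKEN and the container name are placeholders, not real values.
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SwiftLoadTest {
        static final String STORAGE_URL = "http://proxy:8080/v1/AUTH_test/testcontainer";
        static final String TOKEN = "AUTH_tk_placeholder";
        static final byte[] PAYLOAD = new byte[700 * 1024];   // ~700 KB object body

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(64);
            long start = System.currentTimeMillis();
            for (int i = 0; i < 10_000; i++) {
                final String object = "obj-" + start + "-" + i;
                pool.submit(() -> {
                    request("PUT", object, PAYLOAD);   // upload one object
                    request("DELETE", object, null);   // issue one delete
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
            System.out.println("cycle took " + (System.currentTimeMillis() - start) + " ms");
        }

        static void request(String method, String object, byte[] body) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(STORAGE_URL + "/" + object).openConnection();
                conn.setRequestMethod(method);
                conn.setRequestProperty("X-Auth-Token", TOKEN);
                if (body != null) {
                    conn.setDoOutput(true);
                    try (OutputStream out = conn.getOutputStream()) {
                        out.write(body);
                    }
                }
                conn.getResponseCode();   // wait for the response, then discard it
                conn.disconnect();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }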

At first, each round of upload and delete tasks finished in about 20 seconds, but after some time it suddenly slowed down and took about 80 seconds to complete, and we observed the following phenomena.

1. While this was happening, we monitored server performance with the nmon utility; the results showed that each disk's utilization was below 60%, whereas it had previously reached about 90%.

2. We also found a large number of .ts (tombstone) files in the objects directory, and many entries in the async_pending directory, under each disk's mount point. The files in the async_pending directory all describe delete requests sent to another server (a small counting sketch follows after this list).

3. There are many object-server timeout errors in the proxy server log, but no corresponding errors in the object server log. When we traced the requests behind these errors, we found that both the uploads and the deletes had actually succeeded.
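To quantify the accumulation mentioned in point 2, a small helper along these lines can count the tombstone files and async_pending entries; the /srv/node mount path is an assumption and should be replaced by the actual device mount directory:

    // TombstoneCount.java - a rough sketch for counting .ts files and async_pending
    // entries under the device mount points; /srv/node is an assumed mount path.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class TombstoneCount {
        public static void main(String[] args) throws IOException {
            Path root = Paths.get(args.length > 0 ? args[0] : "/srv/node");
            long tombstones = 0;
            long asyncPending = 0;
            try (Stream<Path> files = Files.walk(root)) {
                tombstones = files.filter(p -> p.toString().endsWith(".ts")).count();
            }
            try (Stream<Path> files = Files.walk(root)) {
                asyncPending = files
                        .filter(Files::isRegularFile)
                        .filter(p -> p.toString().contains("async_pending"))
                        .count();
            }
            System.out.println("tombstone (.ts) files: " + tombstones);
            System.out.println("async_pending entries: " + asyncPending);
        }
    }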

So we want to know whether there is some internal mechanism, such as garbage accumulation, that can cause a sudden drop in performance, or whether, because of some internal mechanism, the cluster performance we calculated does not actually meet the requirements. How can we solve this problem by adjusting the configuration or expanding capacity?

Finally, please forgive my poor English; I sincerely await your reply.
