OK, I spent last week doing some experiments. Turning off all background consistency processes was a good suggestion.

There are 2 processes which have enough impact to cause periodic failures:

1. object-auditor. After throttling it back a lot (1MB/s, 1 file/s) it still causes about 1 problem every 18 hours. Looking at the IO graph, it still periodically hits the disks heavily. I suspect this is the ZBF audit, which doesn't appear to be affected by the files_per_second configuration.

2. object-replicator. With everything except the object-auditor running, I see about 1 problem every 2 hours. I don't think there's an option to throttle it.
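
For reference, the auditor throttling described above would look roughly like this in object-server.conf. This is a sketch, not a copy of the actual config: the 1000000 value is just one reading of "1MB/s", and the commented-out zero_byte_files_per_second line is an assumption about a separate limit for the ZBF pass that should be checked against the installed Swift version.

```ini
[object-auditor]
# Throttle described above: roughly 1 MB/s and 1 file/s.
bytes_per_second = 1000000
files_per_second = 1
# Assumption: the zero-byte-file pass may have its own rate limit.
# Verify the option exists in your release before relying on it.
# zero_byte_files_per_second = 50
```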

Both of these processes seem to hit each disk in succession, i.e. they'll hit sda for a while, then sdb, then sdc, and so on. Their impact would be a lot smaller if they switched disks more often. They would also get fewer cache hits on the file metadata, though. On the other hand, overall audit/replicate throughput might be higher if they spread their load across all the disks, because presumably they're spending most of their time waiting on IO.
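
To make the difference concrete, here is a minimal Python sketch comparing the two visiting orders. It is not Swift code: the function names, the `work` dictionary and the partition labels are all made up for illustration, and it only models the order in which disks are touched, not real IO.

```python
from itertools import zip_longest

_MISSING = object()  # marks "this disk has no more work" during interleaving


def sequential_order(work_by_disk):
    """Finish all items on one disk before moving to the next (the observed pattern)."""
    return [(disk, item) for disk, items in work_by_disk.items() for item in items]


def interleaved_order(work_by_disk):
    """Round-robin: take one item from each disk in turn until all are done."""
    disks = list(work_by_disk)
    order = []
    for row in zip_longest(*(work_by_disk[d] for d in disks), fillvalue=_MISSING):
        for disk, item in zip(disks, row):
            if item is not _MISSING:
                order.append((disk, item))
    return order


def longest_run(order):
    """Length of the longest streak of consecutive operations on a single disk."""
    longest = current = 0
    previous = None
    for disk, _ in order:
        current = current + 1 if disk == previous else 1
        longest = max(longest, current)
        previous = disk
    return longest


# Hypothetical workload: partition names per disk are placeholders.
work = {
    "sda": ["p1", "p2", "p3"],
    "sdb": ["p4", "p5", "p6"],
    "sdc": ["p7", "p8"],
}

seq = sequential_order(work)
rr = interleaved_order(work)
print("sequential :", [d for d, _ in seq], "longest run:", longest_run(seq))
print("interleaved:", [d for d, _ in rr], "longest run:", longest_run(rr))
```

Both orders do the same total work. The interleaved one just never stays on a single disk for long, which is the property that should reduce the periodic heavy hits. The trade-off with metadata cache locality mentioned above is not modelled here.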