So I put some more load on the cluster, and now I'm back to one failure every two hours.

What's the general recommendation on retries? These failures happen infrequently enough that I bet I can just retry and it'll work. The extra delay is not a big problem for my application. Is this a good idea?
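
For concreteness, this is roughly the kind of retry I have in mind. It's only a minimal sketch: `retry_with_backoff` and `TransientError` are hypothetical names I made up for illustration (stand-ins for my own wrapper and for whatever exception the client actually raises), and the attempt count and delays are guesses, not tuned values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for whatever the cluster client raises on failure."""


def retry_with_backoff(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff (base_delay, 2x, 4x, ...) scaled by a random
            # jitter factor between 0.5 and 1.5 so retries don't line up.
            delay = base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5)
            sleep(delay)
```

I'd call it as `retry_with_backoff(lambda: do_request(...))`, where `do_request` is again a placeholder for the real call. I realize this is only safe if the operation can be repeated without side effects, and the attempt cap is there so a real outage still shows up as an error instead of being retried forever.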