
How to recover from a "disc full" error

asked 2011-05-17 08:18:01 -0500

rtb

I provoked a "disc full" error on an object node's data disc. As expected, the error was logged on a PUT and the affected object ended up with two copies (replicas) instead of three.

My question: other than hunting for those messages in /var/log/messages, what process or command would have warned me that, out of my zillion objects, there are some which aren't 100% healthy?
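
By "hunting for those messages" I mean something along these lines (just a sketch; the exact log file and message text depend on your syslog setup and Swift version):

    # sketch: grep syslog on a storage node for object-server errors
    grep -i "no space left on device" /var/log/messages
    grep "object-server" /var/log/messages | grep -i error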

A related question: after having made space for it, what process is actually responsible for re-creating the missing copy?


7 answers


answered 2011-06-17 19:19:38 -0500

lardcanoe

I was experimenting with swift-bench (for whatever reason the trunk of 1.4.1-dev didn't have swift-dispersion) and I had it fill up all the devices in all zones. Once full, the rest of the PUTs seemed to fail correctly. However, swift-bench crashed during the GET phase and left all the data. So I loaded up 'st' to manually remove the data, only to find that I can't!
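
(As an aside: on versions that do ship the dispersion tools, they're the closest thing I know of to a whole-cluster health report. Roughly, assuming a configured /etc/swift/dispersion.conf:

    # populate a sample of dispersion objects, then report how many of their replicas are reachable
    swift-dispersion-populate
    swift-dispersion-report

That answers the original "which objects aren't 100% healthy" question statistically, not per object.)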

Trying to remove a container:

    Container DELETE failed: https://192.168.14.190:8080/v1/AUTH_d2e67314-2a18-45e9-bde1-7b2b3d6da553/d60f7b13e70344f6a87bf8cbf1766207_0 409 Conflict

Trying to remove a single object:

    Object DELETE failed: https://192.168.14.190:8080/v1/AUTH_d2e67314-2a18-45e9-bde1-7b2b3d6da553/d60f7b13e70344f6a87bf8cbf1766207_0/04ef828997f24ecb8d2fa77100810720 503 Internal Server Error

Tailing syslogd on the proxy:

    Jun 17 15:08:45 proxy-boston-01 proxy-server Object DELETE returning 503 for (503, 503, 503)
    Jun 17 15:08:45 proxy-boston-01 proxy-server 192.168.14.194 192.168.14.194 17/Jun/2011/19/08/45 DELETE /v1/AUTH_d2e67314-2a18-45e9-bde1-7b2b3d6da553/d60f7b13e70344f6a87bf8cbf1766207_0/04ef828997f24ecb8d2fa77100810720 HTTP/1.0 503 - - system%3Aroot%2CAUTH_tk5d0a015fc03c4dc090fe300b9f44ffb1 - - - - - 0.0064

Tailing syslog on one of the storage nodes:

    Jun 17 15:11:51 storage-z1-001 object-server ERROR __call__ error with REPLICATE /sdd1/16527 : [Errno 28] No space left on device: '/srv/node/sdd1/objects/16527/tmpUMFt7t.tmp'

(Well, this error is thrown basically every second...)
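
A quick way to confirm which devices are actually out of space (a sketch, assuming the usual /srv/node mount layout):

    # show usage for each mounted data device on a storage node
    df -h /srv/node/*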

So from my perspective, I'm SOL unless I actually go to the storage node and manually delete data, which is scary.

Just because best practice says to never let devices get much beyond 80% full doesn't mean the system shouldn't be able to handle filling up completely...


answered 2011-06-17 21:28:15 -0500

gholt

Well, there are limitations to anything. In this case, to make a delete happen there has to be enough space for a zero-byte file to be recorded. If your cluster gets completely full, you need to add capacity before being able to continue. It's a really bad idea to ever let a cluster get to that point.
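
(For context, that zero-byte file is the tombstone the object server writes in place of the deleted object's .data file. A sketch of how to spot them on a storage node, assuming the default /srv/node layout:

    # list delete tombstones (.ts files) left on a storage node's data devices
    find /srv/node/*/objects -name '*.ts' -ls

The replicators clean these up after the tombstones have aged past the reclaim window.)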

I would not manually delete anything in a case such as a completely full cluster. I would add new capacity and let the cluster stabilize before using the usual API to issue deletes.
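
(Adding capacity boils down to adding devices to the rings and rebalancing. A rough sketch with made-up zone/IP/device values, run wherever you keep the builder files:

    # add a new device to the object ring, rebalance, then distribute the new ring
    swift-ring-builder object.builder add z1-192.168.14.199:6000/sdb1 100
    swift-ring-builder object.builder rebalance
    # afterwards, copy the regenerated object.ring.gz to /etc/swift/ on every node

Do the same for the account and container builders if those devices are shared.)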


answered 2011-05-18 00:35:26 -0500

gholt

The swift-(account|container|object)-replicator daemons handle replication.

I don't think we've ever hit a completely full file system before; we always add space once we hit 80%.

Scanning the logs for errors is one way to be notified of problems. There is currently no utility to audit an entire cluster and determine which objects are under-replicated. Once space is added, the rsync processes launched by the replicators mentioned above should bring everything back into sync.
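
(If you'd rather not wait for the next scheduled pass, a replicator can be run by hand on a storage node; a sketch, assuming the default config path:

    # run a single object-replication pass instead of waiting for the daemon's next cycle
    swift-object-replicator /etc/swift/object-server.conf once

The account and container replicators take the same "once" argument with their own configs.)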


answered 2011-05-24 13:00:35 -0500

letterj

If a device errors out, the system will try to write an object to the next available device in the list, assuming you have more zones than the number of replicas. As long as you have that device in the ring, during every replication pass the replicator will try to move the data back to its original position.

The best way to avoid this problem is to monitor available disk capacity. Depending on your use case, plan to add capacity when your devices get to 80% full.
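
(Even a simple cron'd check covers this; a sketch, assuming data devices are mounted under /srv/node:

    # warn when any data device is 80% full or more
    df -P /srv/node/* | awk 'NR > 1 && int($5) >= 80 {print "WARNING: " $6 " is " $5 " full"}'

Hooking the same threshold into whatever monitoring system you already run is the more durable version of this.)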


answered 2011-05-24 13:18:28 -0500

rtb

Thanks Jay Payne, that solved my question.


answered 2011-05-24 13:49:48 -0500

letterj

Rainer,

Let me verify my answer in our lab and get back to you.

One of the developers sent me a note that a full disk may not be treated like an "unmounted" drive, and so may not trigger a "handoff". I want to say that we tested this specific scenario, but again, let me verify.

It may take me a couple of days to get the results.


answered 2011-05-24 15:19:10 -0500

rtb

Thanks for looking. However, experimentally you're correct: the full disk causes an error which then indeed creates a "handoff" entry. Through which magic, I did not investigate.




Stats

Asked: 2011-05-17 08:18:01 -0500

Seen: 1,725 times

Last updated: Jun 17 '11