
imiller's profile - activity

2015-09-16 21:00:40 -0600 received badge  Famous Question
2015-09-16 21:00:40 -0600 received badge  Notable Question
2014-10-27 22:42:50 -0600 received badge  Popular Question
2012-08-31 07:28:51 -0600 answered a question Can somebody explain "self-healing" to me ?

https://github.com/rpedde/swift-training-kick/blob/master/exercises/exercise6-ring-management.txt

2012-08-30 23:21:10 -0600 answered a question Can somebody explain "self-healing" to me ?

PS:

I like reasons. (I don't care what anybody does as long as there is a reason behind it, preferably one which makes sense; where I work, people get fired for having no reasons.)

"This choice was made because a) most drive failures are transient (eg a new drive can be swapped in relatively quickly) and b) since replicating data out can place a higher burden on other storage nodes, an errant automatic ring update could have cascading failures throughout the cluster."

Makes perfect sense to me, and it cements my idea that nothing really changes without a ring update: operator healing. Which also makes sense to me.

What I am interested in is the effect of recovering from a failure. Looking at things, I think that if I lost all my storage node OS drives to "an attack", then as long as my proxy is alive (or at least one OS drive containing the ring files survives), the cluster still knows where everything is: I can rebuild the storage node OS drives with the same IPs as before and things would start to function as before, pointing at their existing data drives... I do NOT want a situation where I have 100+ drives of data which I cannot access. Is there a best practice for ring file and SQLite backups in a Swift deployment? If there isn't one around, we need to try and work this out.
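In the meantime, this is the sort of minimal backup I have in mind. A sketch only, assuming the ring and builder files live in the usual /etc/swift location; the backup destination is a placeholder for my setup.

import glob
import os
import tarfile
from datetime import datetime

# Sketch: archive the ring and builder files so losing an OS drive does
# not mean losing the cluster map. Paths below are assumptions.
SWIFT_ETC = "/etc/swift"
BACKUP_DIR = "/var/backups/swift-rings"   # placeholder destination

def backup_rings():
    os.makedirs(BACKUP_DIR, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = os.path.join(BACKUP_DIR, "rings-%s.tar.gz" % stamp)
    files = (glob.glob(os.path.join(SWIFT_ETC, "*.ring.gz")) +
             glob.glob(os.path.join(SWIFT_ETC, "*.builder")))
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=os.path.basename(path))
    return archive

if __name__ == "__main__":
    print("wrote", backup_rings())

Copying that archive off the node (or keeping it on more than one machine) is the part that actually matters; the .builder files in particular are what you need in order to make future ring changes without starting from scratch.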

2012-08-30 22:57:16 -0600 answered a question Can somebody explain "self-healing" to me ?

Thanks John Dickinson (notmyname) (notyourname?) ...

This is how I read it too... which puts the last 2 sheets of A4 to waste ...

So a drive failure does not self-heal. It becomes a black spot, where writes are diverted and the replica count for data on that device is reduced by 1.

If the drive comes back, its replicas catch up by means of eventual replication. If the drive never comes back and is never replaced, then everything on that drive stays one replica short. If the drive is replaced, then it is assigned the same partitions as before, but Swift sees them as blank and so populates them with the data they should hold.

Which is how I thought it was. Not self healing in the way a troll would, but more self protecting in the way the starship liberator would.
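To convince myself of that "replaced drive fills back up" point, here is a toy model of a replication pass. This is my own sketch to reason with, NOT Swift's actual code, and all the names in it are made up.

# Toy model of eventual replication -- my own sketch, not Swift's
# implementation. Each device stores {partition: set(object names)};
# the "ring" maps each partition to the devices holding its replicas.
def replication_pass(ring, devices):
    for partition, replica_devs in ring.items():
        # union of everything any live replica holds for this partition
        want = set()
        for dev in replica_devs:
            if devices[dev]["up"]:
                want |= devices[dev]["parts"].get(partition, set())
        # push missing objects to every live replica; a blank, newly
        # replaced drive is just a device that is missing everything
        for dev in replica_devs:
            if devices[dev]["up"]:
                devices[dev]["parts"].setdefault(partition, set()).update(want)

# Partition 7 lives on sdb, sdc and sdd; sdd has just been replaced (blank).
ring = {7: ["sdb", "sdc", "sdd"]}
devices = {
    "sdb": {"up": True, "parts": {7: {"obj-a", "obj-b"}}},
    "sdc": {"up": True, "parts": {7: {"obj-a", "obj-b"}}},
    "sdd": {"up": True, "parts": {}},   # freshly replaced, seen as blank
}
replication_pass(ring, devices)
print(devices["sdd"]["parts"][7])       # {'obj-a', 'obj-b'}

If sdd were marked down instead, this pass would simply skip it and the data would stay one copy short until the drive came back or was replaced, which matches the "black spot" behaviour above.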

How does this black hole scale to a lost OS disk? That is an enormous amount of data dependent on the ring files and an IP.

As far as I can see the ring file references only the IP, so if I replace an OS disk and configure Swift with its old IP, date-based replication of the disks should 'just happen' and there will be no mad rush of data, as long as I don't rebuild the ring...
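One way I can think of to check that, assuming the swift Python package is installed on the node and the object ring is in the usual /etc/swift location: dump the device entries the ring actually stores. This is just a sketch based on my reading of the swift code.

# Sketch: list what the object ring records per device. Assumes the
# swift package is installed and /etc/swift/object.ring.gz exists.
from swift.common.ring import Ring

ring = Ring('/etc/swift/object.ring.gz')
for dev in ring.devs:
    if dev is None:          # removed device ids leave gaps in the list
        continue
    print(dev['id'], dev['zone'], dev['ip'], dev['port'],
          dev['device'], dev['weight'])

If the rebuilt node comes back with the same IP, port and device names, those entries still match, which is why no ring change (and no mad rush of data) should be needed.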

Which leaves the "self healing" question wide open really... So I'm gonna reopen this for a while. I'd like to leave the last pane full of fact and help :)

Thanks again JD for the sanity check post closure.

2012-08-30 22:20:49 -0600 answered a question Can somebody explain "self-healing" to me ?

Oh, that's even better news. I feel ashamed that I cannot seem to glean this info myself from the docs; I have wasted hours on the wrong docs whilst proving out basic functionality, and now that I am trying to comprehend what is going on I am as deep as I was at the start.

This is great, and helps with bit-rot and, more likely, multi-drive failure a very great deal. (I have had hard disk 'batch failure' in the past, where we lost 20% of drives over a 9 month period; hopefully a 1 in 50 year event ;) )

Samuel, once this "absolutely necessary" tertiary replication is done... if and when the missing parts of the ring come back, revealing the lost partitions, will the object replicator then set about moving items to the safest locations as part of its normal duty? Looking at it, this looks to be the case...

"Self healing" now I understand it better; from the top level docs and testing it appeared that to do anything I had to initiate ring rebuilds. Sure this helped me see things happen (like scaring a horse!) - I suspect now that if I had just sat back and waited I would have gleaned this info.

Thank you

2012-08-30 21:54:59 -0600 answered a question Can somebody explain "self-healing" to me ?

Thank you Constantine for such a swift and full answer; it is very much appreciated!

2012-08-30 20:54:41 -0600 answered a question Can somebody explain "self-healing" to me ?

PS - I really appreciate the help :)

2012-08-30 20:54:17 -0600 answered a question Can somebody explain "self-healing" to me ?

" "If a drive drive is failed , Swift does not work to replicate the data from that drive to another drive "

Incorrect. Object replicator will replicate data from one drive to other drive. "

Aha; now this is what I was missing from the docs. The docs do not mention this process occurring; they simply mention writes bound for the failed disk going elsewhere.

So, if a failed disk is replicated in the background, this means that in a disk-failure situation the Swift deployment will eventually become whole again; that is, ALL data lost on the failed drive will be replicated to other drives. This is (I assume) akin to a lost drive being given zero weight.

So, Constantine, to respond to your original reply, this is self-healing. Is this true:

IF a physical machine suffers a drive failure, Swift will replicate the data which was present on the failed drive to functioning hardware, to ensure every object has the correct number of replicas. If left indefinitely in this state, the system loses only the storage capacity of that drive; IMPORTANTLY: * data integrity will be exactly as if the drive were functioning. *

The above would be perfect. Scaling that up though:

"Each node is a bunch of devices. If all devices fail. Then replacing them all with some new devices on the same IP will do the trick just fine."

My question was to do with losing ONLY the OS drive(s). Say we have 72 x 3 TB disks hanging off a single server and we lose that server. All the (data) drives are fine; we can plug them into another motherboard and install a new OS disk.

Is the following true?:

1) As soon as that server goes down, Swift starts replicating data from all those disks to live disks.
2) If it is left down indefinitely, everything would be fine; we would just lose the capacity of those disks.
3) If we bring those disks back to life on a newly built node with the same IP, then objects replicated away will be deleted and objects which are awaiting updates will be updated.

OR is a lost OS drive a lost node forever ?

All I can find to go on as far as procedures go is this: http://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring (Handling Drive Failure & Handling Server Failure), which makes it sound like Swift 'works around' rather than 'repairs / heals'.
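For my own procedure doc, this is roughly what I understand the "Handling Drive Failure" path to boil down to when a drive is not coming back: take it out of the ring and let replication fill the gap elsewhere. A sketch only; the builder path and the device search value below are placeholders for my cluster, not commands to copy verbatim.

# Sketch of retiring a dead device from the object ring (placeholder
# builder path and device search value; the account and container rings
# would need the same treatment for their own device entries).
import subprocess

BUILDER = "/etc/swift/object.builder"
FAILED_DEV = "z1-10.0.0.1:6000/sdk"    # hypothetical zone/IP/device

# Mark the failed device for removal so its partitions get reassigned...
subprocess.check_call(["swift-ring-builder", BUILDER, "remove", FAILED_DEV])
# ...recompute the partition assignment and write a new object.ring.gz...
subprocess.check_call(["swift-ring-builder", BUILDER, "rebalance"])
# ...and then the regenerated ring file has to be pushed to every node.

Until that is done (or the drive is replaced), Swift just works around the hole, which is exactly the "works around rather than heals" behaviour the guide describes.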

I realise that on larger deployments this is of little consequence, but for the rest of us paying very high €€ for each amp we consume, we have to try and work out whether 5 servers with 100 disks or 50 servers with 10 disks is better. The latter is significantly higher in cost! (Throughput is not an issue, only resilience.)

2012-08-30 07:45:49 -0600 answered a question Can somebody explain "self-healing" to me ?

Hi, Thanks for the info Florian; very much appreciated.

This is what I understood, but; could you please clarify the following for me?

"swift starts working around the failure by doing things like writing uploads destined to that drive automatically to a handoff node"
- I have read this, and it would imply that when a single drive fails on a node, that node no longer accepts writes; this in turn would imply that a drive failure on a node which is designated as a single zone renders the entire zone non-writable, whether that zone contains 3 drives or 100.
- So, can the 'handoff node' actually be the same node but a different drive? (See the sketch below.)

- If a drive is failed, Swift does not work to replicate the data from that drive to another drive, except for writes which would have been made to that drive. So the deployment does not 'self-heal' as such; it simply works degraded until the faulty component is replaced or brought back on-line.

- If an entire node fails and the Swift install drives are toast: once the failure is rectified, does the installation simply catch up, or would all those drives then be overwritten again? That is, since a node is defined by its IP address, as long as you rebuild the Swift install with the same IP, no ring updates are required and only modified / new data will be copied back to that node. Is this the case?
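To answer my own handoff question above: I believe the ring can be asked directly which devices are primaries and which are handoffs for a given object. A sketch, assuming the swift Python package and /etc/swift/object.ring.gz are present on the box; the account, container and object names are made up.

# Sketch: show the primary and handoff devices for one object.
# Assumes the swift package is installed; names below are made up.
from swift.common.ring import Ring

ring = Ring('/etc/swift/object.ring.gz')
part, primaries = ring.get_nodes('AUTH_test', 'backups', 'some/object')

print("partition", part)
for node in primaries:
    print("primary:", node['ip'], node['device'])

# Handoff candidates, in the order they would be tried if a primary
# device is unavailable; note they are other devices, which may or may
# not live on the same IP as one of the primaries.
for i, node in enumerate(ring.get_more_nodes(part)):
    print("handoff:", node['ip'], node['device'])
    if i >= 4:                          # just the first few
        break

If I have read that right, a single failed drive only diverts the writes for its partitions to the next devices in that handoff list; it does not make the whole node (or zone) unwritable.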

I would love to see a document outlining different failures and how they are managed, from a single-server / zone-per-disk scenario up to a 4-server, zone-per-server install... rather than the 'probably the best thing to do...' scenarios in the manual. I would like to know what would have to happen to lose data.

Thanks,

Isaac

PS - I understand that bit-rot is self-healing, but it's drive failure / installation failure I want to get a grasp on.

2012-08-21 15:51:32 -0600 asked a question Can somebody explain "self-healing" to me ?

Hi,

I am very close to rolling out a small Swift production environment for the purposes of backup and archiving.

Currently I am drafting maintenance and procedure docs and so I am going around in circles trying to work out what happens for any given fault and what actions to take.

I was wondering if anybody could point me in the right direction as to how Swift 'self-heals'; the term is bandied about all over the place, but I am struggling to find examples.

As far as I can work out, Swift will work around faults but no actual healing will take place until the ring is updated.

For example: if an HDD which contains an object to be updated fails (and gets unmounted), then the object will be updated on other nodes/HDDs until the failed HDD comes back, or it is taken out of the ring and the ring is updated. This isn't self-healing, this is operator healing.

Am I missing something fundamental ?

Thanks for your patience,