Lost data from my swift cluster, please answer me, it's urgent!

asked 2013-07-06 11:44:09 -0600 by anilbhargava777

Hello,

First of all, I want to say that this is my third question here, even though my previous two are still unanswered. I solved those two on my own, but this time the problem is very serious, so please help me.

I am running a swift cluster on a single server with 2 zones and a total of 2 TB of storage space, with all services running on the same server. I was trying to add a new storage node. I took a backup of the existing builder and ring files and added the new zone to the existing ring files. When I did the rebalance, the objects were rsynced successfully, but I got account-server errors in syslog as follows:

Jul 4 14:18:57 cloudvault account-server ERROR __call__ error with PUT /sdb1/91380/AUTH_365bc339-7f05-45f3-9854-0fb804cb50ed/Kuppusamy_container1_segments :
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 317, in __call__
    res = getattr(self, req.method)(req)
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 105, in PUT
    req.headers['x-bytes-used'])
  File "/usr/lib/python2.7/dist-packages/swift/common/db.py", line 1446, in put_container
    raise DatabaseConnectionError(self.db_file, "DB doesn't exist")
DatabaseConnectionError: DB connection error (/srv/1/node/sdb1/accounts/91380/a31/1ff74362319d350c0921907021125a31/1ff74362319d350c0921907021125a31.db, 0):
DB doesn't exist

Jul 4 14:18:57 cloudvault account-server ERROR __call__ error with PUT /sdb1/27481/AUTH_.auth/.token_6 :
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 317, in __call__
    res = getattr(self, req.method)(req)
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 105, in PUT
    req.headers['x-bytes-used'])
  File "/usr/lib/python2.7/dist-packages/swift/common/db.py", line 1446, in put_container
    raise DatabaseConnectionError(self.db_file, "DB doesn't exist")
DatabaseConnectionError: DB connection error (/srv/1/node/sdb1/accounts/27481/528/5bbc8ec47ce5a77183c1f6f33666d528/5bbc8ec47ce5a77183c1f6f33666d528.db, 0):
DB doesn't exist

Output of swauth-list:

$ swauth-list -A https://my-swift-cluster.example.com:8080/auth/ -K mykey
{"accounts": [{"name": "mainaccount"}, {"name": "testaccount"}]}

$ swauth-list -A https://my-swift-cluster.example.com:8080/auth/ -K mykey testaccount
{"services": {"storage": {"default": "local", "local": "https://my-swift-cluster.example.com:8080/v1/AUTH_6e1a4b2c-4d79-4a42-8f04-a718236fa1e0"}}, "account_id": "AUTH_6e1a4b2c-4d79-4a42-8f04-a718236fa1e0", "users": [{"name": "test"}]}

$ swauth-list -A https://my-swift-cluster.example.com:8080/auth/ -K mykey mainaccount
List failed: 404 Not Found

The above commands show that I have lost access to my mainaccount. I had approximately 600 GB of very important data stored under mainaccount. I copied the backup of the builder and ring files back to the swift directory and did the rebalance, but I am still getting the same "List failed: 404 Not Found" output as above.

Please help me either solve this problem or recover my data. Thanks in advance.


7 answers

answered 2013-07-06 16:39:57 -0600 by clay-gerrard

All the data is undoubtedly still intact, but the error messages seem to indicate it's not where your running processes expect. You can go hunting for it piece by piece with swift-get-nodes and swift-object-info.
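For example (a rough sketch, assuming your rings live in /etc/swift; the account, container, and object names below are placeholders, not anything from your cluster):

swift-get-nodes /etc/swift/account.ring.gz AUTH_mainaccount
swift-get-nodes /etc/swift/object.ring.gz AUTH_mainaccount/mycontainer/myobject
swift-object-info /srv/node/sdb1/objects/<partition>/<suffix>/<hash>/<timestamp>.data

swift-get-nodes prints the partition, the servers that should hold it, and ready-made curl and ssh commands to check them; swift-object-info reads a .data file you have found on disk and tells you which account/container/object it belongs to.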

Did you try putting your old rings back in place? If the new rings have a different part power, replication isn't going to fix it. You could probably even put the old rings on the new server and let it drain any data that accidentally got placed on it back to the old server, and then we can try again fresh for a smooth capacity adjustment.

We can always help you with how to proceed, but email isn't the best forum for an "emergency".

I'm sorry you don't feel like you got a prompt response to your earlier questions; this week is a holiday in the U.S.

Warm Regards,

-Clay

On Sat, Jul 6, 2013 at 4:46 AM, Anil Bhargava <question231972@answers.launchpad.net> wrote:

New question #231972 on OpenStack Object Storage (swift): https://answers.launchpad.net/swift/+question/231972

Hello,

First of all, I want to say that this is my third question here, even though my previous two are still unanswered. I solved those two on my own, but this time the problem is very serious, so please help me.

I am running a swift cluster on a single server with 2 zones and a total of 2 TB of storage space, with all services running on the same server. I was trying to add a new storage node. I took a backup of the existing builder and ring files and created the builder files again rather than adding the new zone to the existing ones, which I think was a mistake.. :( When I did the rebalance, the objects were rsynced successfully, but I got account-server errors in syslog as follows:

Jul 4 14:18:57 cloudvault account-server ERROR __call__ error with PUT /sdb1/91380/AUTH_365bc339-7f05-45f3-9854-0fb804cb50ed/Kuppusamy_container1_segments :
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 317, in __call__
    res = getattr(self, req.method)(req)
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 105, in PUT
    req.headers['x-bytes-used'])
  File "/usr/lib/python2.7/dist-packages/swift/common/db.py", line 1446, in put_container
    raise DatabaseConnectionError(self.db_file, "DB doesn't exist")
DatabaseConnectionError: DB connection error (/srv/1/node/sdb1/accounts/91380/a31/1ff74362319d350c0921907021125a31/1ff74362319d350c0921907021125a31.db, 0):
DB doesn't exist

Jul 4 14:18:57 cloudvault account-server ERROR __call__ error with PUT /sdb1/27481/AUTH_.auth/.token_6 :
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 317, in __call__
    res = getattr(self, req.method)(req)
  File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 105, in PUT
    req.headers['x-bytes-used'])
  File "/usr/lib/python2.7/dist-packages/swift/common/db.py", line 1446, in put_container
    raise DatabaseConnectionError(self.db_file, "DB doesn't ...

(more)
answered 2013-07-09 12:34:28 -0600 by anilbhargava777

Thank you very much, Clay, for your informative and explanatory answer.

I tried all the approaches you suggested and was finally able to list the containers and data with curl, with no auth system in the pipeline. :)

This is the output of the swift-ring-builder verification you asked for:

$ swift-ring-builder account.builder
account.builder, build version 2
262144 partitions, 1 replicas, 2 zones, 2 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 1
Devices:    id  zone      ip address  port  name  weight  partitions  balance  meta
             0     1    10.180.32.20  6012  sdb1  100.00      137971     0.00
             1     2    10.180.32.20  6022  sdb2   90.00      124173    -0.00

$ swift-ring-builder container.builder
container.builder, build version 2
262144 partitions, 1 replicas, 2 zones, 2 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 1
Devices:    id  zone      ip address  port  name  weight  partitions  balance  meta
             0     1    10.180.32.20  6011  sdb1  100.00      137971     0.00
             1     2    10.180.32.20  6021  sdb2   90.00      124173    -0.00

$ swift-ring-builder object.builder
object.builder, build version 2
262144 partitions, 1 replicas, 2 zones, 2 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 1
Devices:    id  zone      ip address  port  name  weight  partitions  balance  meta
             0     1    10.180.32.20  6010  sdb1  100.00      137971     0.00
             1     2    10.180.32.20  6020  sdb2   90.00      124173    -0.00

As you said in your previous answer, there could be some data on the new node as well, so I searched there too. I got the following output on 10.180.32.22, which I had previously added as a new storage node:

$ find /srv/node/ -name *.db
/srv/node/sdc1/containers/209408/5e8/cc800b2a07d30c3776dedf4f3edc55e8/cc800b2a07d30c3776dedf4f3edc55e8.db
/srv/node/sdc1/containers/231321/a01/e1e67042d7b7390aee99d743c7139a01/e1e67042d7b7390aee99d743c7139a01.db
/srv/node/sdc1/containers/203107/716/c658c9213ce5e47e384bb643f4939716/c658c9213ce5e47e384bb643f4939716.db
/srv/node/sdc1/containers/222609/31a/d9644d9126a48054b763b189cf03f31a/d9644d9126a48054b763b189cf03f31a.db
/srv/node/sdc1/containers/183616/ebf/b35003a21a3fa14be615f0ce8a994ebf/b35003a21a3fa14be615f0ce8a994ebf.db
/srv/node/sdc1/containers/223947/520/dab2ee343d5721cb564cee9b769e1520/dab2ee343d5721cb564cee9b769e1520.db
/srv/node/sdc1/containers/235872/d22/e6583fa257ed916e3e676a002c084d22/e6583fa257ed916e3e676a002c084d22.db
/srv/node/sdc1/containers/225846/1f6/dc8da88182acc2aa3abeee493b1171f6/dc8da88182acc2aa3abeee493b1171f6.db
/srv/node/sdc1/containers/231948/01a/e28323de7b4938763e7455338f1e401a/e28323de7b4938763e7455338f1e401a.db
/srv/node/sdc1/containers/231068/008/e1a7275ad16bb6553fb50089fe0ac008/e1a7275ad16bb6553fb50089fe0ac008.db
/srv/node/sdc1/containers/188038/16d/b7a19cb6844e3c001d55076faf3d516d/b7a19cb6844e3c001d55076faf3d516d.db
/srv/node/sdc1/containers/245841/175/f0145712b7b24186b580e26e34294175/f0145712b7b24186b580e26e34294175.db
/srv/node/sdc1/containers/247277/6c7/f17b4f811f1ae54fe43330dbddb8f6c7/f17b4f811f1ae54fe43330dbddb8f6c7.db
/srv/node/sdb1/containers/254796/d98/f8d33903bae5cfab5ba4fa2bc1456d98/f8d33903bae5cfab5ba4fa2bc1456d98.db
/srv/node/sdb1/containers/198130/235/c17cab7ae29eb87da50fa9559b472235/c17cab7ae29eb87da50fa9559b472235.db
/srv/node/sdb1/containers/209854/e82/ccefb19244f20fe741ce3c3ae0051e82/ccefb19244f20fe741ce3c3ae0051e82.db
/srv/node/sdb1/containers/220045/205/d6e344ddb343ce017ca789a4b9bd1205/d6e344ddb343ce017ca789a4b9bd1205.db
/srv/node/sdb1/containers/231503/7aa/e213dc13c8170cdfb2d1dea6cfc177aa/e213dc13c8170cdfb2d1dea6cfc177aa.db
/srv/node/sdb1/containers/225629/eef/dc57716afcec9c7f828399999ba40eef/dc57716afcec9c7f828399999ba40eef.db
/srv/node/sdb1/containers/203088/235/c65425050db47f7e3e21038c48794235/c65425050db47f7e3e21038c48794235.db
/srv/node/sdb1/containers/236697/86b/e7264fb2cd53361525a36d3f1d65a86b/e7264fb2cd53361525a36d3f1d65a86b.db
/srv/node/sdb1/containers/235301/709/e5c957fa7fd0e679801d92adde157709/e5c957fa7fd0e679801d92adde157709.db
/srv/node/sdb1/containers ... (more)

answered 2013-07-09 06:31:47 -0600 by anilbhargava777

Thank you very much for your response, Clay.

I don't know how to access the data because the error shows that the account-related DBs are missing.

Following is the output of the commands you suggested:

$ swift-get-nodes account.ring.gz mainaccount

Account mainaccount Container None Object None

Partition 134708 Hash 838d1ece4c89e07201dc996b91d10d6c

Server:Port Device    10.180.32.20:6012 sdb1
Server:Port Device    10.180.32.20:6022 sdb2   [Handoff]

curl -I -XHEAD "http://10.180.32.20:6012/sdb1/134708/mainaccount"
curl -I -XHEAD "http://10.180.32.20:6022/sdb2/134708/mainaccount" # [Handoff]

ssh 10.180.32.20 "ls -lah /srv/node/sdb1/accounts/134708/d6c/838d1ece4c89e07201dc996b91d10d6c/"
ssh 10.180.32.20 "ls -lah /srv/node/sdb2/accounts/134708/d6c/838d1ece4c89e07201dc996b91d10d6c/" # [Handoff]

$ ls -lah /srv/node/sdb1/accounts/134708/d6c/838d1ece4c89e07201dc996b91d10d6c/
ls: cannot access /srv/node/sdb1/accounts/134708/d6c/838d1ece4c89e07201dc996b91d10d6c/: No such file or directory

$ swift-object-info /srv/1/node/sdb1/objects/111/bd9/001bf33e41d1d4fa438bf75c814b6bd9/1361176574.73346.data
Path: /AUTH_.auth/mainaccount/rameez
  Account: AUTH_.auth
  Container: mainaccount
  Object: rameez
  Object hash: 001bf33e41d1d4fa438bf75c814b6bd9
Ring locations:
  10.180.32.20:6010 - /srv/node/sdb1/objects/111/bd9/001bf33e41d1d4fa438bf75c814b6bd9/1361176574.73346.data
Content-Type: application/octet-stream
Timestamp: 2013-02-18 14:06:14.733460 (1361176574.73346)
ETag: 6a56ee9b6f38ed2ea8b3a84d0791570f (valid)
Content-Length: 93 (valid)
User Metadata: {'X-Object-Meta-Auth-Token': 'AUTH_tkb8127e8c0c134c6eb9acd6be078d8dd1'}

There was a single account named mainaccount, and various users and their containers were under it.

Can the replication factor be the cause of the error, since it was set to only 1? Is there any way to recover and download this data manually? Can we open and read these db files to get info out of them?

And can you please tell me how swift works with creating so many directories and long unique file names, and how it maps them?

answered 2013-07-09 07:38:12 -0600 by clay-gerrard

oic, replica count on your rings is 1 - that's interesting, I've heard of folks trying to get by with only 2...

the architectural overview gives a brief description of how the names of the files uploaded map to the placement of the data in the cluster: http://docs.openstack.org/developer/swift/overview_architecture.html#the-ring (http://docs.openstack.org/developer/s...)

the replication overview gives a little background on the reason for the file system layout: http://docs.openstack.org/developer/swift/overview_replication.html (http://docs.openstack.org/developer/s...)
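As a rough illustration of the layout those docs describe (my own summary of the standard scheme, with placeholders rather than real values):

/srv/node/<device>/accounts/<partition>/<suffix>/<hash>/<hash>.db          (account DBs)
/srv/node/<device>/containers/<partition>/<suffix>/<hash>/<hash>.db        (container DBs)
/srv/node/<device>/objects/<partition>/<suffix>/<hash>/<timestamp>.data    (objects)

where <hash> is the md5 of /account, /account/container, or /account/container/object salted with the cluster-wide hash suffix from swift.conf, <suffix> is the last three hex characters of that hash, and <partition> comes from the top part-power bits of the same hash. swift-get-nodes works all of this out for you from the ring and the name.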

The databases are just sqlite; you can take a copy, open it up, and poke at it. The schema's not bad:

sqlite3 -line 9d00c9d091fea9ea855f29c22714dc83.db "select * from account_stat"
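If that works, a follow-up query along the same lines should list the containers recorded in that account DB (assuming the standard schema, where each container is a row in the container table; the .db filename is just the example above):

sqlite3 -line 9d00c9d091fea9ea855f29c22714dc83.db "select name, object_count, bytes_used from container where deleted = 0"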

But the objects themselves will be in .data files similar to the swauth object you inspected with swift-object-info

If you know the names of all the objects and containers you can use:

swift-get-nodes /etc/swift/object.ring.gz AUTH_mainaccount/mycontainer/myobject

If you don't, it might be useful first to find the account database for mainaccount - is it in

/srv/1/node/sdb1/accounts/134708/d6c/838d1ece4c89e07201dc996b91d10d6c/

Is everything on 10.180.32.20?

How many account databases could there be really?

find /srv/node1/sdb1/accounts/ -name \*.db

Do you still have the old rings? What about the builder files? Can you post the output of

swift-ring-builder account.builder
swift-ring-builder container.builder
swift-ring-builder object.builder

Another idea: you might try removing swauth from the pipeline and adding an overlapping user to tempauth:

user_mainaccount_rameez = testing .admin

With that swift -A https://my-swift-cluster.example.com:8080/auth/v1.0 -U mainaccount:rameez -K testing stat -v might work...
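Spelled out, the relevant proxy-server.conf pieces might look roughly like this (a sketch only - your actual pipeline and filter sections will differ):

[pipeline:main]
pipeline = catch_errors healthcheck cache tempauth proxy-server

[filter:tempauth]
use = egg:swift#tempauth
user_mainaccount_rameez = testing .admin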

Or even remove all auth from the pipeline; then most clients don't work, but it's easier to poke at with curl...
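For example (a rough sketch once auth is out of the pipeline; the AUTH_mainaccount name below is a placeholder - substitute whatever account name swift-get-nodes and the account DBs show for your real data):

curl -i "https://my-swift-cluster.example.com:8080/v1/AUTH_mainaccount?format=json"
curl -i "https://my-swift-cluster.example.com:8080/v1/AUTH_mainaccount/mycontainer"
curl -o myobject "https://my-swift-cluster.example.com:8080/v1/AUTH_mainaccount/mycontainer/myobject"

The first lists containers in the account, the second lists objects in a container, and the third downloads an object.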

answered 2013-07-12 16:15:19 -0600 by clay-gerrard
  1. As I have a web application running as a front-end for the swift cluster and there are some existing users, my plan is to download all the data with curl for each user, do a fresh multi-node installation, recreate the existing users, and re-upload the data. Is this the right way to proceed, or can you suggest another approach?

I think learning how to do a smooth ring rebalance and capacity adjustment would be a good exercise - it's sort of a big deal. If you have headroom, maybe leave things where they are and spin up some VMs to practice?
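The usual pattern is to add the new device at a low weight and step it up over several rebalances (a sketch only - the device name, IP, port, device id, and weights here are made up; repeat for the account and container builders too):

swift-ring-builder object.builder add z3-10.180.32.22:6000/sdc1 20
swift-ring-builder object.builder rebalance
# push the new object.ring.gz to every node, wait for replication to settle
# (and at least min_part_hours), then raise the weight and rebalance again
swift-ring-builder object.builder set_weight d2 50
swift-ring-builder object.builder rebalance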

  2. I have 3 servers now (4x3TB + 4x3TB + 1x2TB) and am planning a multi-node swift production cluster. Can you suggest the number of zones, the partition power, and the replica count for this?

Region and Zone are physical things, just like drives and servers; there's no longer any reason to make them up. If you have link(s) between some node(s) that are slower or faster than others, that's a region. A zone is a failure domain: a TOR switch, redundant power/battery backup, a different rack/location in the datacenter/colo.

Part power is hard, you have to shoot for the middle and think ahead (see below).

IMHO replica count is meant to be 3. I've heard of people getting by with 2, and now that we have an adjustable replica count, maybe the first time you have a downed node and start throwing 404s or 500s, or lose data because of multiple correlated disk failures, you can consider adding capacity and shoring up your replica count? You'd really have to ask people running with 1 or 2 replicas whether that's working out for them.

  3. Can you tell me how to decide the partition power and replication count according to the available storage space?

If you plan on running those nodes at capacity without adding more nodes/drives (say > 70% disk full), I sure would hate to see your replication cycle times with a part power greater than 17. I think 16 may even be reasonable, but then if down the road you've got more than 600 drives in maybe 25-30 nodes (let's say 5TB drives by then, that's ~1PB usable), you may be fighting balance issues as each drive has fewer than 100 partitions on it :\ Maybe limiting your max object size to something like 2 GB might help protect you, or maybe if you've got metrics you can apply back pressure from disk capacity over a rebalance loop until you get unbalanced partitions spread around (hard when there's only 100 of them per drive - that actually probably wouldn't work :), so... maybe swift will have adjustable part power by then!?
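To make that arithmetic concrete (my own back-of-the-envelope, using the usual aim of keeping at least ~100 partitions on every drive):

part power 17 -> 2^17 = 131,072 partitions; over ~600 drives that's roughly 218 partitions per drive per replica
part power 16 -> 2^16 = 65,536 partitions; over ~600 drives that's roughly 109 partitions per drive per replica

which is why 16 starts to look like the floor if the cluster really does grow to ~600 drives.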

Too big is a problem now, too small is a problem later. I personally save hard problems for future me. You could get by with as much ... (more)

answered 2013-07-23 09:08:51 -0600 by anilbhargava777

Thanks clayg, that solved my question.



Stats

Asked: 2013-07-06 11:44:09 -0600

Seen: 349 times

Last updated: Jul 23 '13