Adding a storage node device to Swift generates errors

After installing a 4-node Swift cluster with Packstack (10.10.6.27 as the proxy node and 10.10.6.28, 10.10.6.29 and 10.10.6.30 as storage nodes), I followed this guide to add another storage device on a new node (10.10.6.34). Unfortunately, I keep getting the following errors when I run tail -f /var/log/messages on 10.10.6.28 (by the way, is this really the best way to debug problems on a node?):

Apr 23 16:18:13 host-10-10-6-28 rsyncd[3291]: name lookup failed for 10.10.6.29: Name or service not known
Apr 23 16:18:13 host-10-10-6-28 rsyncd[3291]: connect from UNKNOWN (10.10.6.29)
Apr 23 16:18:14 host-10-10-6-28 rsyncd[3291]: rsync to object/device1/objects/136720 from UNKNOWN (10.10.6.29)
Apr 23 16:18:14 host-10-10-6-28 rsyncd[3291]: receiving file list
Apr 23 16:18:14 host-10-10-6-28 rsyncd[3291]: rsync: recv_generator: mkdir "/device1/objects/136720/84b/85843027182d33309feccf85b13eb84b" (in object) failed: Permission denied (13)
Apr 23 16:18:14 host-10-10-6-28 rsyncd[3291]: *** Skipping any contents from this failed directory ***
Apr 23 16:18:14 host-10-10-6-28 rsyncd[3291]: sent 234 bytes  received 234 bytes  total size 16
Apr 23 16:18:14 host-10-10-6-28 xinetd[493]: EXIT: rsync status=0 pid=3291 duration=1(sec)
Apr 23 16:18:17 host-10-10-6-28 journal: container-replicator Beginning replication run
Apr 23 16:18:17 host-10-10-6-28 journal: container-replicator ERROR reading HTTP response from {'replication_port': 6001, 'zone': 4, 'weight': 2.0, 'ip': '10.10.6.34', 'region': 1, 'port': 6001, 'replication_ip': '10.10.6.34', 'meta': u'', 'device': 'device4', 'id': 3}: Connection refused
Apr 23 16:18:17 host-10-10-6-28 journal: container-replicator Replication run OVER
Apr 23 16:18:17 host-10-10-6-28 journal: container-replicator Attempted to replicate 1 dbs in 0.01321 seconds (75.70615/s)
Apr 23 16:18:17 host-10-10-6-28 journal: container-replicator Removed 0 dbs
Apr 23 16:18:17 host-10-10-6-28 journal: container-replicator 2 successes, 1 failures
Apr 23 16:18:17 host-10-10-6-28 journal: container-replicator no_change:2 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
Apr 23 16:18:22 host-10-10-6-28 journal: account-replicator Beginning replication run
Apr 23 16:18:22 host-10-10-6-28 journal: account-replicator ERROR reading HTTP response from {'replication_port': 6002, 'zone': 4, 'weight': 2.0, 'ip': '10.10.6.34', 'region': 1, 'port': 6002, 'replication_ip': '10.10.6.34', 'meta': u'', 'device': 'device4', 'id': 3}: Connection refused
Apr 23 16:18:22 host-10-10-6-28 journal: account-replicator Replication run OVER
Apr 23 16:18:22 host-10-10-6-28 journal: account-replicator Attempted to replicate 1 dbs in 0.01319 seconds (75.80878/s)
Apr 23 16:18:22 host-10-10-6-28 journal: account-replicator Removed 0 dbs
Apr 23 16:18:22 host-10-10-6-28 journal: account-replicator 2 successes, 1 failures
Apr 23 16:18:22 host-10-10-6-28 journal: account-replicator no_change:2 ts_repl:0 diff:0 rsync:0 diff_capped:0 hashmatch:0 empty:0
Apr 23 16:18:27 host-10-10-6-28 dhclient[439]: DHCPREQUEST on eth0 to 10.10.6.8 port 67 (xid=0x30076f67)
Apr 23 16:18:27 host-10-10-6-28 dhclient[439]: DHCPACK from 10.10.6.8 (xid=0x30076f67)
Apr 23 16:18:27 host-10-10-6-28 journal: object-auditor Begin object audit "forever" mode (ZBF)
Apr 23 16:18:27 host-10-10-6-28 journal: object-auditor ERROR auditing: #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/obj/auditor.py", line 247, in run_forever#012    self.run_once(**kwargs)#012  File "/usr/lib/python2.7/site-packages/swift/obj/auditor.py", line 258, in run_once#012    worker.audit_all_objects(mode=mode)#012  File "/usr/lib/python2.7/site-packages/swift/obj/auditor.py", line 79, in audit_all_objects#012    for path, device, partition in all_locs:#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1540, in audit_location_generator#012    files = sorted(listdir(hash_path), reverse=True)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1813, in listdir#012    return os.listdir(path)#012OSError: [Errno 13] Permission denied: '/srv/node/device1/objects/136720/84b/85843027182d33309feccf85b13eb84b'
Apr 23 16:18:27 host-10-10-6-28 journal: object-auditor Begin object audit "forever" mode (ALL)
Apr 23 16:18:27 host-10-10-6-28 journal: object-auditor ERROR auditing: #012Traceback (most recent call last):#012  File "/usr/lib/python2.7/site-packages/swift/obj/auditor.py", line 247, in run_forever#012    self.run_once(**kwargs)#012  File "/usr/lib/python2.7/site-packages/swift/obj/auditor.py", line 258, in run_once#012    worker.audit_all_objects(mode=mode)#012  File "/usr/lib/python2.7/site-packages/swift/obj/auditor.py", line 79, in audit_all_objects#012    for path, device, partition in all_locs:#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1540, in audit_location_generator#012    files = sorted(listdir(hash_path), reverse=True)#012  File "/usr/lib/python2.7/site-packages/swift/common/utils.py", line 1813, in listdir#012    return os.listdir(path)#012OSError: [Errno 13] Permission denied: '/srv/node/device1/objects/136720/84b/85843027182d33309feccf85b13eb84b'
Apr 23 16:18:29 host-10-10-6-28 dhclient[439]: bound to 10.10.6.28 -- renewal in 44 seconds.
Apr 23 16:18:30 host-10-10-6-28 journal: object-replicator Starting object replication pass.
Apr 23 16:18:30 host-10-10-6-28 journal: object-replicator Error syncing with node: {'replication_port': 6000, 'zone': 4, 'weight': 2.0, 'ip': '10.10.6.34', 'region': 1, 'port': 6000, 'replication_ip': '10.10.6.34', 'meta': u'', 'device': 'device4', 'id': 3}: Connection refused

As you can see, rsync fails to create a directory on device1 because of a permissions problem, even though the directory and its contents belong to the swift user. I have no idea why, because it seems to have nothing to do with the new node, yet this didn't happen before I added the node.
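
To back up the ownership claim with something concrete, the path from the rsync error can be inspected together with every parent directory, since a restrictive mode on any ancestor can also produce "Permission denied". The following is a minimal sketch, not part of Swift: describe_path_chain is a hypothetical helper name, it assumes Python 3 (stat.filemode is not available on the Python 2.7 shown in the tracebacks), and it only looks at the classic owner and mode bits.

```python
import os
import pwd
import stat


def describe_path_chain(path):
    """Return (path, owner, mode) for `path` and each parent, root first.

    Hypothetical helper for illustration; run it as root so that
    os.stat() can reach directories the swift user cannot.
    """
    rows = []
    current = os.path.abspath(path)
    while True:
        info = os.stat(current)
        try:
            owner = pwd.getpwuid(info.st_uid).pw_name
        except KeyError:
            # The uid has no passwd entry; fall back to the number.
            owner = str(info.st_uid)
        rows.append((current, owner, stat.filemode(info.st_mode)))
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root.
            break
        current = parent
    return rows[::-1]
```

Calling describe_path_chain("/srv/node/device1/objects/136720/84b") and printing each tuple would show, level by level, who owns each directory and whether that owner has write and execute permission on it.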

Then I get a connection refused from the new node. iptables is flushed and the default policy is ACCEPT. netstat -plnt shows that, as expected, there is a Python process listening on ports 6000, 6001 and 6002:

[root@host-10-10-6-34 ~]# netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.1:6000          0.0.0.0:*               LISTEN      9344/python         
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      9320/python         
tcp        0      0 127.0.0.1:6001          0.0.0.0:*               LISTEN      9326/python         
tcp        0      0 127.0.0.1:6002          0.0.0.0:*               LISTEN      9327/python         
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      503/sshd            
tcp6       0      0 :::22                   :::*                    LISTEN      503/sshd
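
One thing that can be checked from this output besides the port numbers is the Local Address column, because a service bound only to a loopback address will refuse connections coming from other hosts. The sketch below is a hypothetical helper (loopback_only_listeners is not a Swift or netstat tool) that assumes the column layout shown above and reports which of the given ports are listening on loopback only.

```python
def loopback_only_listeners(netstat_output, ports):
    """Return the subset of `ports` that listen only on a loopback address.

    Assumes `netstat -plnt` style lines where field 4 is the local
    address and field 6 is the socket state.
    """
    bound = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[5] != "LISTEN":
            continue  # header lines or non-listening sockets
        # Split on the last colon so IPv6 forms such as ":::22" work too.
        address, _, port = fields[3].rpartition(":")
        if port.isdigit():
            bound.setdefault(int(port), set()).add(address)
    loopback = {"127.0.0.1", "::1"}
    return sorted(p for p in ports if p in bound and bound[p] <= loopback)
```

Feeding it the captured output and the ports 6000, 6001 and 6002 gives a quick yes/no answer on whether the account, container and object servers are reachable from the other storage nodes.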

Then I get a new permission denied problem in the object-auditor (even though, as I said, /srv/node/device1 and its contents are owned by the swift user) and another connection refused in the object-replicator.
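
The object-auditor tracebacks in the log are hard to read because syslog has collapsed each one onto a single line, writing every newline as the escape #012 (octal 12, the line feed character). A small sketch for making them readable again follows; the function names are hypothetical, and the filter simply assumes that the daemon name and the word ERROR appear in the line, as they do in the excerpt above.

```python
def expand_syslog_escapes(line):
    """Turn the '#012' escapes written by syslog back into real newlines."""
    return line.replace("#012", "\n")


def daemon_errors(lines, daemon):
    """Yield readable copies of the ERROR lines logged by `daemon`.

    `lines` is any iterable of log lines, for example an open
    /var/log/messages file object.
    """
    for line in lines:
        if daemon in line and "ERROR" in line:
            yield expand_syslog_escapes(line.rstrip("\n"))
```

Iterating over daemon_errors(open("/var/log/messages"), "object-auditor") and printing each item would show the tracebacks in their usual multi-line form.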

Can you help me understand what I'm doing wrong (I'll provide more info if needed)?

On a side note, I saw some connection refused errors before adding the new node, both right after the installation and after uploading a test file to the cluster, but things seemed to stabilise after a few seconds. Is this normal behaviour (maybe a node opening too many connections to fetch the new objects as fast as possible and saturating the other node, even though the test file was only a few bytes), or should it not happen at all? I looked at the Packstack-installed nodes and they have the maximum number of rsync connections set to 25, while the guide I followed only allowed 2. I changed it to 25 on the new node hoping that would fix the connection refused errors, without any luck.

I'm sorry for the wall of text, but I'm trying to write a procedure to follow later in a production environment, and I'm having trouble finding useful information on running a stable Swift cluster where I can easily add and remove devices.

Finally, do you think having Packstack deploy only Swift is the way to go, or should I just deploy Swift and Keystone manually?

Thank you in advance.