Getting 'not found' from some POST requests

asked 2013-08-01 21:11:37 -0600

jcburke

For my use of Swift I need to check whether a file exists and update its expiration time at the same time (or as close together as possible; it doesn't need to be perfect). I was planning to do this by sending the POST request that updates the expiration time and then deciding whether the file exists based on whether I got back a 202 or a 404.

However, I have noticed that my cluster gives me quite a few false negatives: I have a test that uploads a file and then updates its expiration 1000 times, and on average about 3 of those attempts return a 404. For comparison, when I run the same test and GET the file 1000 times instead of POSTing to it, I don't see any 404s (through 5 runs of the test).

Is this to be expected? I know 3/1000 isn't high, but it's still enough to make me wary of relying on it, since such mistakes could be rather costly in my case and there are a lot of files in play. Any idea what could be causing this?

Also, I'm not sure if this is relevant, but for some reason these 404s are not evenly distributed among the Swift hosts: 2 of our 6 Swift servers return about 60% of them, and one of the hosts returns only 4% of them. However, all of these hosts have exactly the same configuration (we are running all of the Swift services on each one) and, from what I can tell, receive roughly the same amount of traffic.

Thanks in advance, any help will be greatly appreciated.
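
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of such a test loop in Python. It is not the original test: the function names, the object URL, and the token are hypothetical placeholders, and it assumes the expiration is updated with Swift's X-Delete-After header.

```python
import urllib.error
import urllib.request
from collections import Counter


def make_post_sender(object_url, auth_token, delete_after_seconds):
    """Build a callable that POSTs a new expiration and returns the status code.

    object_url and auth_token are placeholders for your own cluster (assumption).
    """
    def send():
        request = urllib.request.Request(
            object_url,
            method="POST",
            headers={
                "X-Auth-Token": auth_token,
                "X-Delete-After": str(delete_after_seconds),
            },
        )
        try:
            with urllib.request.urlopen(request) as response:
                return response.status
        except urllib.error.HTTPError as error:
            # urllib raises on 4xx/5xx; the status code is what we want to count.
            return error.code
    return send


def count_statuses(send_request, attempts=1000):
    """Call send_request() `attempts` times and tally the returned status codes."""
    tally = Counter()
    for _ in range(attempts):
        tally[send_request()] += 1
    return tally
```

Calling count_statuses(make_post_sender(url, token, 3600)) against a test object would give a tally of 202s and 404s comparable to the numbers above.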

4 answers

answered 2013-08-01 21:52:34 -0600

clay-gerrard

I have no idea, and find this very interesting. Please keep digging.

First, do you have object_post_as_copy = false in your proxy-server.conf? (The default is true.) It dramatically affects how Swift behaves on a POST to an object.

Second, you'll need to grab the transaction ID on these requests so you can trace the logs across all the servers involved. It comes back in the HTTP response headers; I think newer versions of python-swiftclient make it readily available as part of the response.
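
As a rough illustration, the transaction ID can be pulled out of a response's headers like this. The sketch assumes the proxy returns it in the X-Trans-Id header and that the headers are available as a dict-like mapping; the helper name is made up for this example.

```python
def get_transaction_id(headers):
    """Return the value of the X-Trans-Id header, or None if it is absent.

    Header names are compared case-insensitively, since HTTP clients
    differ in how they normalize them.
    """
    for name, value in headers.items():
        if name.lower() == "x-trans-id":
            return value
    return None
```

Logging this value next to every 404 makes it possible to grep for the same ID in the proxy and object server logs.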

My guess is that the servers holding the primary replicas are busy, and connection timeouts are landing the majority of the 3 backend requests on handoff nodes. The traced request for the transaction ID would tell me whether this is the case.

GL!

answered 2013-08-01 22:36:59 -0600

jcburke

To answer your first question: yes, we do have object_post_as_copy set to false (I asked a question about it a little while back: https://answers.launchpad.net/swift/+question/232498). As I said in that question, we do a lot of POSTs, so turning post_as_copy back on isn't really a viable option. I'm hoping that isn't the source of this issue; unfortunately, we didn't have much logging in place when we changed that option, and I'm not sure whether this problem was happening before we made the change.

As for your second question, I'm not seeing transaction IDs in the headers returned to me by Swift. I can see them in the Swift logs, but it would be great if I could tie them directly to the requests that failed rather than guessing based on what the log looks like. Am I doing something wrong that is causing me not to get these headers? I have been looking for them by talking to Swift using curl in verbose mode.

answered 2013-08-01 23:51:43 -0600

jcburke

Here are some logs: First the file being uploaded, then an example of a transaction where one of the three POSTs failed (so the client probably didn't get a 404) and an example where all 3 failed (so the client must have gotten a 404).

Aug 1 16:32:00 swift100 object-server 172.17.10.72 - - [01/Aug/2013:23:32:00 +0000] "PUT /d1/883816/AUTH_software/buildcache_test/test" 201 - "-" "txc34cfe1ffc024885bfa9cd71b90470d2" "-" 0.1009
Aug 1 16:32:00 swift103 container-server 172.17.10.71 - - [01/Aug/2013:23:32:00 +0000] "PUT /d2/467790/AUTH_software/buildcache_test/test" 201 - "txc34cfe1ffc024885bfa9cd71b90470d2" "-" "-" 0.0006
Aug 1 16:32:00 swift104 container-server 172.17.10.76 - - [01/Aug/2013:23:32:00 +0000] "PUT /d1/467790/AUTH_software/buildcache_test/test" 201 - "txc34cfe1ffc024885bfa9cd71b90470d2" "-" "-" 0.0005
Aug 1 16:32:00 swift104 object-server 172.17.10.72 - - [01/Aug/2013:23:32:00 +0000] "PUT /d3/883816/AUTH_software/buildcache_test/test" 201 - "-" "txc34cfe1ffc024885bfa9cd71b90470d2" "-" 0.0760
Aug 1 16:32:00 swift105 container-server 172.17.10.75 - - [01/Aug/2013:23:32:00 +0000] "PUT /d2/467790/AUTH_software/buildcache_test/test" 201 - "txc34cfe1ffc024885bfa9cd71b90470d2" "-" "-" 0.0005
Aug 1 16:32:00 swift105 object-server 172.17.10.72 - - [01/Aug/2013:23:32:00 +0000] "PUT /d3/883816/AUTH_software/buildcache_test/test" 201 - "-" "txc34cfe1ffc024885bfa9cd71b90470d2" "-" 0.0757

Aug 1 16:32:05 swift101 object-server 172.17.10.73 - - [01/Aug/2013:23:32:05 +0000] "POST /d1/883816/AUTH_software/buildcache_test/test" 404 - "-" "tx98457d933aa14602a37456ca5b013d87" "-" 0.0003
Aug 1 16:32:05 swift102 proxy-server ERROR with Object server 172.17.10.71:6000/d1 re: Trying to POST /AUTH_software/buildcache_test/test: ConnectionTimeout (0.5s) (txn: tx98457d933aa14602a37456ca5b013d87) (client_ip: 172.17.23.35)
Aug 1 16:32:05 swift102 proxy-server ERROR with Object server 172.17.10.73:6000/d2 re: Trying to POST /AUTH_software/buildcache_test/test: ConnectionTimeout (0.5s) (txn: tx98457d933aa14602a37456ca5b013d87) (client_ip: 172.17.23.35)
Aug 1 16:32:04 swift104 object-server 172.17.10.73 - - [01/Aug/2013:23:32:04 +0000] "POST /d3/883816/AUTH_software/buildcache_test/test" 202 - "-" "tx98457d933aa14602a37456ca5b013d87" "-" 0.0024
Aug 1 16:32:04 swift105 object-server 172.17.10.73 - - [01/Aug/2013:23:32:04 +0000] "POST /d3/883816/AUTH_software/buildcache_test/test" 202 - "-" "tx98457d933aa14602a37456ca5b013d87" "-" 0.0025

Aug 1 16:32:12 swift101 object-server 172.17.10.73 - - [01/Aug/2013:23:32:12 +0000] "POST /d1/883816/AUTH_software/buildcache_test/test" 404 - "-" "tx3f5e36fd69a54690b5050dc0643f520b" "-" 0.0002
Aug 1 16:32:11 swift102 proxy-server ERROR with Object server 172.17.10.75:6000/d3 re: Trying to POST /AUTH_software/buildcache_test/test: ConnectionTimeout (0.5s) (txn: tx3f5e36fd69a54690b5050dc0643f520b) (client_ip: 172.17.23.35)
Aug 1 16:32:12 swift102 proxy-server ERROR with Object server 172.17.10.76:6000/d3 re: Trying to POST /AUTH_software/buildcache_test/test: ConnectionTimeout (0.5s) (txn: tx3f5e36fd69a54690b5050dc0643f520b) (client_ip: 172.17.23.35)
Aug 1 16:32:12 swift102 proxy-server ERROR with Object server 172.17.10.71:6000/d1 re: Trying to POST /AUTH_software/buildcache_test/test: ConnectionTimeout (0.5s) (txn: tx3f5e36fd69a54690b5050dc0643f520b) (client_ip: 172.17.23.35)
Aug 1 16:32 ... (more)
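
Tracing log entries like these by hand gets tedious, so here is a rough Python sketch that groups log lines by transaction ID and lists what happened to each backend request. The regular expressions are assumptions based only on the line formats shown in this post (a transaction ID of "tx" plus 32 hex digits, and a quoted request followed by a status code); other Swift versions or log formats may need different patterns.

```python
import re
from collections import defaultdict

# Transaction IDs as they appear in the logs above: "tx" + 32 hex digits.
TXN_RE = re.compile(r"\btx[0-9a-f]{32}\b")
# Access-log entries such as: "POST /d1/..." 404
ACCESS_RE = re.compile(r'"(?P<method>[A-Z]+) \S+" (?P<status>\d{3}) ')


def summarize_by_transaction(lines):
    """Map each transaction ID to the list of backend outcomes seen for it."""
    outcomes = defaultdict(list)
    for line in lines:
        txn = TXN_RE.search(line)
        if txn is None:
            continue  # not tied to a transaction we can trace
        if "ConnectionTimeout" in line:
            outcomes[txn.group(0)].append("ConnectionTimeout")
            continue
        access = ACCESS_RE.search(line)
        if access is not None:
            outcomes[txn.group(0)].append(
                f"{access['method']} {access['status']}"
            )
    return dict(outcomes)
```

Feeding it the combined syslog from all the Swift hosts gives, per transaction, a quick view of how many backend requests timed out and how many returned 404 or 202.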

answered 2013-08-02 17:57:01 -0600

clay-gerrard

Yeah, so it's the connection timeouts pushing requests to handoff nodes, which can't update an object they don't have (or at least don't accept an update for an object they don't have; I was actually thinking about object metadata again, and there may be some things we could do differently).

You should monitor and graph timeouts and handoffs; if they get high for your workload, you need to tweak your cluster.

Since it's an object server connection timeout, in this case I'd probably bump the worker count on the object servers. If I ran out of CPU before I got my connection timeouts down, I'd maybe look at rate-limiting or increasing the timeout. Crazier stuff to play with might be reducing the timeout on the connection to the container servers to force container updates into async, or the new threadpool counts for the disk (watch out, that's per disk, per worker!).
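
To make the "per disk, per worker" warning concrete, here is a small arithmetic sketch in Python. The function name and the example numbers are purely illustrative assumptions, not recommended settings.

```python
def total_disk_threads(workers, disks, threads_per_disk):
    """Total disk threads on one host when the pool is per disk, per worker."""
    return workers * disks * threads_per_disk


# Hypothetical host: 8 object-server workers, 12 disks, 4 threads per disk.
print(total_disk_threads(workers=8, disks=12, threads_per_disk=4))
```

With those made-up numbers a single host ends up with 384 threads, so raising the worker count and the threadpool size together multiplies quickly.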

Is your benchmark scenario similar to your expected workload? You'd probably see the connection timeouts go away if you reduce your client concurrency.


Stats

Asked: 2013-08-01 21:11:37 -0600

Seen: 25 times

Last updated: Aug 02 '13