PUT costs a lot of time

asked 2013-08-02 15:55:28 -0600

249524324-q

When I PUT a large number of 4 KB files, the performance is terrible.

My testing configuration:

    [bench]
    concurrency = 200
    object_size = 4096
    num_objects = 200000
    num_containers = 200

I traced the code of the PUT operation and found some problems. It looks like three steps cost a lot of time in the PUT path.

In ObjectController.PUT:

    for chunk in iter(lambda: reader(self.network_chunk_size), ''):
        upload_size += len(chunk)
        if time.time() > upload_expiration:
            self.logger.increment('PUT.timeouts')
            return HTTPRequestTimeout(request=request)
        etag.update(chunk)
        while chunk:
            written = os.write(fd, chunk)
            chunk = chunk[written:]
        sleep()

The lambda: reader step costs about 600 ms when I PUT 4 KB files. eventlet.sleep() also costs about 400 ms. Finally, fsync() and async_update() cost about 400 ms.
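To confirm where the time goes, you can wrap each phase of the loop with a timer. Below is a minimal sketch of that idea (my own instrumentation, not Swift code; timed_put is a made-up helper name, and the reader is simulated here with io.BytesIO):

```python
import io
import time
from hashlib import md5

def timed_put(reader, chunk_size=4096):
    """Read chunks the way ObjectController.PUT does and record
    per-phase wall-clock time. `reader` is any callable that takes a
    size and returns bytes (in Swift it wraps the WSGI input)."""
    timings = {'read': 0.0, 'hash': 0.0}
    etag = md5()
    upload_size = 0
    while True:
        t0 = time.time()
        chunk = reader(chunk_size)       # time the network read
        timings['read'] += time.time() - t0
        if not chunk:
            break
        upload_size += len(chunk)
        t0 = time.time()
        etag.update(chunk)               # time the MD5 update
        timings['hash'] += time.time() - t0
    return upload_size, etag.hexdigest(), timings

# Feed a simulated 4 KB object through the loop.
body = io.BytesIO(b'x' * 4096)
size, digest, timings = timed_put(body.read)
```

The same pattern (timing the os.write loop and the fsync call) would narrow down which of the three steps dominates on your hardware.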

Has anyone else faced this problem? How can I fix this issue?


12 answers


answered 2013-08-12 02:33:02 -0600

249524324-q

Thanks Chuck Thier, that solved my question.


answered 2013-08-06 22:00:14 -0600

cthier

So a couple of thoughts:

The timeout errors lead me to believe that you are running out of disk I/O.

Your object-server workers setting is a lot higher than I would run for a server with 4 drives; that means there are 8 workers for each drive. I would cut it back to 4-8 and only increase it if further testing doesn't cause issues. If things are stable at that point, you can look at raising workers incrementally with further testing.
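For example, on a node with 4 drives the object-server config could start out like this (a sketch of the suggestion above, not a tuned recommendation):

```
[DEFAULT]
# roughly 1-2 workers per drive as a starting point
workers = 8
```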

I would also look at turning your disk caches off; they are likely masking the issue at the beginning. The method varies by OS and by whether you are using a RAID controller, but it is pretty easy to google.
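On Linux with plain SATA/SAS drives (no RAID controller), one common method is hdparm; a sketch, assuming /dev/sdb is a placeholder for one of the object drives:

```
# check whether the drive's write cache is enabled
sudo hdparm -W /dev/sdb

# disable the write cache for testing
sudo hdparm -W0 /dev/sdb
```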

The other thing I would check is whether the background processes are running too aggressively. You could try running with replication/auditors/etc. turned off, just as a test, to see if that has any effect. If so, then it would be worth tuning their concurrency and run delays.

That should be a good start.


Chuck


answered 2013-08-04 16:35:46 -0600

249524324-q

"From your email to openstack-dev I gather that you have 1 proxy, 5 storage nodes, each with 4TB drives." A: Yes, 1 proxy and 5 storage nodes, but each storage node has 4 × 4 TB drives.

  1. What concurrency are you running each service at (posting your configs would be great)? A: proxy workers = 32, object-server workers = 32, account workers = 16, container workers = 32.

  2. Are you using a RAID controller with a cache? Are the drives in a RAID configuration? A: NO

  3. Are the drive caches on or off? A: How can I tell whether they are on or off?

  4. What Filesystem are you using (XFS, EXT4, etc.) and what params are you using to format? A: XFS, inode size = 1024, mounted with params like "rw,noatime,nodiratime,nobarrier,logbufs=8".

  5. What is the partition power of your ring? A: The partition power is 18. Would it affect performance?
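For reference, the format and mount settings described in item 4 correspond roughly to commands like these (a sketch; /dev/sdb1 and the mount point are placeholders):

```
mkfs.xfs -i size=1024 /dev/sdb1
mount -t xfs -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sdb1 /srv/node/sdb1
```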

The errors include 'ConnectTimeout' and 'get final status timeout (10s)'.

Also, how many objects do you PUT before the performance starts degrading? A: In fact, it started degrading from the beginning; it kept dropping from 1900/s down to 100/s.

Some bench logs:

    2013-08-04 02:32:57,281 INFO 3527 PUTS [0 failures], 1763.3/s current:1763.4/s
    2013-08-04 02:33:12,282 INFO 22057 PUTS [0 failures], 1297.4/s current:1235.3/s
    2013-08-04 02:33:27,284 INFO 39651 PUTS [0 failures], 1239.0/s current:1172.7/s
    2013-08-04 02:33:42,291 INFO 48089 PUTS [0 failures], 1023.0/s current:562.3/s
    2013-08-04 02:33:57,292 INFO 64144 PUTS [0 failures], 1034.4/s current:1070.3/s
    2013-08-04 02:34:12,298 INFO 75191 PUTS [0 failures], 976.3/s current:736.2/s

It then kept between 500/s and 900/s for a period of time, after which it looked like:

    2013-08-04 05:33:06,339 INFO 5416916 PUTS [0 failures], 501.1/s current:683.1/s
    2013-08-04 05:33:21,339 INFO 5420003 PUTS [0 failures], 500.6/s current:205.8/s
    2013-08-04 05:33:36,350 INFO 5429288 PUTS [0 failures], 500.8/s current:618.6/s
    2013-08-04 05:33:51,381 INFO 5433035 PUTS [0 failures], 500.5/s current:249.3/s

Then:

    2013-08-04 10:41:24,565 INFO 12095430 PUTS [0 failures], 412.7/s current:368.1/s
    2013-08-04 10:41:39,604 INFO 12099253 PUTS [0 failures], 412.6/s current:254.2/s
    2013-08-04 10:41:54,606 INFO 12105207 PUTS [0 failures], 412.6/s current:396.9/s
    2013-08-04 10:42:09,615 INFO 12108471 PUTS [0 failures], 412.5/s current:217.5/s

At last:

    2013-08-04 13:58:32,020 INFO 15393452 PUTS [0 failures], 374.2/s current:336.8/s
    2013-08-04 13:58:47,160 INFO 15395591 PUTS [0 failures], 374.1/s current:141.3/s
    2013-08-04 13:59:02,172 INFO 15399491 PUTS [2 failures], 374.1/s current:259.8/s
    2013-08-04 13:59 ... (more)


answered 2013-08-04 14:34:13 -0600

cthier

So, getting back to the PUT performance: GET has always been much faster than PUT, since there is a bit more going on with PUTs (updating containers, fsync, etc.).

To really get down to the bottom of this, we will need to know more information about your config. From your email to openstack-dev I gather that you have 1 proxy, 5 storage nodes, each with 4TB drives.

Can you please provide the following information:

  1. What concurrency are you running each service at (posting your configs would be great)?
  2. Are you using a RAID controller with a cache? Are the drives in a RAID configuration?
  3. Are the drive caches on or off?
  4. What Filesystem are you using (XFS, EXT4, etc.) and what params are you using to format?
  5. What is the partition power of your ring?

You mention in the email that you start getting some errors like 503 when the performance starts lagging. Looking at the logs will help you figure out what is going on (look for things like timeouts, errors, etc.).

Also, how many objects do you PUT before the performance starts degrading?

Thanks,


Chuck


answered 2013-08-04 13:00:57 -0600

249524324-q

Sorry, I wrote it wrong: "As far as I know, the object will be dropped from the buffer cache immediately after GET" should be "As far as I know, the object will be dropped from the buffer cache immediately after PUT".

So there is no cache after a PUT. And every object I GET was a different one, so I think those objects should not be cached.


answered 2013-08-04 06:54:12 -0600

torgomatic

As far as I know, the object will be dropped from the buffer cache immediately after GET.

Nope. There's a threshold (5 MiB by default) for dropping cache on reads; objects under that threshold aren't flushed from the cache.

See https://github.com/openstack/swift/bl... for details.
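The threshold logic can be sketched with posix_fadvise (an illustration only; DROP_CACHE_MIN and maybe_drop_cache are made-up names here, not the actual Swift identifiers):

```python
import os

# Hypothetical threshold mirroring the 5 MiB default described above:
# only ask the kernel to evict pages for reads at least this large.
DROP_CACHE_MIN = 5 * 1024 * 1024

def maybe_drop_cache(fd, offset, length):
    """Drop the page cache for large reads; keep small objects cached."""
    if length >= DROP_CACHE_MIN and hasattr(os, 'posix_fadvise'):
        # POSIX_FADV_DONTNEED tells the kernel these pages won't be
        # needed again soon, so they can be evicted.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
        return True
    return False  # under the threshold: leave it in the buffer cache
```

So with lots of 4 KB objects, GETs keep being served from cache while every PUT still has to hit disk.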


answered 2013-08-04 04:28:22 -0600

249524324-q

As far as I know, the object will be dropped from the buffer cache immediately after GET.

There is code to drop the cache in DiskFile.put (swift/obj/server.py):

    if 'Content-Length' in metadata:
        self.drop_cache(fd, 0, int(metadata['Content-Length']))

So I think the objects are not kept in the cache.


answered 2013-08-03 22:44:54 -0600

torgomatic

Well, you've got a bunch of tiny objects, so they're probably being read out of the buffer cache on GET requests, whereas on PUT you're guaranteed some actual disk I/O.


answered 2013-08-03 08:06:46 -0600

249524324-q

Thank you for your answer! In fact, GET performance is very good: it reaches a stable 5000/s. But PUT speed ends up stuck at about 100/s. The object size is 4 KB. What could cause PUT performance to degrade while GET works well?


answered 2013-08-02 20:22:07 -0600

torgomatic

It's basically because disk I/O won't trigger a context switch, so when the disk is slower than the network, you get starvation.

See https://github.com/openstack/swift/co... for details.
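A toy cooperative scheduler (pure illustration, not Swift or eventlet code) shows why the explicit sleep() in the PUT loop matters: tasks switch only when they yield, so a task doing blocking disk I/O that never yields would starve everything else.

```python
def run_round_robin(tasks, steps=10):
    """Run generator-based tasks cooperatively, round-robin.

    A task gives up control only at each `yield`; if it did disk I/O
    without yielding, no other task could run in the meantime.
    """
    order = []
    queue = list(tasks)
    for _ in range(steps):
        if not queue:
            break
        name, gen = queue.pop(0)
        try:
            next(gen)                  # run until the task's next yield
            order.append(name)
            queue.append((name, gen))  # re-schedule it
        except StopIteration:
            pass
    return order

def worker(n_yields):
    for _ in range(n_yields):
        yield  # the cooperative "sleep()" point

# Two tasks that each yield politely: they interleave fairly.
order = run_round_robin([('a', worker(3)), ('b', worker(3))], steps=6)
# order alternates: ['a', 'b', 'a', 'b', 'a', 'b']
```

Remove the yield inside worker and one task would run to completion before the other gets a turn, which is the starvation the commit linked above addresses.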



Stats

Asked: 2013-08-02 15:55:28 -0600

Seen: 172 times

Last updated: Aug 12 '13