Scaling Swift with many small documents

asked 2016-10-30 14:35:29 -0500

sjh-ubuntu1

updated 2016-10-31 02:16:13 -0500

I intend to develop an application that stores a large number of documents, and reliability is critical. I intend to index these documents using numerous 'tags' derived from the files, and to use those tags as the sole means of access. An initial exploratory sample, stored as files on a traditional file system, requires ~20GB (~4GB compressed) for ~1m documents. Any solution I develop should scale 20-fold immediately and be capable of scaling 1000-fold at low cost. I need low-latency operations (<40ms, say) to retrieve any file by ID and to add new files. I also need to support bulk processing of the data as a 'batch', where it would be acceptable to take several hours to process every file, for example to build a new index of tags after an update to the tagging scheme.

My understanding is that Swift is intended for applications like this, but I'm unsure how best to structure my application to make sure it will scale. Can anyone indicate how many documents I should expect to be able to store (redundantly) using a pair of very basic PCs? At what point will I need to move to a larger (dedicated) cluster? Should I be concerned about a specific configuration for the underlying OS (a particular file system, perhaps)? Do I need to find ways to combine related documents in order to ensure efficient storage and low-latency retrieval?


Comments

Disclaimer: I am not a Swift expert. Still, the number of documents a pair of basic PCs can hold depends entirely on the disk sizes in those PCs. You should use XFS as the filesystem type, as it allows extended attributes of any size (ext4 extended attributes seem to have a size limit).

Bernd Bausch (2016-11-01 23:46:39 -0500)

Thanks, Bernd. If I were to put a 6TB drive in each PC (each an i5 with 16GB RAM, say), can I expect to scale about 300-fold without compression, or 1,500-fold with compression? Would I not experience latency issues sooner than that? Are there any problems with having huge numbers of (small) files per directory?

sjh-ubuntu1 (2016-11-04 03:04:19 -0500)
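
The 300-fold and 1,500-fold figures in the comment above can be checked with a few lines of arithmetic. This is only a rough sketch under stated assumptions: decimal units (1GB = 10^9 bytes), one full replica held on each 6TB drive, and no allowance for filesystem or Swift metadata overhead, so real capacity would be lower.

```python
# Figures quoted in the question.
SAMPLE_DOCS = 1_000_000
SAMPLE_RAW_BYTES = 20 * 10**9         # ~20GB uncompressed
SAMPLE_COMPRESSED_BYTES = 4 * 10**9   # ~4GB compressed

# Figure quoted in the comment: one 6TB drive per PC.
# Assumption: each drive holds one complete replica of the data.
DRIVE_BYTES = 6 * 10**12


def scale_factor(drive_bytes, sample_bytes):
    """Number of whole copies of the sample that fit on one drive."""
    return drive_bytes // sample_bytes


raw_factor = scale_factor(DRIVE_BYTES, SAMPLE_RAW_BYTES)
compressed_factor = scale_factor(DRIVE_BYTES, SAMPLE_COMPRESSED_BYTES)
avg_doc_bytes = SAMPLE_RAW_BYTES // SAMPLE_DOCS

print("scale factor, raw:", raw_factor)
print("scale factor, compressed:", compressed_factor)
print("average raw document size in bytes:", avg_doc_bytes)
print("documents per drive, raw:", SAMPLE_DOCS * raw_factor)
print("documents per drive, compressed:", SAMPLE_DOCS * compressed_factor)
```

At the uncompressed limit this is on the order of 300 million objects per drive, each stored as its own file, which is why the file and inode counts are worth considering alongside raw capacity.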

The number of files per directory depends on the number of Swift partitions, which you define when you set up Swift. I doubt that the data volume alone will affect the latency much; the data access rate certainly will.

Bernd Bausch (2016-11-05 02:33:07 -0500)
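
To illustrate how the partition count spreads objects across directories, here is a simplified sketch modeled on the way Swift's ring is generally described: the object path is hashed with MD5 together with cluster-specific salts, and the top `part_power` bits of the hash select the partition. The function names, the salt values, and the example partition power are assumptions for illustration only; this is not Swift's own code and a real cluster's values will differ.

```python
import hashlib
import struct


def object_partition(account, container, obj, part_power,
                     hash_prefix=b"", hash_suffix=b"example-suffix"):
    """Map an object name to a partition number (simplified sketch).

    hash_prefix and hash_suffix stand in for the cluster-wide salts;
    the values used here are hypothetical.
    """
    path = "/{}/{}/{}".format(account, container, obj).encode("utf-8")
    digest = hashlib.md5(hash_prefix + path + hash_suffix).digest()
    # Keep only the top part_power bits of the first 4 bytes of the hash.
    part_shift = 32 - part_power
    return struct.unpack_from(">I", digest)[0] >> part_shift


def mean_objects_per_partition(total_objects, part_power):
    """Average number of objects that land in each partition."""
    return total_objects / 2 ** part_power


# Hypothetical example: partition power 18 gives 2**18 = 262144 partitions.
PART_POWER = 18
part = object_partition("AUTH_demo", "documents", "doc-000001", PART_POWER)
print("partition:", part)
print("mean objects per partition at 300M objects:",
      mean_objects_per_partition(300_000_000, PART_POWER))
```

A larger partition power means fewer objects per partition directory, but the total number of files (and therefore inodes) on each disk stays the same.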

Here is an (old) article by people who did use Swift for many small files: http://engineering.spilgames.com/open.... It turns out that the large number of inodes, not the number of directory entries, can cause trouble. Try to partition your disks to lower the number of inodes.

Bernd Bausch (2016-11-05 02:39:15 -0500)