Long delay during job submission using Hadoop on OpenStack/Swift

We are experimenting with submitting jobs to our Hadoop cluster, which runs on OpenStack VMs, using input data stored in our Swift object storage. We are seeing a long delay during job submission while the client determines the number of input splits. The time to submit the job seems to scale linearly with the number of input files, especially when the Swift path contains a wildcard ('*').

Response times from our command-line python-swiftclient for individual list/stat calls are sub-second, so we don't think HTTP/Swift or network/storage delays are at play.

We believe we have tracked the problem down to the interaction between FileInputFormat's getSplits() method and the SwiftNativeFileSystem implementation. Depending on which version of the OpenStack SwiftFS implementation we try (< 3.0 or 3.0) and how we specify the input path, it appears to make two to three HEAD calls to Swift per input file. When running a job with 1200+ input files, the sheer number of HTTP REST calls to Swift causes a delay of 6+ minutes before the client actually submits the job to the cluster.
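As a rough sanity check on these numbers, here is a minimal back-of-envelope sketch. It assumes the HEAD calls are issued one after another and that the whole 6 minutes is spent on them. Both are simplifying assumptions, and the function names are only illustrative.

```python
def implied_latency_per_call(total_delay_s, num_files, calls_per_file):
    """Average seconds per HEAD call, assuming the entire delay is
    spent on sequential HEAD requests (an assumption, not a measurement)."""
    return total_delay_s / (num_files * calls_per_file)


def estimated_delay(num_files, calls_per_file, latency_s):
    """Projected submission delay in seconds for a given per-call latency."""
    return num_files * calls_per_file * latency_s


NUM_FILES = 1200          # input files in the job described above
TOTAL_DELAY_S = 6 * 60    # observed delay of roughly 6 minutes

for calls in (2, 3):
    latency = implied_latency_per_call(TOTAL_DELAY_S, NUM_FILES, calls)
    print(f"{calls} HEAD calls/file -> {NUM_FILES * calls} calls, "
          f"~{latency * 1000:.0f} ms each")
```

Under these assumptions each call would need to take only about 100 to 150 ms, which is consistent with the sub-second responses we see from python-swiftclient. Individually fast calls could still add up to a delay of this size.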

We would like to know whether anyone else is experiencing similar delays when submitting jobs with a large number of input files using Hadoop on Swift, and if so, whether and how you were able to work around them.