Faster Cloud Storage Uploads

"I wanna go fast!" --Ricky Bobby

This quote sums up how I feel about Internet technologies. I want fast mobile apps, fast cloud storage, and fast everything.

While migrating several thousand small files, totaling a couple of gigabytes, to the cloud, I saw an estimate of 12 hours show up. After about 30 minutes of transferring files I realized the estimate was accurate for that app. After trying a couple of other applications with similar estimates, I decided this was unacceptable.

With about an hour and a half to burn (the length of a Disney princess movie; parents, you'll understand) I decided to figure out why these tools were slow, come up with a better method, and write something to upload my files faster. Here is what I found.

Why So Slow?

I quickly found that the tools I had were slow because they handled connections poorly. They created a new connection for each file. Each new connection had to be negotiated (including SSL) and then had to spend time ramping up through TCP slow-start. With lots of small files, a transfer is often over before the connection has ramped up, so every file pays the setup cost and none of them gets the benefit.

If you want to understand TCP slow-start better, check out the article "TCP and the Lower Bound of Web Performance" by Steve Souders.

When it comes to transferring files, this was basically the worst-case scenario.
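To put rough numbers on that, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration, not a measurement from my transfer: 3 round trips of setup per connection (1 for TCP plus 2 for a full SSL handshake), a 100 ms round trip, 5000 files, and a simplified slow-start model where the window starts at 10 segments of 1460 bytes and doubles every round trip with no packet loss.

```python
import math


def connection_setup_seconds(file_count, rtt_ms, handshake_round_trips=3):
    """Time spent only on connection setup when every file gets a new connection.

    handshake_round_trips=3 is an assumption: 1 for TCP plus 2 for a full
    SSL/TLS handshake.
    """
    return file_count * handshake_round_trips * rtt_ms / 1000


def slow_start_round_trips(size_bytes, mss=1460, initial_window=10):
    """Round trips needed to send size_bytes on a fresh connection.

    Simplified model (an assumption): the congestion window starts at
    initial_window segments and doubles every round trip, with no loss.
    """
    segments = math.ceil(size_bytes / mss)
    window, sent, trips = initial_window, 0, 0
    while sent < segments:
        sent += window
        window *= 2
        trips += 1
    return trips


if __name__ == "__main__":
    # Hypothetical workload: 5000 files over a 100 ms round trip.
    print(connection_setup_seconds(5000, 100), "seconds of pure setup")
    print(slow_start_round_trips(1_000_000), "round trips for a 1 MB file")
```

Under those assumptions the setup alone costs 1500 seconds (25 minutes) before a single byte of file data moves, and each new connection starts the slow-start ramp from the beginning again.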

Copy to HP Cloud

To implement something faster I wrote a small app called copy-to-hpcloud that uses connection sharing. Being gainfully employed by HP Cloud, I was uploading the files to object storage. To upload thousands of files I needed just 2 HTTP connections: the first to identity services to handle authentication, and the second to object storage to upload the files.

This alone cut my upload time to 6 hours. Connection sharing saved me half my upload time.
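The idea can be sketched in a few lines of Python. This is not the copy-to-hpcloud code, just a minimal illustration using the standard library. It skips authentication and SSL, and the local CountingHandler server, the /container/ path, and the function names are all made up for the demo. The server counts how many TCP connections it sees so the two approaches can be compared.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Local stand-in for a storage endpoint: accepts PUTs, counts TCP connections."""

    protocol_version = "HTTP/1.1"  # lets the server keep connections alive
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # The server builds one handler per accepted TCP connection.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)  # discard the uploaded bytes
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def upload_per_file(host, port, files):
    """Slow pattern: open and close a connection for every file."""
    for name, data in files.items():
        conn = http.client.HTTPConnection(host, port)
        conn.request("PUT", "/container/" + name, body=data)
        conn.getresponse().read()
        conn.close()


def upload_shared(host, port, files):
    """Connection sharing: one connection carries every upload in series."""
    conn = http.client.HTTPConnection(host, port)
    try:
        for name, data in files.items():
            conn.request("PUT", "/container/" + name, body=data)
            conn.getresponse().read()  # drain the response before reusing the connection
    finally:
        conn.close()


def count_connections(upload, files):
    """Run an upload function against a throwaway local server; return connections used."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        upload(host, port, files)
        return CountingHandler.connections
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    demo_files = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
    print("per file:", count_connections(upload_per_file, demo_files))
    print("shared:  ", count_connections(upload_shared, demo_files))
```

Against a real service over SSL, each of those extra connections would also pay for a handshake and a fresh slow-start ramp, which is where the time goes.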

Taking It Further

This is not how I actually uploaded my files. Sharing my connection was faster, but sending numerous small files in series, one at a time, isn't all that fast. It may be faster than a new connection for each file, but we can do better.

The files I had were all in subdirectories with a structure like:


I ran a separate copy-to-hpcloud process for each of these subdirectories, which gave me 5 uploads running in parallel. Each process had its own connection, so I ended up with just 5 connections to object storage.

The actual time to upload was about 2 hours. It was limited by the upload process that took the longest.
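I did this by hand with separate processes, but the same pattern could be sketched inside one program. The version below swaps in threads for processes and is only an illustration: group_by_subdirectory, upload_in_parallel, and the example paths are made-up names, and upload_group stands for whatever function opens one shared connection and sends a group's files over it in series.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def group_by_subdirectory(paths):
    """Group relative file paths by their first path component."""
    groups = defaultdict(list)
    for path in paths:
        top = PurePosixPath(path).parts[0]
        groups[top].append(path)
    return dict(groups)


def upload_in_parallel(groups, upload_group):
    """Call upload_group(name, paths) for every group at once, one worker each.

    upload_group is expected to open a single connection and send its
    files over it in series, so N groups means N connections.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = {
            name: pool.submit(upload_group, name, paths)
            for name, paths in groups.items()
        }
        # result() waits for each worker, so the total wall time is set by
        # the slowest group.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Hypothetical paths; the fake uploader only counts files.
    example = ["one/a.txt", "one/b.txt", "two/c.txt"]
    counts = upload_in_parallel(
        group_by_subdirectory(example),
        lambda name, paths: len(paths),
    )
    print(counts)
```

One worker per subdirectory keeps the connection count small and predictable. If the subdirectories are very uneven in size, splitting the work into more evenly sized groups would likely shorten the longest upload.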

Two Takeaways

If anyone else has an application talking to remote storage, there are two takeaways that any application can implement to speed things up:

  1. Use connection sharing. For a few huge files this will not make a big difference, but lots of smaller files will see the speed bump. Negotiating connections and TCP slow-start are not cheap.
  2. Upload files in parallel where it makes sense. If you can do multiplexing, use it. This is something the SPDY protocol does.

Let's make the web a little faster.