Tuesday, November 6, 2007

It is Time to Learn Amazon EC2 and S3

Utility computing is here to conquer the IT world! Amazon Vice President Adam Selipsky has recently unveiled the newest usage statistics for Amazon S3. Its community of more than 265,000 developers has now stored over 10 billion (10,000,000,000) objects in S3, with the service handling a peak of 27,601 transactions per second. That's huge progress compared to the roughly 5 billion objects stored just a few months ago!

Programmers and web developers should look forward to books such as Programming Web Services: S3, EC2, SQS, and FPS by James Murty. The book will cover the most disruptive and useful web services available today:
  • Amazon Simple Storage Service (just launched in Europe!)
  • Amazon Elastic Compute Cloud (now in beta with new instances)
  • Amazon Simple Queue Service (offers reliable and scalable hosted queue)
  • Amazon Flexible Payments Service (still in limited beta)
To illustrate the power of Amazon Web Services, check out this blog entry by Derek Gottfrid: Self-service, Prorated Super Computing Fun!

The New York Times has decided to make all of its public domain articles from 1851 to 1922 available free of charge. These articles are all in the form of images scanned from the original paper. In fact, from 1851 to 1980, all 11 million articles are available as images in PDF format. Generating the PDF version of an article takes quite a bit of work, since each one has to be assembled from numerous smaller scanned images!

Derek has achieved this with the help of Amazon S3/EC2 and Hadoop!

I quickly got to work copying 4TB of data to S3. Next I started writing code to pull all the parts that make up an article out of S3, generate a PDF from them, and store the PDF back in S3. This was easy enough using JetS3t (an open source Java toolkit for S3), the iText PDF library, and the Java Advanced Imaging extension.
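As an aside, here is a rough sketch of what that JetS3t plus iText plumbing might look like in Java. This is not Derek's actual code; the credentials, bucket, and key names are invented for illustration, and it simply shows the pattern he describes: fetch an article's scanned parts from S3, glue them into one PDF, and push the result back.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.jets3t.service.impl.rest.httpclient.RestS3Service;
import org.jets3t.service.model.S3Bucket;
import org.jets3t.service.model.S3Object;
import org.jets3t.service.security.AWSCredentials;

import com.lowagie.text.Document;
import com.lowagie.text.Image;
import com.lowagie.text.pdf.PdfWriter;

public class ArticleToPdf {

    public static void main(String[] args) throws Exception {
        // Placeholder credentials and bucket name, purely for illustration.
        RestS3Service s3 = new RestS3Service(
                new AWSCredentials("MY_ACCESS_KEY", "MY_SECRET_KEY"));
        S3Bucket bucket = new S3Bucket("nyt-archive");

        // Assume the scanned parts of one article sit under known keys.
        String[] partKeys = { "1851/article-0001/part-1.png",
                              "1851/article-0001/part-2.png" };

        // Build the PDF in memory, one scanned image per page.
        ByteArrayOutputStream pdfBytes = new ByteArrayOutputStream();
        Document pdf = new Document();
        PdfWriter.getInstance(pdf, pdfBytes);
        pdf.open();
        for (String key : partKeys) {
            S3Object part = s3.getObject(bucket, key);
            pdf.add(Image.getInstance(readAll(part.getDataInputStream())));
            pdf.newPage();
        }
        pdf.close();

        // Store the finished PDF back in S3.
        S3Object result = new S3Object("1851/article-0001.pdf");
        result.setDataInputStream(new ByteArrayInputStream(pdfBytes.toByteArray()));
        result.setContentLength(pdfBytes.size());
        result.setContentType("application/pdf");
        s3.putObject(bucket, result);
    }

    // Drain an InputStream into a byte array.
    private static byte[] readAll(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        in.close();
        return out.toByteArray();
    }
}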

For deployment, I created a custom AMI (Amazon Machine Image) for EC2 that was based on a Xen image from my desktop machine. I logged in, started Hadoop, and submitted a test job to generate a couple thousand articles; to my surprise, it just worked. It churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3.
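And here is a minimal sketch of how such a map-only Hadoop job might be wired up with the old-style mapred API, assuming a text input file that lists one article identifier per line. The real conversion work (the S3 fetch and PDF assembly shown above) would live inside map(); the class and path names here are illustrative only, not the actual New York Times code.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PdfGenerationJob {

    // Each map call receives one line of the input file: an article identifier.
    public static class ConvertMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text articleId,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // Fetch the article's parts from S3, build the PDF, store it back
            // (see the JetS3t/iText sketch above), then record the outcome.
            output.collect(articleId, new Text("ok"));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(PdfGenerationJob.class);
        conf.setJobName("article-to-pdf");
        conf.setMapperClass(ConvertMapper.class);
        conf.setNumReduceTasks(0);              // map-only: no reduce phase needed
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Scaling this out to 100 EC2 instances then becomes mostly a matter of pointing Hadoop at a bigger cluster; the job itself does not change.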

Now that this adventure can be called a success, I can’t imagine how we might have done it without Amazon S3/EC2. The one caveat I will offer to people who are interested in doing something like this is that it is highly addictive.

Indeed, an interesting utility computing success story!
