Programmers and web developers should look forward to books such as Programming Web Services - S3, EC2, SQS, and FPS by James Murty. The book will cover some of the most disruptive and useful web services available today:
- Amazon Simple Storage Service (just launched in Europe!)
- Amazon Elastic Compute Cloud (now in beta with new instances)
- Amazon Simple Queue Service (offers a reliable and scalable hosted queue)
- Amazon Flexible Payments Service (still in limited beta)
The New York Times has decided to make all of its public domain articles from 1851 to 1922 available free of charge. These articles are all images scanned from the original paper. In fact, all 11 million articles from 1851 to 1980 are available as images in PDF format. Generating a PDF version of each article takes quite a bit of work!
Derek achieved this with the help of Amazon S3, EC2, and Hadoop. As he describes it:
I quickly got to work copying 4TB of data to S3. Next I started writing code to pull all the parts that make up an article out of S3, generate a PDF from them and store the PDF back in S3. This was easy enough using JetS3t (an open source Java toolkit for S3), the iText PDF Library, and the Java Advanced Image Extension.
For deployment, I created a custom AMI (Amazon Machine Image) for EC2 that was based on a Xen image from my desktop machine. I logged in, started Hadoop and submitted a test job to generate a couple thousand articles, and to my surprise it just worked. It churned through all 11 million articles in just under 24 hours using 100 EC2 instances, and generated another 1.5TB of data to store in S3.
Now that this adventure can be called a success, I can’t imagine how we might have done it without Amazon S3 / EC2. The one caveat I will offer to people who are interested in doing something like this is that it is highly addictive.
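For a rough sense of scale, the figures quoted above can be turned into throughput numbers. This is only a back-of-the-envelope sketch and not part of Derek's account: it treats "just under 24 hours" as exactly 24 hours (so the rates are lower bounds), assumes decimal terabytes, and assumes the 1.5TB of output is spread evenly over one PDF per article.

```python
# Figures quoted in the post.
ARTICLES = 11_000_000
INSTANCES = 100
HOURS = 24              # "just under 24 hours", so rates below are lower bounds
OUTPUT_BYTES = 1.5e12   # 1.5 TB, assuming decimal units

instance_hours = INSTANCES * HOURS
articles_per_second = ARTICLES / (HOURS * 3600)
per_instance_per_second = articles_per_second / INSTANCES
avg_pdf_kb = OUTPUT_BYTES / ARTICLES / 1e3  # assumes one PDF per article

print(f"instance-hours:     {instance_hours}")
print(f"cluster throughput: {articles_per_second:.1f} articles/s")
print(f"per instance:       {per_instance_per_second:.2f} articles/s")
print(f"average PDF size:   {avg_pdf_kb:.0f} KB")
```

Under those assumptions the job used at most 2,400 instance-hours, with the cluster averaging at least about 127 articles per second, or a little over one article per second per instance.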
Indeed, an interesting utility computing success story!
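The shape of the job Derek describes (fetch an article's parts, render a PDF, store it back) is what makes it easy to spread over many machines: each article is independent of every other. The sketch below illustrates that shape only. It is not Derek's code, it uses plain Python dictionaries as hypothetical stand-ins for S3 buckets, a thread pool in place of Hadoop map tasks, and a stub `render_pdf` in place of a real PDF library; all key names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the input and output S3 buckets: key -> bytes.
source = {
    "scans/article-0001/part-0.tif": b"headline",
    "scans/article-0001/part-1.tif": b"column-1",
    "scans/article-0002/part-0.tif": b"notice",
}
output = {}

def article_prefixes(store):
    """Return the distinct article prefixes, one per article."""
    return sorted({key.rsplit("/", 1)[0] for key in store})

def render_pdf(parts):
    """Stub for the real rendering step; it just joins the parts."""
    return b"%PDF-stub\n" + b"\n".join(parts)

def process_article(prefix):
    """Map step: fetch all parts of one article, render, store the result."""
    keys = sorted(k for k in source if k.startswith(prefix + "/"))
    parts = [source[k] for k in keys]
    out_key = prefix.replace("scans/", "pdf/", 1) + ".pdf"
    output[out_key] = render_pdf(parts)
    return out_key

# Each article is processed independently, so the work parallelizes trivially.
with ThreadPoolExecutor(max_workers=4) as pool:
    written = list(pool.map(process_article, article_prefixes(source)))

print(written)
```

Because `process_article` reads only its own article's parts and writes only its own output key, adding more workers needs no coordination between them, which is the property that lets a job like this scale across 100 instances.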