Tuesday, March 25, 2008

Yahoo Search Architecture Based on Hadoop

How do you implement internet-scale search on top of open-source technologies?

Yahoo has solved this problem using the open-source Apache Hadoop distributed computing framework. Hadoop provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing model that also underlies Google's architecture.
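To make the model concrete, here is the canonical MapReduce example, word counting, written against Hadoop's classic org.apache.hadoop.mapred API (the one current as of this writing). It is a generic illustration of the map and reduce phases, not Yahoo! code: the mapper emits a (word, 1) pair for each token, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      // Emit (word, 1) for every token in the input line.
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      // The framework has already grouped all 1s for this word.
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

Compiled into a jar, it would typically be launched with something like hadoop jar wordcount.jar WordCount in/ out/, with HDFS paths for the input and output directories.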

The Yahoo! Search Webmap Hadoop application runs on a Linux cluster of more than 2,000 nodes, each with two quad-core processors, and produces data that is now used in every Yahoo! Web search query.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet, along with a vast array of data about each page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.
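Yahoo! has not published the Webmap code itself, but a toy version of one sub-problem, counting the inbound links of every page, gives a feel for how link-graph work maps onto Hadoop. Everything here is an assumption made for illustration: the input is taken to be plain-text "sourceURL targetURL" pairs extracted from the crawl, and the class and job names are invented.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Hypothetical inbound-link counter over "sourceURL targetURL" edge records.
public class InlinkCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text target = new Text();
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      String[] edge = line.toString().split("\\s+");
      if (edge.length == 2) {    // skip malformed records
        target.set(edge[1]);     // key on the page being linked TO
        out.collect(target, ONE);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text url, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) sum += counts.next().get();
      out.collect(url, new IntWritable(sum));  // url -> number of inbound links
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(InlinkCount.class);
    conf.setJobName("inlink-count");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

The shuffle phase groups the (target, 1) pairs by target URL across the whole cluster, and that grouping step is exactly what lets a graph with on the order of a trillion edges be aggregated in parallel.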

Some Webmap size data:
  • Links between pages in the index: roughly 1 trillion
  • Size of output: over 300 TB, compressed!
  • Cores used to run a single MapReduce job: over 10,000
  • Raw disk used in the production cluster: over 5 petabytes

Smaller companies that cannot afford to run their own Hadoop cluster can combine it with Amazon's EC2 and S3 cloud computing services. The result? Open-source utility computing at Google scale!
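What that combination might look like from code is sketched below, assuming the S3 filesystem support that ships with Hadoop releases of this era; the bucket name and AWS credentials are placeholders, not real values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical sketch: backing a Hadoop job with Amazon S3 instead of HDFS,
// e.g. when the compute nodes are rented EC2 instances. All values below
// are placeholders.
public class S3Backed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "s3://my-bucket");            // S3 block filesystem as default FS
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder credential
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder credential
    // Any MapReduce job driver built on this Configuration now reads its
    // input from and writes its output to the S3 bucket.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default filesystem: " + fs.getUri());
  }
}
```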
