Can stuffed elephants and large lakes run on many, many computers? Read on.
I've been hearing a lot about Hadoop, so I decided to read up on it. It's an interesting project of distributed proportions :-) From the Wikipedia page:
Apache Hadoop is a Java software framework that supports data intensive distributed applications, free licensed.[1] It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
Hadoop is a top level Apache project, being built and used by a community of contributors from all over the world.[2] Yahoo! has been the largest contributor[3] to the project and uses Hadoop extensively in its web search and advertising businesses.[4] IBM and Google have announced a major initiative to use Hadoop to support university courses in distributed computer programming.[5]
Hadoop was created by Doug Cutting (now a Cloudera employee)[6], who named it after his child's stuffed elephant. It was originally developed to support distribution for the Nutch search engine project.[7]
Stuffed elephant. Nice! I wonder if it can be helpful for our Moodle/iLearn install at SF State.
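To get a feel for the MapReduce model that inspired Hadoop, here is a toy word count in plain Python. It is only a single-process sketch of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps on many machines at once.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["stuffed elephant", "large lake", "large elephant"]))
```

The appeal is that each `map_phase` call only needs its own document and each `reduce_phase` call only needs its own key, so the work splits naturally across nodes.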
Then there is Tahoe, which I came across courtesy of Jason Stone (thanks, Jason). Tahoe is a p2p-style file system that encodes each file with extra redundancy and breaks it up into 10 pieces. These pieces are stored across multiple machines, and if some of them are lost, the original can still be recovered from any three of the ten. This magic is done using a variation of the Reed-Solomon error correction method, which is also widely used in CDs. From their website at Allmydata (which is where this project originally comes from):
The "Tahoe" project is a distributed filesystem, which safely stores files on multiple machines to protect against hardware failures. Cryptographic tools are used to ensure integrity and confidentiality, and a decentralized architecture minimizes single points of failure. Files can be accessed through a web interface or native system calls (via FUSE). Fine-grained sharing allows individual files or directories to be delegated by passing short URI-like strings through email. Tahoe grids are easy to set up, and can be used by a handful of friends or by a large company for thousands of customers.
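The three-out-of-ten trick is easier to believe once you see it run. Below is a minimal sketch of a Reed-Solomon-style erasure code in Python. It is not Tahoe's implementation: the prime field of size 257, the block layout, and the names `encode` and `decode` are all my own assumptions, chosen to keep the example short. Each block of three bytes is treated as three points on a polynomial, and every share stores that polynomial's value at a different x.

```python
P = 257  # smallest prime above 255, so every byte value fits in the field


def _interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (x, y) points, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, k=3, n=10):
    """Return {share_number: [symbols]} with n shares; any k can rebuild data."""
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    shares = {s: [] for s in range(1, n + 1)}
    for start in range(0, len(padded), k):
        block = padded[start:start + k]
        # The k data bytes are the polynomial's values at x = 1..k.
        points = [(x + 1, b) for x, b in enumerate(block)]
        for s in shares:
            shares[s].append(_interpolate(points, s))
    return shares


def decode(some_shares, length, k=3):
    """Rebuild the original bytes from any k of the shares."""
    if len(some_shares) < k:
        raise ValueError("need at least k shares")
    chosen = sorted(some_shares.items())[:k]
    out = bytearray()
    for b in range(len(chosen[0][1])):
        points = [(s, symbols[b]) for s, symbols in chosen]
        # Re-evaluate the polynomial at x = 1..k to get the data bytes back.
        out.extend(_interpolate(points, x) for x in range(1, k + 1))
    return bytes(out[:length])


if __name__ == "__main__":
    message = b"stuffed elephants and large lakes"
    shares = encode(message)
    survivors = {s: shares[s] for s in (2, 7, 10)}  # pretend seven were lost
    print(decode(survivors, len(message)))
```

In this sketch each share is a third the size of the file, so the ten shares together take roughly 3.3 times the original space. That is the price of being able to lose any seven of them.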
Now, if only I had access to a large number of machines to run this all on...