Thursday, March 6, 2008

Cyber-infrastructure cloud tools for social scientists

[Here is a great example of using cyber-infrastructure cloud tools for social science applications. The NY Times project is typical of many social science projects where thousands of documents must be digitized and indexed. The cost savings compared to operating a cluster are impressive. Also it is exciting to see the announcement from NSF to promote industrial research partnership with Google and IBM on clouds. Thanks to Glen Newton for this pointer -- BSA]

Hadoop + EC2 + S3 = Super alternatives for researchers (& real people too!)

I recently discovered and have been inspired by a real-world and non-trivial (in space and in time) application of Hadoop (an open-source implementation of Google's MapReduce) combined with the Amazon Simple Storage Service (Amazon S3) and the Amazon Elastic Compute Cloud (Amazon EC2). The project was to convert scanned TIFF images of pre-1922 New York Times articles into PDFs of the articles:

4 TB of data loaded to S3 (TIFF images)
+ Hadoop (+ Java Advanced Imaging and various glue)
+ 100 EC2 instances
+ 24 hours
= 11M PDFs, 1.5 TB on S3

Unfortunately, the developer (Derek Gottfrid) did not say how much this cost the NYT. But here is my back-of-the-envelope calculation (using the Amazon S3/EC2 FAQ):

EC2: $0.10 per instance-hour x 100 instances x 24hrs = $240
S3: $0.15 per GB-Month x 4500 GB x ~1.5/31 months = ~$33
+ $0.10 per GB of data transferred in x 4000 GB = $400
+ $0.13 per GB of data transferred out x 1500 GB = $195
Total: ~$868

Not unreasonable at all! Of course this does not include the cost of bandwidth that the NYT needed to upload/download their data.
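For anyone who wants to redo this estimate with their own numbers, here is a minimal sketch of the same back-of-the-envelope arithmetic. The prices and quantities are the ones quoted above; the function name `estimate_cost` and its parameter names are just illustrative choices, not anything from Amazon or the NYT.

```python
# Prices as quoted in the estimate above (early-2008 Amazon figures).
EC2_PER_INSTANCE_HOUR = 0.10   # dollars per instance-hour
S3_PER_GB_MONTH = 0.15         # dollars per GB-month of storage
TRANSFER_IN_PER_GB = 0.10      # dollars per GB transferred in
TRANSFER_OUT_PER_GB = 0.13     # dollars per GB transferred out


def estimate_cost(instances=100, hours=24, stored_gb=4500,
                  storage_months=1.5 / 31, gb_in=4000, gb_out=1500):
    """Return a cost breakdown in dollars (hypothetical helper).

    Defaults reproduce the quantities used in the estimate above.
    """
    costs = {
        "ec2": EC2_PER_INSTANCE_HOUR * instances * hours,
        "s3_storage": S3_PER_GB_MONTH * stored_gb * storage_months,
        "transfer_in": TRANSFER_IN_PER_GB * gb_in,
        "transfer_out": TRANSFER_OUT_PER_GB * gb_out,
    }
    costs["total"] = sum(costs.values())
    return costs


costs = estimate_cost()
for item, dollars in costs.items():
    print(f"{item:>12}: ${dollars:,.2f}")

# Rough cost per PDF, using the 11M PDFs reported for the project.
cost_per_pdf = costs["total"] / 11_000_000
print(f"per PDF: ${cost_per_pdf:.6f}")
```

With the defaults, the total rounds to the same ~$868 as above, which works out to well under a hundredth of a cent per PDF. Changing `instances`, `hours` or the data sizes gives a quick feel for how a different job would be priced under the same rates.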

I've known about MapReduce and Hadoop for quite a while now, but this is the first use outside of Google (MapReduce) and Yahoo (Hadoop), and combined with Amazon services, where I've seen such a real problem solved so smoothly, and one that wasn't web indexing or a toy example.

As much of my work in information retrieval and knowledge discovery involves a great deal of space and even more CPU, I am looking forward to experimenting with this sort of environment (Hadoop, local or in a service cloud) for some of the more extreme experiments I am working on. And by using Hadoop locally, if the problem gets too big for our local resources, we can always buy capacity like the NYT example with a minimum of effort!

This is also a solution that various commercial organizations (and even individuals?) with specific high CPU / high storage / high bandwidth compute needs should be considering (oh, and transfers between S3 and EC2 are free). Of course, security and privacy concerns apply.

Breaking News:
NSF Teams w/ Google, IBM for Academic 'Cloud' Access

Feb. 25 -- Today, the National Science Foundation's Computer and Information Science and Engineering (CISE) Directorate announced the creation of a strategic relationship with Google Inc. and IBM. The Cluster Exploratory (CluE) relationship will enable the academic research community to conduct experiments and test new theories and ideas using a large-scale, massively distributed computing cluster.

In an open letter to the academic computing research community, Jeannette Wing, the assistant director at NSF for CISE, said that the relationship will give the academic computer science research community access to resources that would be unavailable to it otherwise.

"Access to the Google-IBM academic cluster via the CluE program will provide the academic community with the opportunity to do research in data-intensive computing and to explore powerful new applications," Wing said. "It can also serve as a tool for educating the next generation of scientists and engineers."

"Google is proud to partner with the National Science Foundation to provide computing resources to the academic research community," said Stuart Feldman, vice president of engineering at Google Inc. "It is our hope that research conducted using this cluster will allow researchers across many fields to take advantage of the opportunities afforded by large-scale, distributed computing."

"Extending the Google/IBM academic program with the National Science Foundation should accelerate research on Internet-scale computing and drive innovation to fuel the applications of the future," said Willy Chiu, vice president of IBM software strategy and High Performance On Demand Solutions. "IBM is pleased to be collaborating with the NSF on this project."

In October of last year, Google and IBM created a large-scale computer cluster of approximately 1,600 processors to give the academic community access to otherwise prohibitively expensive resources. Fundamental changes in computer architecture and increases in network capacity are encouraging software developers to take new approaches to computer-science problem solving. In order to bridge the gap between industry and academia, it is imperative that academic researchers are exposed to the emerging computing paradigm behind the growth of "Internet-scale" applications.

This new relationship with NSF will expand access to this research infrastructure to academic institutions across the nation. In an effort to create greater awareness of research opportunities using data-intensive computing, the CISE directorate will solicit proposals from academic researchers. NSF will then select the researchers to have access to the cluster and provide support to the researchers to conduct their work. Google and IBM will cover the costs associated with operating the cluster and will provide other support to the researchers. NSF will not provide any funding to Google or IBM for these activities.

While the timeline for releasing the formal request for proposals to the academic community is still being developed, NSF anticipates being able to support 10 to 15 research projects in the first year of the program, and will likely expand the number of projects in the future.

Information about the Google-IBM Academic Cluster Computing Initiative can be found at

According to Wing, NSF hopes the relationship may provide a blueprint for future collaborations between the academic computing research community and private industry. "We welcome any comparable offers from industry that offer the same potential for transformative research outcomes," Wing said.


Source: National Science Foundation