Thursday, March 29, 2012

Critical role for R&E networks+commercial clouds in US government Big Data initiative

[It is great to see US and European governments undertake initiatives to promote the development of research into Big Data utilizing commercial clouds.
Many cloud providers are offering free resources to support these initiatives. R&E networks will play a critical role in linking researchers to the commercial clouds and in developing collaboration platforms and portals. The recent Apache Rave announcement, in partnership with XSEDE and COmanage in the US and SURFconext in the Netherlands, is a great example of developing “Research as a Service” using commercial clouds. See Ian Foster's presentation. Peering with commercial cloud providers will also be critical.

I have long argued that the development of commercial clouds to support research will fundamentally change cyber-infrastructure at universities. As Dr Ed Lazowska commented in a New York Times article: “The need to analyze vast amounts of data from a broad array of sensors is going to be far more pervasive than the use of numerical simulation - even though the use of numerical simulation continues to increase. Even in fields such as national security and scientific discovery, for decades the flagships for HPC, large-scale data analysis is growing to equal importance. And this requires entirely different hardware and software architectures than does traditional HPC.” HPC will remain an important niche, but analyzing large volumes of data is ideally suited to commercial clouds.

Many people have argued for publicly funded academic clouds. The big disadvantage of an academic cloud is that it requires new infrastructure every few years to meet ongoing demand for additional computing resources. So from a funding council's perspective there is an ongoing requirement to upgrade computing resources, whether they are stand-alone systems or lumped together within an academic cloud. With commercial clouds, by contrast, funding agencies do not have to purchase infrastructure to enable researchers to use these facilities. Commercial cloud providers make the necessary investment to upgrade their infrastructure over time as demand warrants. Many of them spend hundreds of millions per year on computer upgrades, which dwarfs the annual amount most funding councils spend on HPC facilities.

Many R&E networks are providing brokered commercial cloud services, which will further reduce the cost of using clouds (for those that are not free) – BSA]

Aiming to make the most of the fast-growing volume of digital data, the Obama
Administration today announced a “Big Data Research and Development Initiative.” By
improving our ability to extract knowledge and insights from large and complex
collections of digital data, the initiative promises to help solve some of the Nation’s most
pressing challenges.
To launch the initiative, six Federal departments and agencies today announced more
than $200 million in new commitments that, together, promise to greatly improve the
tools and techniques needed to access, organize, and glean discoveries from huge
volumes of digital data.
National Institutes of Health – 1000 Genomes Project Data Available on Cloud:
The National Institutes of Health is announcing that the world’s largest set of data on
human genetic variation – produced by the international 1000 Genomes Project – is
now freely available on the Amazon Web Services (AWS) cloud. At 200 terabytes – the
equivalent of 16 million file cabinets filled with text, or more than 30,000 standard DVDs
– the current 1000 Genomes Project data set is a prime example of big data, where
data sets become so massive that few researchers have the computing power to make
best use of them. AWS is storing the 1000 Genomes Project as a publicly available
data set for free and researchers only will pay for the computing services that they use.

Accessing 1000 Genomes Data

AWS is making the 1000 Genomes Project data publicly available to the community free of charge. Public Data Sets on AWS provide a centralized repository of public data hosted on Amazon Simple Storage Service (Amazon S3). The data can be seamlessly accessed from AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic MapReduce (Amazon EMR), which provide organizations with the highly scalable compute resources needed to take advantage of these large data collections. AWS is storing the public data sets at no charge to the community. Researchers pay only for the additional AWS resources they need for further processing or analysis of the data. Learn more about Public Data Sets on AWS.

All 200 TB of the latest 1000 Genomes Project data is available in a public Amazon S3 bucket.
You can access the data via simple HTTP requests, or take advantage of the AWS SDKs in languages such as Ruby, Java, Python, .NET and PHP.
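
As a rough illustration of the "simple HTTP requests" route, the sketch below lists a few objects in a public S3 bucket using only the Python standard library. It is a minimal sketch under stated assumptions, not an official AWS recipe: the announcement does not name the bucket, so `1000genomes` is assumed here, and the helper names (`listing_url`, `parse_keys`, `list_objects`) are made up for this example. It relies on S3's REST bucket listing, which returns an XML document for an anonymous GET when the bucket permits public listing.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the post does not name the bucket; "1000genomes" is used for illustration.
BUCKET = "1000genomes"
# XML namespace used by S3 bucket-listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix="", max_keys=10):
    """Build the URL for an anonymous S3 bucket listing request."""
    query = urllib.parse.urlencode({"prefix": prefix, "max-keys": max_keys})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_text):
    """Extract (key, size in bytes) pairs from a ListBucketResult document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext(S3_NS + "Key"), int(item.findtext(S3_NS + "Size")))
        for item in root.iter(S3_NS + "Contents")
    ]


def list_objects(bucket=BUCKET, prefix="", max_keys=10):
    """Fetch one page of the bucket listing over plain HTTPS (needs network access)."""
    with urllib.request.urlopen(listing_url(bucket, prefix, max_keys)) as resp:
        return parse_keys(resp.read())


if __name__ == "__main__":
    for key, size in list_objects():
        print(f"{size:>15,d}  {key}")
```

Listing the bucket this way is free of AWS compute charges on the caller's side because it runs anywhere with internet access; the SDKs mentioned above wrap the same requests and add paging, retries, and authenticated access.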

Educators, researchers and students can apply for free credits to take advantage of the utility computing platform offered by AWS, along with Public Datasets such as the 1000 Genomes Project data. If you're running a genomics workshop or have a research project which could take advantage of the hosted 1000 Genomes dataset, you can apply for an AWS Grant.

Apache RAVE with XSEDE and SURFconext announcement

R&E Network and Green Internet Consultant.
twitter: BillStArnaud
skype: Pocketpro