Sunday, January 20, 2008

Google to Host Terabytes of Open-Source Science Data

[From a posting by Randy Burge on Dewayne Hendricks list --BSA]

By Alexis Madrigal
January 18, 2008 | 2:23:21 PM

Categories: Dataset, Research

Sources at Google have disclosed that a humble Google domain will
soon provide a home for terabytes of open-source scientific
datasets. The storage will be free to scientists and access to the
data will be free for all. The project, known as Palimpsest and first
previewed to the scientific community at the Science Foo camp at the
Googleplex last August, missed its original launch date this week, but
will debut soon.

Building on the company's acquisition of the data visualization
technology, Trendalyzer, from the oft-lauded, TED-presenting Gapminder
team, Google will also be offering algorithms for the examination and
probing of the information. The new site will have YouTube-style
annotating and commenting features.

The storage would fill a major need for scientists who want to openly
share their data, and would allow citizen scientists access to an
unprecedented amount of data to explore. For example, two planned
datasets are all 120 terabytes of Hubble Space Telescope data and the
images from the Archimedes Palimpsest, the 10th century manuscript
that inspired the Google dataset storage project.

UPDATE (12:01pm): Attila Csordas of Pimm has a lot more details on the
project, including a set of slides that Jon Trowbridge of Google gave
at a presentation in Paris last year. WIRED's own Thomas Goetz also
mentioned the project in his fantastic piece on freeing dark data.

One major issue with science's huge datasets is how to get them to
Google. In a post at business|bytes|genes|molecules, a SciFoo
attendee described the collection plan:

(Google people) are providing a 3TB drive array (Linux RAID5). The
array is provided in a “suitcase” and shipped to anyone who wants to
send their data to Google. Anyone interested gives Google the file
tree, and they SLURP the data off the drive. I believe they can extend
this to a larger array (my memory says 20TB).
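In practice, the suitcase workflow amounts to copying a file tree onto the loaned array and verifying it on arrival. A minimal sketch of that idea follows; every path and filename here is an illustrative assumption, not a detail from the article:

```shell
#!/bin/sh
# Hypothetical sketch of the "suitcase" transfer: stage a data tree
# onto a loaned drive array and record checksums so the receiving end
# can verify the copy after shipping. Paths are stand-ins.
set -e
SRC=./dataset                 # the scientist's file tree (toy example)
DEST=/tmp/google-array        # stand-in for the mounted RAID array

mkdir -p "$SRC" "$DEST"
echo "sample observation" > "$SRC/obs001.txt"   # toy data for the sketch

# copy the whole tree onto the array, preserving structure and metadata
cp -a "$SRC" "$DEST"/

# a manifest of checksums travels with the drive
( cd "$DEST" && find dataset -type f -exec md5sum {} \; > MANIFEST.md5 )

# the recipient re-runs the checksums to confirm nothing was corrupted
( cd "$DEST" && md5sum -c MANIFEST.md5 )
```

The checksum manifest is the important part: once the drive has been shipped and slurped, re-running `md5sum -c` confirms the bits survived the trip.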

You can check out more details on why hard drives are the preferred
distribution method at Pimm. And we hear that Google is hunting for
cool datasets, so if you have one, it might pay to get in touch with
Google.