Amazon Web Services hosts a wide variety of public data sets that anyone can access for free through AWS services.

Previously, large data sets such as the mapping of the Human Genome required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets and analyze them using Amazon EC2 instances or Amazon EMR (Hosted Hadoop) clusters. By hosting this important data where it can be quickly and easily processed with elastic computing resources, AWS hopes to enable more innovation, more quickly.

For help using these data sets, check out the Public Data Sets Forum hosted by Amazon. This support forum includes a thread where you can request additional Public Data Sets and tools. Before requesting something, be sure to check the Public Data Set Directory.

Directory Highlights

The directory currently contains 56 data sets and is growing. The sets are categorized into Astronomy, Biology, Chemistry, Climate, Economics, Encyclopedic, Geographic, and Mathematics. A few of the more interesting sets include:

  • Astronomy: Sloan Digital Sky Survey DR6 Subset – The Sloan Digital Sky Survey is the most ambitious astronomical survey ever undertaken. The survey has mapped one-quarter of the entire sky in detail, determining the positions and absolute brightnesses of hundreds of millions of celestial objects. It has also measured the distances (redshifts) to more than a million galaxies and quasars.
  • Biology: 1000 Genomes Project – The 1000 Genomes Project aims to build the most detailed map of human genetic variation, ultimately with data from the genomes of over 2,600 people from 26 populations around the world. The data in this release include results from sequencing the DNA of approximately the first 1,700 of those 2,600-plus people.
  • Chemistry: PubChem Library – PubChem provides information on the biological activities of small molecules. It is a component of NIH’s Molecular Libraries Roadmap Initiative.
  • Climate: NASA NEX – NASA NEX is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management and NASA remote-sensing data. Through NEX, users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows and results within and among other science communities.
  • Economics: Federal Contracts from the Federal Procurement Data Center – This data set is a dump from the Federal Procurement Data Center (FPDC), which manages the Federal Procurement Data System (FPDS-NG). FPDS-NG collects and disseminates procurement data, that is, information about contracts that the federal government awards to private companies. It summarizes who bought what, from whom, and where.
  • Encyclopedic: Google Books Ngrams – A data set containing the Google Books n-gram corpora. This data set is freely available on Amazon S3 in a Hadoop-friendly file format and is licensed under a Creative Commons Attribution 3.0 Unported License.
  • Encyclopedic: Wikipedia XML Data – A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML.
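
To give a feel for the kind of analysis the Google Books Ngrams set invites, here is a minimal Python sketch that totals how often one n-gram appears per year. It assumes each record has already been decoded into a tab-separated text row of the form n-gram, year, occurrences, pages, books; that layout is an assumption on my part, not something stated above, so check the data set's own documentation before relying on it. The function names and the sample rows are hypothetical and made up for illustration, not real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_ngram_row(line: str) -> Tuple[str, int, int]:
    """Parse one tab-separated row into (ngram, year, occurrences).

    Assumed layout (not confirmed by the article):
    ngram <TAB> year <TAB> occurrences <TAB> pages <TAB> books
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(
            f"expected at least 3 tab-separated fields, got {len(fields)}"
        )
    return fields[0], int(fields[1]), int(fields[2])


def occurrences_by_year(lines: Iterable[str], target: str) -> Dict[int, int]:
    """Sum occurrences per year for one n-gram, matched case-insensitively."""
    totals: Dict[int, int] = defaultdict(int)
    wanted = target.lower()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ngram, year, count = parse_ngram_row(line)
        if ngram.lower() == wanted:
            totals[year] += count
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample rows, only to show the expected shape of the input.
    sample_rows = [
        "data\t1999\t3\t2\t1",
        "Data\t1999\t4\t3\t2",
        "data\t2000\t10\t5\t4",
        "other\t2000\t7\t1\t1",
    ]
    for year, total in sorted(occurrences_by_year(sample_rows, "data").items()):
        print(year, total)
```

On the real data set you would run the same per-row logic inside a Hadoop or Amazon EMR job rather than over an in-memory list, since the full corpus is far too large for a single loop on one machine.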


If you would rather play with data in Google Fusion Tables (now available in Google Drive), as I do, you can find many datasets to work with in the Google Public Data Directory.

Upper Image: Churning Out Stars,  Photoconductor Array Camera and Spectrometer, Spectral and Photometric Imaging Receiver