A corpus of web crawl data composed of 5 billion web pages.

This data set is freely available on Amazon S3 at s3://aws-publicdatasets/common-crawl/crawl-002/ and is stored in the ARC (.arc) file format.
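Most S3 tooling expects the bucket name and the key prefix as separate values rather than a single s3:// URI. The following is a minimal sketch of that split using only the Python standard library. The helper name split_s3_uri is hypothetical and not part of any Common Crawl or AWS tooling, and the code performs no network access, so it says nothing about whether the location is still being served.

```python
from urllib.parse import urlsplit

# Location quoted in the data set description above.
CORPUS_URI = "s3://aws-publicdatasets/common-crawl/crawl-002/"


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix).

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError("not an s3:// URI: %r" % uri)
    # The path keeps its leading slash; S3 key prefixes do not use one.
    return parts.netloc, parts.path.lstrip("/")


bucket, prefix = split_s3_uri(CORPUS_URI)
print(bucket)
print(prefix)
```

The two values can then be handed to whichever S3 client is in use, for example as the bucket and prefix arguments of an object-listing call.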

Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.


Data and Resources

About the Common Crawl Corpus (application/download)
A 1-pager describing the corpus, its format, link to terms of use, what you...


Additional Info

Field	Value
Source	http://aws.amazon.com/datasets/41740
Author	Common Crawl
Last Updated	October 10, 2013, 20:20 (UTC)
Created	May 9, 2012, 23:13 (UTC)