A corpus of web crawl data composed of 5 billion web pages. This data set is freely available on Amazon S3 at s3://aws-publicdatasets/common-crawl/crawl-002/ and is stored in the ARC (.arc) file format.
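
As a rough illustration of how the location and format above might be handled in code, the following Python sketch splits the S3 URI into its bucket and key prefix and parses a single ARC record header line. It assumes the version 1 ARC URL-record layout (URL, IP address, archive date, content type, length, separated by single spaces); the helper names `split_s3_uri` and `parse_arc_header` and the sample header line are hypothetical and are not taken from the data set itself.

```python
from typing import NamedTuple, Tuple
from urllib.parse import urlparse

# Location quoted in the data set description.
DATASET_URI = "s3://aws-publicdatasets/common-crawl/crawl-002/"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str  # 14-digit timestamp, YYYYMMDDhhmmss
    content_type: str
    length: int        # size in bytes of the record body that follows


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one URL-record header line, assuming the ARC v1 field order."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 space-separated fields, got {len(fields)}")
    url, ip_address, archive_date, content_type, length = fields
    return ArcRecordHeader(url, ip_address, archive_date, content_type, int(length))


if __name__ == "__main__":
    bucket, prefix = split_s3_uri(DATASET_URI)
    print(bucket, prefix)

    # Made-up header line for illustration only, not real crawl data.
    sample = "http://example.com/ 192.0.2.1 20120509231300 text/html 1234"
    print(parse_arc_header(sample))
```

The bucket and prefix returned by `split_s3_uri` could then be passed to whichever S3 client is in use; reading the record body would mean consuming `length` bytes after each header line.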

Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data for the purpose of driving innovation in research, education and technology. This data set contains web crawl data from 5 billion web pages and is released under the Common Crawl Terms of Use.

Data and Resources

Additional Info

Source: http://aws.amazon.com/datasets/41740
Author: Common Crawl
Last Updated: October 10, 2013, 20:20 (UTC)
Created: May 9, 2012, 23:13 (UTC)