- Web Tables
This page provides a large corpus of HTML tables for public download. The corpus has been extracted from the 2012 version of the Common Crawl and contains 147 million relational...
- Hyperlink Graph
The latest graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our...
- RDFa, Microdata, and Microformat Data Set
More and more websites have started to embed structured data describing products, people, organizations, places, and events into their HTML pages using markup standards such as...