Search for a Dataset - the Datahub

Wikisource

Wikisource is a repository of English-language text. As of October 2011, it contains over 240,000 pages. From the website: Wikisource is an online library of free content...
Spinn3r Indexing the Blogosphere

Spinn3r is a web service for indexing the blogosphere. We provide raw access to every blog post being published - in real time. We provide the data, and you can focus on...
Read the Web

This data includes facts extracted from 500 million web pages. From the project's website: To build a never-ending machine learning system that acquires the ability to extract...
- TSV
- HTML
OpenThesis

From the website: OpenThesis is a free repository of theses, dissertations, and other academic documents, coupled with powerful search, organization, and collaboration tools....
New Zealand Digital Library

The library is a collection of machine-readable texts and metadata, especially relating to New Zealand and the Asia/Pacific Region. From the website: [The library] provides...
Microsoft Web N-Gram Service

Microsoft has developed services based on n-grams from all of Bing's en_US corpus. The raw public data available include two files with the top 100k words from this...
- ZIP
Google Books Ngram

Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the...
- CSV
Enron Email Dataset

From the distribution page: This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150...
english-gigaword

This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares...
Beautiful Data Natural Language Corpus and Code

Ngrams and code from Dr. Peter Norvig's chapter for Beautiful Data (2009), edited by Segaran and Hammerbacher. Data files are derived from the Google Web Trillion Word Corpus,...
- ZIP

You can also access this registry using the API (see API Docs).

10 datasets found