-
Wikisource
Wikisource is a repository of English-language text. As of October 2011, it contains over 240,000 pages. From the website: Wikisource is an online library of free content... -
Reuters-21578
A set of documents from the 1987 Reuters newswire that have been labeled with categories. This dataset is appropriate for testing natural language processing and information retrieval... -
Read the Web
This data includes facts extracted from 500 million web pages. From the project's website: To build a never-ending machine learning system that acquires the ability to extract... -
RCV1-v2/LYRL2004
This is a publicly available, tokenized version of the Reuters RCV1 corpus by David D. Lewis et al. The creator requests attribution. -
The New York Times Annotated Corpus
From the website: The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007... -
New Zealand Digital Library
The library is a collection of machine-readable texts and metadata, especially relating to New Zealand and the Asia/Pacific Region. From the website: [The library] provides... -
Microsoft Web N-Gram Service
Microsoft has developed services based on n-grams from Bing's entire en_US corpus. The publicly available raw data include two files with the top 100k words from this... -
Web 1T 5-gram Version 1
This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to... -
Google Books Ngram
Here are the datasets backing the Google Books Ngram Viewer. These datasets were generated in July 2009; we will update these datasets as our book scanning continues, and the... -
english-gigaword
This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares... -
Beautiful Data Natural Language Corpus and Code
Ngrams and code from Dr. Peter Norvig's chapter for Beautiful Data (2009), edited by Segaran and Hammerbacher. Data files are derived from the Google Web Trillion Word Corpus,...
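
Several of the entries above (Web 1T 5-gram, Google Books Ngram, Beautiful Data) distribute word n-grams with their observed frequency counts. As a rough illustration of what such a table contains, here is a minimal sketch that counts n-grams in a toy sentence. The function name `ngram_counts`, the `max_n` parameter, and the whitespace tokenization are assumptions made for this example; they do not reflect how any of the listed datasets were actually built.

```python
from collections import Counter


def ngram_counts(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in a list of tokens.

    Hypothetical helper for illustration only; real corpora apply their
    own tokenization rules and frequency cutoffs.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy input, split on whitespace (an assumption, not a corpus convention).
text = "the cat sat on the mat"
counts = ngram_counts(text.split(), max_n=2)

# Show the n-grams from most to least frequent.
for ngram, freq in counts.most_common():
    print(" ".join(ngram), freq)
```

Keys are tuples of words, so unigrams and longer n-grams can share one table, which is roughly how the distributed count files pair each n-gram with a frequency.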