-
OpenCalais
OpenCalais is a web service that extracts semantic metadata from text content, such as web pages.
-
Reuters-21578
A set of documents from Reuters' 1987 newswire that have been classified into categories. This dataset is appropriate for testing natural language processing and information retrieval...
-
The New York Times Annotated Corpus
From the website: The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007...
-
Microsoft Web N-Gram Service
Microsoft has developed services based on n-grams from Bing's entire en_US corpus. The raw public data available include two files with the top 100k words from this...
-
Web 1T 5-gram Version 1
This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to...