5 datasets found

Licenses: Other (Not Open)
Tags: nlp

  • OpenCalais

    OpenCalais is a web service that extracts semantic metadata from text content, such as web pages.
  • Reuters-21578

    A set of classified documents that appeared on the Reuters newswire in 1987. This dataset is appropriate for testing natural language processing and information retrieval...
  • The New York Times Annotated Corpus

    From the website: The New York Times Annotated Corpus contains over 1.8 million articles written and published by The New York Times between January 1, 1987 and June 19, 2007...
  • Microsoft Web N-Gram Service

    Microsoft has built services based on n-grams drawn from Bing's entire en_US corpus. The publicly available raw data include two files with the top 100k words from this...
  • Web 1T 5-gram Version 1

    This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to...
You can also access this registry using the API (see API Docs).
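
As a rough illustration of what an API query for the same filters (tag "nlp", license "Other (Not Open)") might look like, here is a minimal sketch. It assumes the registry exposes a CKAN-style search endpoint; the host name, endpoint path, parameter names, and the license identifier `other-closed` are all assumptions and are not taken from this page. Check the API Docs for the real values. The function only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Hypothetical values: this listing does not state the registry's API address.
BASE_URL = "https://registry.example.org"        # placeholder host
SEARCH_PATH = "/api/3/action/package_search"     # assumed CKAN-style endpoint


def build_search_url(tag, license_id, rows=10):
    """Build a search URL filtered by tag and license (assumed parameters)."""
    # "fq" is assumed to be a filter-query parameter, as in CKAN-style APIs.
    filters = f'tags:"{tag}" license_id:"{license_id}"'
    query = urlencode({"fq": filters, "rows": rows})
    return f"{BASE_URL}{SEARCH_PATH}?{query}"


if __name__ == "__main__":
    # "other-closed" is an assumed identifier for "Other (Not Open)".
    print(build_search_url("nlp", "other-closed"))
```

The resulting URL could then be fetched with any HTTP client once the actual host and parameter names have been confirmed against the API Docs.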