-
French TimeBank
The French TimeBank consists of a set of 109 journalistic articles from 7 different sub-genres annotated according to the ISO-TimeML standard, adapted for the French language.... -
Automated Similarity Judgment Program lexical data
ASJP collects 40 words from 5500 languages in a simplified phonetic representation. More background can be found at http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm -
Phonetics Information Base and Lexicon (PHOIBLE)
Phonetics Information Base and Lexicon (PHOIBLE) is a data set of phonological inventories with additional linguistic and non-linguistic information. -
Linked Old Germanic Dictionaries
Lexical resources (word lists, etymological dictionaries) for Germanic languages in different historical stages: pre-1100 (incl. Gothic, Old High German, Old English),... -
Glottolog
Glottolog provides information about descriptive literature for all the world's languages. It also provides a language classification as well as knowledge bases for names,... -
Chat Game corpus
A corpus resulting from an object arrangement game using a computer-mediated setting. -
MExiCo
MExiCo (short for "Multimodal Experiment Corpora") is a data model for data collections containing multimodal linguistic and interaction annotations. -
FiESTA
FiESTA (short for "Format for extensive spatiotemporal annotations") is a generic format for linguistic and behavioral annotations. -
Atlante Sintattico d'Italia (ASIt)
The Atlante Sintattico d'Italia (Syntactic Atlas of Italy, ASIt) enterprise builds on a long-standing tradition of collecting and analysing linguistic corpora, which has... -
Intercontinental Dictionary Series
1200 words in 200 languages -
World Loanword Database
The World Loanword Database, edited by Martin Haspelmath and Uri Tadmor, is a scientific publication by the Max Planck Digital Library, Munich (2009). It provides vocabularies... -
WikiWord
Overview: WikiWord is a system for building a multilingual thesaurus by extracting lexical and semantic information from Wikipedia. It was originally developed for a... -
The Speech Accent Archive
From website: The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read... -
Spanish Linguistic Datasets
Spanish Linguistic Datasets (SLD) is an open initiative to expose the Spanish linguistic resources maintained at OEG as Linked Data. It is worth noting that we host... -
MOCHA-TIMIT
Authors: Alan Wrench, Queen Margaret University College. Funded by: Engineering and Physical Sciences Research Council. When created: November 1999. Purpose:... -
Language Commons
This dataset has no description
-
Hungarian Language Corpora and Analyzers
Resources, including corpora and software, for processing the Hungarian language. Language resources: The Hunglish Corpus is a sentence-aligned Hungarian-English parallel corpus... -
Europarl Parallel Corpus
Overview from home page: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages:... -
english-gigaword
This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares... -
DBpedia Spotlight
DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud... -
Corpus de Textes Linguistiques Fondamentaux (CTLF)
This database contains more than 3,000 entries on major linguistic books on grammar, from Antiquity to the present. Major books will progressively be digitized and made available... -
Catalan WordNet
This dataset has no description
