Search for a Dataset - the Datahub

French TimeBank

The French TimeBank consists of a set of 109 journalistic articles from 7 different sub-genres annotated according to the ISO-TimeML standard, adapted for the French language....
- OL
- ISO-TimeML
Automated Similarity Judgment Program lexical data

ASJP collects 40 words from 5500 languages in a simplified phonetic representation. More background can be found at http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm
- example/rdf+xml
- RDF
- application/x-ntriples
- meta/rdf-schema
Phonetics Information Base and Lexicon (PHOIBLE)

Phonetics Information Base and Lexicon (PHOIBLE) is a data set of phonological inventories with additional linguistic and non-linguistic information.
- datapkg/git
- HTML
- api/sparql
Linked Old Germanic Dictionaries

Lexical resources (word lists, etymological dictionaries) for Germanic languages in different historical stages: pre-1100 (incl. Gothic, Old High German, Old English),...
- HTML
- zip:ttl
Glottolog

Glottolog provides information about descriptive literature for all the world's languages. It also provides a language classification as well as knowledge bases for names,...
- zip:csv
- zip:bib
- example/rdf+xml
- application/x-ntriples
- RDF
- N3
Chat Game corpus

A corpus resulting from an object arrangement game using a computer-mediated setting.
- text/turtle
MExiCo

MExiCo (short for "Multimodal Experiment Corpora") is a data model for data collections containing multimodal linguistic and interaction annotations.
- text/turtle
- example/turtle
FiESTA

FiESTA (short for "Format for extensive spatiotemporal annotations") is a generic format for linguistic and behavioral annotations.
- text/turtle
Atlante Sintattico d'Italia (ASIt)

The Atlante Sintattico d'Italia, Syntactic Atlas of Italy (ASIt) enterprise builds on a long-standing tradition of collecting and analysing linguistic corpora, which has...
- RDF
- XML
Intercontinental Dictionary Series

1200 words in 200 languages
- PNG
- text/n3
- api/sparql
- example/turtle
World Loanword Database

The World Loanword Database, edited by Martin Haspelmath and Uri Tadmor, is a scientific publication by the Max Planck Digital Library, Munich (2009). It provides vocabularies...
- RDF
- example/rdf+xml
- api/sparql
- text/n3
WikiWord

WikiWord is a system for building a multilingual thesaurus by extracting lexical and semantic information from Wikipedia. It was originally developed for a...
The Speech Accent Archive

From website: The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read...
Spanish Linguistic Datasets

Spanish Linguistic Datasets (SLD) is an open initiative to expose available Spanish linguistic resources maintained at OEG as Linked Data. It is worth noting that we host...
- api/sparql
MOCHA-TIMIT

Authors: Alan Wrench, Queen Margaret University College. Funded by: Engineering and Physical Sciences Research Council. When created: November 1999. Purpose:...
Language Commons

This dataset has no description
Hungarian Language Corpora and Analyzers

Resources, including corpora and software, for processing the Hungarian language. Language resources: The Hunglish Corpus is a sentence-aligned Hungarian-English parallel corpus...
- index/ftp
- gz:txt
- tgz
Europarl Parallel Corpus

Overview from home page: The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages:...
english-gigaword

This is a recipe to train word n-gram language models using the newswire text provided in the English Gigaword corpus (1200M words of NYT, APW, AFE, XIE). It also prepares...
DBpedia Spotlight

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud...
- application/x-ntriples
Corpus de Textes Linguistiques Fondamentaux (CTLF)

This database contains more than 3,000 notices on major linguistic books on grammar, from Antiquity to the present. Major books will progressively be digitized and made available...
Catalan WordNet

This dataset has no description
- tar.gz

You can also access this registry using the API (see API Docs).

172 datasets found