Search for a Dataset - the Datahub


Open Bantu isiXhosa Lexicon

The isiXhosa Lexicon is an RDF dataset that consists of lexical and morphological data and English translations that are linked to WordNet RDF. The data is based on tabular noun...
- rdf/turtle
- html or rdf/xml
Ontos News Portal

The Ontos News Portal extracts facts (objects such as persons or organizations, as well as relations between them, e.g. a person working for an organization or living at a...
- text/turtle
- RDF
OLiA

The Ontologies of Linguistic Annotations (OLiA) provide an OWL/DL taxonomy of data categories as a reference for linguistic annotation (OLiA Reference Model), plus OWL/DL models...
- HTML
- rdf, owl
- application/x-zip-compressed
- example/rdf+xml
ISOcat

ISO 12620 provides a framework for defining data categories compliant with the ISO/IEC 11179 family of standards. According to this model, each data category is assigned a...
- html, rdf, dcif
- example/rdf+xml
- text/ttl
TDS

Typological Database System ontology
- RDF
- HTML
General Ontology of Linguistic Description

GOLD is an ontology for descriptive linguistics.
- OWL
IATE RDF

The IATE Dataset in RDF, converted from TBX
- TXT
LemonWiktionary

Lemon data extracted from Wiktionary
- example/rdf+xml
- xhtml, rdf/xml, turtle
- text/turtle
- HTML
American National Corpus - Open Portion

This dataset has no description
- JAR
- ZIP
- GZ
Multext-East

From the web site: Version 4 of the MULTEXT-East resources, a multilingual dataset for language engineering research and development. This dataset contains, for Bulgarian,...
- text/turtle
WordNet-RDF

RDF version of WordNet from Princeton
PanLex

A lexical database documenting translations among lexemes of language varieties.
- api/sparql
- HTML
- text/ntriples
- text/turtle
ConceptNet

WordNet-like concept network developed at MIT. ConceptNet aims to give computers access to common-sense knowledge, the kind of information that ordinary people know but usually...
- SQL
- HTML
WikiWord Thesaurus Data

Overview: The WikiWord-Thesaurus is a multilingual thesaurus derived from Wikipedia by extracting lexical and semantic information. It was originally developed for a...
TalkBank

About TalkBank: The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of...
Syntactic Reference Corpus of Medieval French (SRCMF)

The SRCMF contains 15 Old French texts totaling about 280,000 words. It has a high-quality manual annotation, based on a linguistically adequate dependency grammar. Annotation...
- HTML
- example/rdf+xml
The Rosetta Project

From the about page: The Rosetta Project is a global collaboration of language specialists and native speakers working to build a publicly accessible digital library of...
- HTML
- RDF
- example/rdf+xml
MetaShare metadata model

Ontology Metadata as LOD Availability: Freely Available Usage: Status: Newly created, in progress Description: LOD preliminary version of the MetaShare metadata model....
- text/rdf+ttl
linked hypernyms

This Linked Hypernym dataset assigns entity articles in English, German and Dutch Wikipedia a DBpedia resource or a DBpedia ontology concept as their type. The types are...
- HTML
- application/x-ntriples
Leipzig Corpora Collection (LCC)

Deutscher Wortschatz contains data generated from newspapers and web resources that are publicly available. The data were collected per language and encompass statistics about...
- RDF
- api/sparql
- example/rdf+xml
French TimeBank

The French TimeBank consists of a set of 109 journalistic articles from 7 different sub-genres annotated according to the ISO-TimeML standard, adapted for the French language....
- OL
- ISO-TimeML
Automated Similarity Judgment Program lexical data

ASJP collects 40 words from 5500 languages in a simplified phonetic representation. More background can be found at http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm
- example/rdf xml
- example/rdf+xml
- RDF
- application/x-ntriples
- meta/rdf-schema
Phonetics Information Base and Lexicon (PHOIBLE)

Phonetics Information Base and Lexicon (PHOIBLE) is a data set of phonological inventories with additional linguistic and non-linguistic information.
- datapkg/git
- HTML
- api/sparql
Linked Old Germanic Dictionaries

Lexical resources (word lists, etymological dictionaries) for Germanic languages in different historical stages: pre 1100 (incl. Gothic, Old High German, Old English),...
- HTML
- zip:ttl
Glottolog

Glottolog provides information about descriptive literature for all the world's languages. It also provides a language classification as well as knowledge bases for names,...
- zip:csv
- zip:bib
- example/rdf+xml
- application/x-ntriples
- RDF
- N3

You can also access this registry using the API (see API Docs).
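As a rough illustration of API access, the sketch below assumes the registry exposes a CKAN-style Action API with a `package_search` endpoint. The page itself does not state this, so check the API Docs before relying on it. The base URL `https://datahub.example.org` is a placeholder, and the helper names `build_search_url`, `dataset_titles`, and `search` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder host; substitute the registry's real base URL.
BASE_URL = "https://datahub.example.org"


def build_search_url(query, rows=26, base_url=BASE_URL):
    """Build a search URL, assuming a CKAN-style /api/3/action/package_search endpoint."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def dataset_titles(payload):
    """Pull dataset titles out of a decoded CKAN-style search response.

    Falls back to the dataset's short name when no title is present.
    """
    if not payload.get("success"):
        return []
    return [pkg.get("title", pkg.get("name", "")) for pkg in payload["result"]["results"]]


def search(query, rows=26):
    """Fetch one page of results and return the titles (requires network access)."""
    with urlopen(build_search_url(query, rows)) as resp:
        return dataset_titles(json.load(resp))
```

Calling `search("linguistics")` against a real endpoint of this shape would return a list of titles comparable to the listing above. The URL builder and the response parser are kept separate so each can be checked without network access.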

26 datasets found