Search for a Dataset - the Datahub

USPTO Patent data

Linked Data version of the US Patent and Trademark Office (USPTO) data. Number of triples: 212,234,735. Number of resources: 3,215,768. Links to other datasets: DBpedia,...
- example/ntriples
- SPARQL
- text/rdf+ttl
- application/x-ntriples
- owl, ontology, meta/owl
- HTML
DBpedia abstract corpus

This corpus contains a conversion of Wikipedia abstracts in six languages (Dutch, English, French, German, Italian and Spanish) into the NLP Interchange Format (NIF)....
- GZ
- text/turtle
LODStats

LODStats: The Data Web Census Dataset.
- ttl
- nt
- api/sparql
- nt.tar.bz2
SemanticQuran

The Semantic Quran dataset is a multilingual RDF representation of translations of the Quran. The dataset was created by integrating data from two different semi-structured...
- gz:ttl
- gz:ttl:owl
- PDF
GWPP Glossary

The GWPP glossary is a set of scientific terms and their definitions that are used inside the Global Water Pathogen Project online book. This dataset is crowdsourced by a large...
- CSV
- RDF
- TTL
Lidioms

The LIDIOM dataset is a multilingual RDF representation of idioms in five languages. The dataset was crawled and integrated from various sources. For assuring the...
- HTML
- text/turtle
LinkLion - A Link Repository for the Web of Data

LinkLion is an open-source central repository for the storage of links among resources in the Linked Open Data web. The main goal of LinkLion is to facilitate the publication,...
- api/sparql
Linked TCGA

Linked TCGA is the RDF version of the Cancer Genome Atlas, a pilot project started in 2005 by the National Cancer Institute (NCI) and the National Human Genome Research...
- api/sparql
JRC-Names-MLODE

From their web site: JRC-Names is a highly multilingual named entity resource for person and organisation names (called 'entities'). It consists of large lists of names and...
- gzip
- text/turtle
- gz:nt
- api/sparql
- example/turtle
Caucasian Spiders

The Caucasian Spiders Database aims at containing all records (published occurrences) of spiders (Araneae) in the Caucasus Ecoregion (the rayons Krasnodar and Stavropol in...
- gzip:text/sql
- application/sql+gzip
CORDIS corpus

CORDIS (Community Research and Development Information Service) is the European Commission’s core public repository providing dissemination information for all EU-funded...
- GZ
CORDIS

todo
- RDF
aksw.org Research Group dataset

This dataset contains projects, subgroups, people and pages of the Agile Knowledge Management and Semantic Web (AKSW) Research Group at the University of Leipzig.
- text/turtle
- RDF
- example/turtle
- HTML
- api/sparql
KORE 50 NIF NER Corpus

KORE 50[1] (AIDA) is a subset of the larger AIDA corpus, which is based on the dataset of the CoNLL 2003 NER task. The dataset aims to capture hard-to-disambiguate mentions of...
- text/turtle
- PDF
ORCID

ORCID (Open Researcher and Contributor ID) is a nonproprietary alphanumeric code to uniquely identify scientific and other academic authors. This dataset contains RDF conversion...
- text/turtle
- GZ
Statbel Corpus

This corpus contains an RDF conversion of datasets from "Statistics Belgium" (also known as Statbel), which aims at collecting, processing and disseminating relevant, reliable...
- text/turtle
Global airports in RDF

This corpus contains an RDF conversion of the Global airports dataset, which was retrieved from openflights.org. The dataset contains information about airport names, their locations,...
- text/turtle
Lion's Den

Lion's Den is an RDF repository of link specifications. Lion's Den is intended to be an open community-driven dataset that allows data publishers to also publish their...
- text/turtle
- ttl
- sparql endpoint
- RDF
LSQ

Linked SQ: a Linked Dataset describing SPARQL queries extracted from the logs of a variety of prominent public SPARQL endpoints. We argue that this dataset has a variety of uses...
- turtle
Brown Corpus in RDF/NIF

RDF version of the Brown Corpus (W. N. Francis, H. Kucera; Brown University; 1979). 1,014,312 words in 500 documents, taken from newspaper texts on diverse topics, non-fiction...
- text/turtle
- example/turtle
MLSA - A Multi-layered Reference Corpus for German Sentiment Analysis

Sentence-layer annotation represents the most coarse-grained annotation in this corpus. We adhere to definitions of objectivity and subjectivity introduced in (Wiebe et al.,...
- PNG
- text/n3
- api/sparql
- example/turtle
SentimentWortschatz

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative polarity...
- zip:8c
- text/turtle
- api/sparql
- example/turtle
Wikilinks RDF/NIF

The Wikilinks corpus is a coreference resolution corpus of very large scale. It contains over 40 million mentions of over 3 million entities. Mentions are manually labeled links...
- example/turtle
- GZ
- CSV
News-100 NIF NER Corpus

This corpus comprises 100 German news articles from the online news platform news.de. All of the articles were published in the year of 2010 and contain the word Golf. This word...
- text/turtle
- PDF
RSS-500 NIF NER CORPUS

This corpus has been created using a dataset comprising a list of 1,457 RSS feeds as compiled in (Goldhahn et al. 2012). The list includes all major worldwide newspapers and a...
- text/turtle
- PDF

You can also access this registry using the API (see API Docs).

34 datasets found
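
The API access mentioned above could look roughly like the sketch below. It assumes the registry exposes a CKAN-style action API with a `package_search` endpoint; this listing does not confirm that, so check the API Docs first. The base URL, the helper names `build_search_url` and `summarize`, and the sample response are all hypothetical and only illustrate the assumed response shape. No network request is made.

```python
import json
from urllib.parse import urlencode

# Assumption: a CKAN-style action API lives under this base URL.
# The host is a placeholder, not the registry's real address.
BASE_URL = "https://example.org/api/3/action/package_search"


def build_search_url(query, rows=25, start=0):
    """Build a search URL for a free-text query with simple paging."""
    return BASE_URL + "?" + urlencode({"q": query, "rows": rows, "start": start})


def summarize(response_text):
    """Return (total_count, [(title, sorted resource formats), ...])."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("search request was not successful")
    result = payload["result"]
    datasets = []
    for pkg in result["results"]:
        # Collect the distinct, non-empty format labels of each resource.
        formats = sorted(
            {r["format"] for r in pkg.get("resources", []) if r.get("format")}
        )
        datasets.append((pkg.get("title", ""), formats))
    return result["count"], datasets


# Hand-written sample in the assumed response shape (not real API output).
SAMPLE = json.dumps({
    "success": True,
    "result": {
        "count": 34,
        "results": [
            {"title": "LSQ", "resources": [{"format": "turtle"}]},
            {"title": "LinkLion - A Link Repository for the Web of Data",
             "resources": [{"format": "api/sparql"}]},
        ],
    },
})

if __name__ == "__main__":
    print(build_search_url("nif", rows=10))
    total, found = summarize(SAMPLE)
    print(total, "datasets found")
    for title, formats in found:
        print(title, "-", ", ".join(formats))
```

Because the total count can exceed the number of entries on one page, as it does in this listing, a client would raise `start` in steps of `rows` until it has collected `count` entries.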