RSS-500 NIF NER CORPUS

This corpus was created from a list of 1,457 RSS feeds compiled in (Goldhahn et al., 2012). The list includes all major worldwide newspapers and covers a wide range of topics, e.g., World, U.S., Business, Science, etc. A 76-hour crawl of this list resulted in a corpus of about 11.7 million sentences. A subset of this corpus was created by randomly selecting 1% of the contained sentences. Finally, one researcher manually annotated 500 randomly chosen sentences. These sentences were drawn from those that contained a natural language representation of a formal relation, like “..., who was born in ...” for dbo:birthPlace (see (Gerber and Ngomo, 2012)). The relations had to occur more than 5 times in the 1% corpus. If a mentioned entity is not contained in DBpedia, a new URI has been generated for it. This corpus has been used for evaluation purposes in (Gerber et al., 2013).
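
The selection procedure described above can be sketched roughly as follows. This is only an illustrative approximation, not the authors' actual pipeline: the function names, the representation of relation patterns as plain substrings, and the fixed random seed are all assumptions made for this example.

```python
import random
from collections import Counter


def sample_fraction(sentences, fraction, rng):
    """Randomly keep roughly `fraction` of the sentences (1% in the description)."""
    k = max(1, round(len(sentences) * fraction))
    return rng.sample(sentences, k)


def frequent_relations(sentences, patterns, min_count):
    """Return the relations whose pattern occurs in more than `min_count` sentences."""
    counts = Counter()
    for sentence in sentences:
        for relation, phrase in patterns.items():
            if phrase in sentence:
                counts[relation] += 1
    return {relation for relation, count in counts.items() if count > min_count}


def select_for_annotation(sentences, patterns, n, fraction=0.01, min_count=5, seed=0):
    """Pick up to `n` random sentences that express a sufficiently frequent relation."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sketch is reproducible
    subset = sample_fraction(sentences, fraction, rng)
    keep = frequent_relations(subset, patterns, min_count)
    candidates = [s for s in subset if any(patterns[r] in s for r in keep)]
    return rng.sample(candidates, min(n, len(candidates)))


# Hypothetical pattern table: relation identifier -> surface phrase (assumption).
PATTERNS = {"dbo:birthPlace": "who was born in"}
```

Calling `select_for_annotation(sentences, PATTERNS, n=500)` on a list of sentence strings would then return at most 500 candidate sentences for manual annotation.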

Data and Resources

RSS-500 Corpus in Turtle (text/turtle)
Complete corpus file in turtle format

Documentation paper (PDF)
Title: N3 - A Collection of Datasets for Named Entity Recognition and...

DataID (text/turtle)
Metadata description of the corpus

Additional Info

Field	Value
Author	Ricardo Usbeck
Maintainer	Ricardo Usbeck
Last Updated	October 29, 2014, 16:27 (UTC)
Created	September 5, 2014, 07:26 (UTC)
github	https://github.com/AKSW/n3-collection
homepage	http://aksw.org/Projects/N3NERNEDNIF.html
links:dbpedia	524
triples	10038