RSS-500 NIF NER CORPUS

This corpus was created from a dataset comprising a list of 1,457 RSS feeds, as compiled in (Goldhahn et al., 2012). The list includes all major worldwide newspapers and a wide range of topics, e.g., World, U.S., Business, Science, etc. The RSS list was compiled using a 76-hour crawl, which resulted in a corpus of about 11.7 million sentences. A subset of this corpus was created by randomly selecting 1% of the contained sentences. Finally, one researcher manually annotated 500 randomly chosen sentences. These sentences were a subset of those that contained a natural language representation of a formal relation, like "..., who was born in ..." for dpo:birthPlace (see Gerber and Ngomo, 2012). The relations had to occur more than 5 times in the 1% corpus. If a mentioned entity is not contained in DBpedia, a new URI has been generated. This corpus has been used for evaluation purposes in (Gerber et al., 2013).
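The selection procedure described above can be sketched roughly as follows. This is a minimal illustration and not the code used to build the corpus: the function names, the `patterns` dictionary, and the use of plain substring matching to detect a relation are all assumptions made for the example. Only the 1% sampling rate, the "more than 5 occurrences" threshold, and the 500-sentence target come from the description.

```python
import random
from collections import Counter


def sample_fraction(sentences, fraction=0.01, seed=0):
    """Randomly keep a fraction of the sentences (1% in the description)."""
    rng = random.Random(seed)
    k = max(1, round(len(sentences) * fraction))
    return rng.sample(sentences, k)


def select_candidates(sample, patterns, min_count=5):
    """Keep sentences that express a relation occurring more than min_count times.

    `patterns` maps a relation name to a list of surface phrases.
    Substring matching is a stand-in (assumption) for real pattern matching.
    """
    counts = Counter()
    matched = []  # (sentence, set of relations found in it)
    for sentence in sample:
        relations = {
            rel
            for rel, phrases in patterns.items()
            if any(phrase in sentence for phrase in phrases)
        }
        if relations:
            counts.update(relations)
            matched.append((sentence, relations))
    frequent = {rel for rel, count in counts.items() if count > min_count}
    return [sentence for sentence, relations in matched if relations & frequent]


def pick_for_annotation(candidates, n=500, seed=0):
    """Randomly choose up to n candidate sentences for manual annotation."""
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))


if __name__ == "__main__":
    # Hypothetical toy input; the real corpus has about 11.7 million sentences.
    corpus = [f"Person {i}, who was born in City {i}." for i in range(2000)]
    patterns = {"dpo:birthPlace": ["who was born in"]}
    subset = sample_fraction(corpus, fraction=0.01)
    chosen = pick_for_annotation(select_candidates(subset, patterns))
    print(len(subset), len(chosen))
```

Fixing the seed keeps the sketch reproducible; the description does not say how the original random selection was seeded.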

Additional Info

Field           Value
Author          Ricardo Usbeck
Maintainer      Ricardo Usbeck
Last Updated    October 29, 2014, 16:27 (UTC)
Created         September 5, 2014, 07:26 (UTC)
github          https://github.com/AKSW/n3-collection
homepage        http://aksw.org/Projects/N3NERNEDNIF.html
links:dbpedia   524
triples         10038