DBpedia abstract French corpus

This corpus contains a conversion of Wikipedia abstracts in six languages (Dutch, English, French, German, Italian and Spanish) into the NLP Interchange Format (NIF). The corpus contains the abstract texts, as well as the position, surface form and linked article of all links in the text. As such, it contains entity mentions manually disambiguated to Wikipedia/DBpedia resources by native speakers, which makes it well suited for NER training and evaluation. Furthermore, the abstracts represent a special form of text that lends itself to more sophisticated tasks, such as open relation extraction. Their encyclopedic style, which follows the Wikipedia guidelines on opening paragraphs, adds further interesting properties. The first sentence puts the article in a broader context. Most anaphors refer to the original topic of the text, making them easier to resolve. Finally, should the same string occur with a different meaning, Wikipedia guidelines suggest that the new meaning should be linked again for disambiguation. In short: the type of text is highly interesting.
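To make the description above concrete, the following is a minimal Python sketch of how one link annotation (position, surface form and linked article) might be held in memory and checked against its abstract text. The class name `LinkMention`, the function `consistent_mentions`, the sample sentence and the resource URIs are illustrative assumptions and are not taken from the corpus. The field names loosely mirror the NIF properties `nif:beginIndex`, `nif:endIndex` and `nif:anchorOf` and the ITS RDF property `itsrdf:taIdentRef`, and the sketch assumes that offsets count characters with an exclusive end index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkMention:
    """One link in an abstract (hypothetical container, not a corpus API)."""
    begin: int         # start offset in the abstract (assumed inclusive)
    end: int           # end offset in the abstract (assumed exclusive)
    surface_form: str  # the linked text as it appears in the abstract
    resource: str      # the Wikipedia/DBpedia resource the link points to


def consistent_mentions(text, mentions):
    """Keep only mentions whose offsets reproduce their surface form."""
    return [
        (m.surface_form, m.resource)
        for m in mentions
        if text[m.begin:m.end] == m.surface_form
    ]


# Illustrative sample data, not taken from the corpus.
abstract = "Paris est la capitale de la France."
mentions = [
    LinkMention(0, 5, "Paris", "http://fr.dbpedia.org/resource/Paris"),
    LinkMention(28, 34, "France", "http://fr.dbpedia.org/resource/France"),
]

for surface_form, resource in consistent_mentions(abstract, mentions):
    print(surface_form, "->", resource)
```

Checking that the offsets reproduce the surface form is a cheap sanity test before the mentions are used as NER training spans, since an off-by-one in the offsets would otherwise silently shift every label.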

Data and Resources

Additional Info

Field          Value
Source         https://datahub.io/dataset/dbpedia-abstract-corpus
Author         InfAI
Maintainer     Milan Dojchinovski
Last Updated   January 22, 2016, 10:14 (UTC)
Created        January 18, 2016, 22:09 (UTC)
Language       http://lexvo.org/id/iso639-3/fra