-
Reuters-21578
A set of documents from Reuters' 1986 newswire which have been classified. This dataset is appropriate for testing natural language processing and information retrieval... -
RCV1-v2/LYRL2004
This is a publicly available, tokenized version of the Reuters RCV1 corpus by David D Lewis et al. The creator requests attribution. -
The ClueWeb09 Dataset
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages...