Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus

The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).

All of MASC includes manually validated annotations for sentence boundaries, tokens, lemmas, and parts of speech (POS); noun and verb chunks; and named entities (person, location, organization, date). The MASC project has also manually produced or validated additional annotations for portions of the sub-corpus, including full-text annotation for FrameNet frame elements, WordNet sense tags, and Penn Treebank syntactic annotation. Other projects have contributed annotations of all or portions of the sub-corpus for a wide variety of other linguistic phenomena.

Unlike most freely available corpora that include a wide variety of linguistic annotations, MASC contains texts from a broad range of genres.

MASC is an open language data resource that anyone can download for any purpose. At the same time, it is a resource that the community will enhance through its contributions of annotations and derived data.

The RDF conversion from GrAF is still in preparation. In the meantime, a preliminary OWL2/DL representation of the Penn Treebank syntax annotations (60,000 tokens), generated from the original annotations, is available below. Triple counts and links refer to this fragment. The conversion follows the POWLA specifications (http://sourceforge.net/projects/powla/) and includes links to the OLiA Annotation Models for PTB morphosyntax and syntax.

Data and Resources

MASC v. 1.0.3 (ZIP)
MASC1 data and annotations (v 1.0.3), ANC (GrAF) format

MASC v. 3.0.0 (application/x-tgz)
original formats and GrAF XML

MASC v. 3.0.0 (PTB syntax only), OWL2/DL (application/x-tgz)
provisional conversion of the PTB-annotated MASC subsection, 60.000 tokens,...

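Two of the resources above are listed as application/x-tgz, that is, gzip-compressed tar archives. As a minimal sketch for getting an overview of such an archive before unpacking it, the Python function below counts the regular files it contains by file extension. The function name summarize_archive is hypothetical, and nothing here assumes a particular layout inside the MASC archives; it uses only the Python standard library.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(archive_path):
    """Count regular files in a gzip-compressed tar archive by file extension.

    Hypothetical helper for illustration; it makes no assumption about
    which files or directories a given MASC archive actually contains.
    """
    counts = Counter()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links, and other non-file entries.
            if member.isfile():
                suffix = PurePosixPath(member.name).suffix.lower() or "(none)"
                counts[suffix] += 1
    return counts
```

Calling summarize_archive("masc.tgz"), where masc.tgz stands in for whatever name the downloaded file has, returns a Counter that maps each extension to the number of files carrying it.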

Additional Info

Field	Value
Source	http://www.anc.org/MASC/Home.html
Maintainer	American National Corpus maintainer
Last Updated	July 29, 2014, 09:25 (UTC)
Created	September 16, 2012, 18:27 (UTC)
links:olia	67000
tokens	62000
triples	1000000
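
The counts in the table can be related to one another with simple arithmetic. The sketch below assumes that links:olia counts links into the OLiA Annotation Models and that all three figures are rounded, so the resulting ratios are only approximate.

```python
# Figures copied from the Additional Info table; they appear to be rounded.
tokens = 62_000
triples = 1_000_000
olia_links = 67_000  # assumed to mean links into the OLiA Annotation Models

# Average number of RDF triples and OLiA links per token in the fragment.
triples_per_token = triples / tokens
links_per_token = olia_links / tokens

print(f"triples per token: {triples_per_token:.1f}")
print(f"OLiA links per token: {links_per_token:.2f}")
```

On these figures the OWL2/DL fragment carries roughly 16 triples and about one OLiA link per token.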