Document-driven and data-driven, standoff and inline

First of all, the intention behind the encoding can differ. Richard Eckart summarizes two main trends: document-driven XML and data-driven XML. While the former uses an « inline approach » and is « usually easily human-readable and meaningful even without the annotations », the latter is « geared towards machine processing and functions like a database record. […] The order of elements often is meaningless. » (Eckart 2008, p. 3)
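
As a rough illustration of this distinction, here is a minimal sketch in Python: the same fact is encoded once in a document-driven way, where the running text carries the meaning and the markup merely enriches it, and once in a data-driven way, where the XML behaves like a database record whose element order is irrelevant. The element names are invented for the example and are not taken from Eckart.

```python
# Two encodings of the same fact, sketched as XML strings.
# Element names are invented for illustration only.
import xml.etree.ElementTree as ET

# Document-driven: the running text is primary, markup enriches it.
document_driven = """<p>The novel <title>Effi Briest</title> was written by
<author>Theodor Fontane</author> in <date>1895</date>.</p>"""

# Data-driven: a record geared towards machine processing;
# the order of the child elements carries no meaning.
data_driven = """<record>
  <date>1895</date>
  <author>Theodor Fontane</author>
  <title>Effi Briest</title>
</record>"""

for label, xml in [("document-driven", document_driven), ("data-driven", data_driven)]:
    root = ET.fromstring(xml)
    print(label, "->", root.tag, [child.tag for child in root])
```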

In fact, several architectural choices depend on the goal of the XML annotation. The main division is between standoff and inline XML (also spelled stand-off and in-line).

The PAULA format (“Potsdamer Austauschformat für linguistische Annotation”, ‘Potsdam Interchange Format for Linguistic Annotation’) adopted both approaches. So did Nancy Ide for the ANC project, where a series of tools enables users to convert data between well-known formats (GrAF standoff, GrAF inline, GATE or UIMA). This versatility is a real asset, since corpus users cannot be expected to change their habits for the sake of a single corpus. As to how standoff and inline annotation compare, Dipper et al. (2007) found that the inline format (with pointers) performs better.
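
To make the contrast concrete, here is a minimal sketch in Python that builds the same annotated sentence twice: once with inline annotation, where the markup wraps the primary text, and once as a standoff layer that points into the primary text by character offsets. The element and attribute names are invented for illustration and do not follow PAULA, GrAF or any other real schema.

```python
# Inline vs. standoff annotation of the same sentence.
# Element/attribute names are invented for illustration only.
import xml.etree.ElementTree as ET

TOKENS = [("Colorless", "ADJ"), ("ideas", "NOUN"), ("sleep", "VERB")]
PRIMARY_TEXT = " ".join(word for word, _ in TOKENS)

# Inline: the annotations wrap the primary text directly.
inline = ET.Element("s")
for i, (word, pos) in enumerate(TOKENS):
    w = ET.SubElement(inline, "w", {"id": f"w{i}", "pos": pos})
    w.text = word

# Standoff: the primary text stays untouched (e.g. in a separate file);
# an annotation layer points at character offsets ("pointers").
standoff = ET.Element("annotation", {"base": "primary.txt"})
offset = 0
for i, (word, pos) in enumerate(TOKENS):
    ET.SubElement(standoff, "mark", {
        "id": f"m{i}",
        "start": str(offset),
        "end": str(offset + len(word)),
        "pos": pos,
    })
    offset += len(word) + 1  # account for the separating space

print(ET.tostring(inline, encoding="unicode"))
print(ET.tostring(standoff, encoding="unicode"))
```

In the inline version the XML document is the text; in the standoff version the text can remain read-only while any number of annotation layers point into it, which is what makes round-tripping between the two representations possible in principle.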

A few trends in linguistic research

Speaking of trends in German research over the last decade, Wörner et al. (2006) identify three main approaches (p. 1):

  • the timeline-based stand-off format Exmaralda (Schmidt 2004)
  • the hierarchical format Tusnelda, which is based on the TEI [Sperberg-McQueen and Burnard 1994]
  • PAULA, which resembles the Linguistic Annotation Framework [Ide et al. 2003]

Among them, PAULA seems especially interesting:

« The interchange format PAULA has been developed for empirical, data-based research on information structure, a linguistic phenomenon that involves various linguistic levels, such as syntax, phonology, semantics. As a consequence, the data which serve as the basis of this research are marked up with different kinds of annotations: syntax trees or graphs, segment-based phonological properties, etc. The annotations are created by means of different, task-specific annotation tools » (Wörner et al. 2006, p. 5)

The Linguistic Annotation Framework was developed by Nancy Ide and Laurent Romary, who had already worked together on the XCES standard. Their models and accomplishments can be seen at work in the American National Corpus (ANC).

The TEI developed a very different and arguably more complete annotation framework than XCES, although the two approaches remain similar in spirit.

Known issues

The variety of features that can be annotated is a challenge in itself. Witt et al. (2009) document one serious issue: crossing edges in an XML graph.

« Linguistically annotated corpora may contain crossing edges and, thus, require a data structure that is more complex than a simple tree. » (p. 364)

Thus, they try to show how multi-rooted trees can be represented « in an integrated way, by using the TEI tag set for the annotation of feature structures. » (p. 365)
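
The problem can be illustrated with two annotation layers over the same tokens whose spans overlap without one containing the other: such spans cannot both be serialized as properly nested elements of a single XML tree. The short Python sketch below, with invented layer names and inclusive token indices, only demonstrates the underlying problem; it is not the TEI feature-structure encoding discussed by Witt et al.

```python
# Two annotation layers over the same tokens whose spans cross:
# they cannot be serialized as nested elements of a single XML tree.
# Names and structure are invented for illustration only.
tokens = ["the", "old", "man", "the", "boats"]

layers = {
    # syntactic layer: a noun phrase over tokens 0-2 (inclusive)
    "syntax":  [{"label": "NP", "span": (0, 2)}],
    # prosodic layer: an intonation unit over tokens 2-4 (inclusive)
    "prosody": [{"label": "IU", "span": (2, 4)}],
}

def crosses(a, b):
    """True if the two spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return (s1 < s2 <= e1 < e2) or (s2 < s1 <= e2 < e1)

spans = [m["span"] for layer in layers.values() for m in layer]
conflicts = [(a, b) for i, a in enumerate(spans) for b in spans[i + 1:] if crosses(a, b)]

print("crossing edges:", conflicts)   # -> [((0, 2), (2, 4))]
```

Each layer on its own is a well-formed tree over the shared terminals; it is only their combination that breaks the single-tree assumption, which is precisely the situation that multi-rooted trees (or graphs, as in GrAF) are meant to capture.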

Given the diversity of formats, one of the main goals should be to ensure interoperability. That is where complying with a standard has a few advantages, described by Romary (2009):

« As soon as the corpus to be digitized is planned to be disseminated to a wider audience, one should make sure that the documentation of the corpus objects, both from a library point of view (meta-data, source identification, etc.) and a technical point of view (schema), is adequate for their autonomous processing by third-party users; » (p. 3)

« the definition [of] a finite set of features and corresponding practices is somehow simplified, with very little room for encoding overkill. Still, since the corpus of texts is a constantly evolving matter, there is a need for defining a workflow for constant updating of the underlying schema; » (p. 5)

Rehm et al. (2009) stress that interoperability is not as easy as it seems. First of all, data normalization should be considered an important issue, because beyond the merely practical problems, the underlying models themselves can be called into question.

« The aspect of data normalization is rarely discussed in academic publications. This is mostly due to the fact that the conversion from one format into another is not regarded as a difficult or challenging task, because tools and specialized programming languages exist that support researchers in converting data sets from one format into another. In practice, however, it turns out that this rather time-consuming task is in fact of interest for researchers within Digital Humanities. The reason is that the specification of transformations may change the model according to which a text resource is annotated. » (p. 201)
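
A toy illustration of this point, building on the crossing-spans example above and again using invented names: converting such standoff layers into a single nested, inline-style structure forces the crossing span to be split into fragments so that everything nests, so the converted resource silently encodes a slightly different annotation model than the original.

```python
# A toy conversion towards a single nested (inline-style) structure:
# the crossing span has to be broken into fragments so that everything
# nests, which silently changes the annotation model.
# All names are invented for illustration.

def split_to_nest(outer, inner):
    """Split `inner` at the right edge of `outer` so both can nest."""
    s, e = inner
    _, outer_end = outer
    return [(s, outer_end), (outer_end + 1, e)]

np_span = (0, 2)          # syntactic NP over tokens 0-2
iu_span = (2, 4)          # prosodic unit over tokens 2-4 (crosses the NP)

fragments = split_to_nest(np_span, iu_span)
print(fragments)           # -> [(2, 2), (3, 4)]  one unit became two elements
```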

Finally, Romary describes a theoretical issue on the linguistic side: the compatibility of linguistic frameworks is anything but guaranteed.

« What if two or more corpora contain data annotated in markup languages that are, from a theoretical linguistics point-of-view, incompatible with each other (for example, if they are based on incompatible theoretical frameworks) – will it be possible to represent terms and concepts in the ontology that contradict each other? » (p. 8)

References

  • L. Romary, “Stabilizing knowledge through standards – A perspective for the humanities”, Going Digital: Evolutionary and Revolutionary Aspects of Digitization, 2010.
  • G. Rehm, O. Schonefeld, A. Witt, E. Hinrichs, and M. Reis, “Sustainability of annotated resources in linguistics: A web-platform for exploring, querying, and distributing linguistic corpora and other resources”, Literary and Linguistic Computing, vol. 24, iss. 2, pp. 193-210, 2009.
  • A. Witt, G. Rehm, E. Hinrichs, T. Lehmberg, and J. Stegmann, “SusTEInability of linguistic resources through feature structures,” Literary and Linguistic Computing, vol. 24, iss. 3, pp. 363-372, 2009.
  • L. Romary, “Questions & Answers for TEI Newcomers”, Jahrbuch für Computerphilologie, vol. 10, 2009.
  • R. Eckart, “Choosing an XML database for linguistically annotated corpora”, Sprache und Datenverarbeitung, vol. 32, iss. 1, pp. 7-22, 2008.
  • TEI P5: Guidelines for Electronic Text Encoding and Interchange, Text Encoding Initiative Consortium, 2007 (technical report).
  • N. Ide and K. Suderman, “GrAF: A graph-based format for linguistic annotations”, in Proceedings of the Linguistic Annotation Workshop, 2007, pp. 1-8.
  • S. Dipper, M. Götze, U. Küssner, and M. Stede, “Representing and querying standoff XML”, in Data Structures for Linguistic Resources and Applications – Proceedings of the Biennial GLDV Conference 2007, Rehm, G., Witt, A., and Lemnitzer, L. (eds.), Tübingen: Gunter Narr, 2007, pp. 337-346.
  • K. Wörner, A. Witt, G. Rehm, and S. Dipper, “Modelling Linguistic Data Structures”, in Proceedings of Extreme Markup Languages, Montréal, Canada, 2006.
  • H. Lobin, “Textauszeichnung und Dokumentgrammatiken”, Texttechnologie, Lobin, H. and Lemnitzer, L. (eds.), Stauffenburg Verlag, 2003.
  • N. Ide, P. Bonhomme, and L. Romary, “XCES: An XML-based Encoding Standard for Linguistic Corpora”, in Proceedings of the Second Language Resources and Evaluation Conference (LREC), 2000, pp. 825-830.