Validating TEI-XML documents with Python

This post introduces two ways to validate XML documents in Python according the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first one takes a shortcut using a library I am working on, while the second one shows an exhaustive way to perform the operation.

Both ground on LXML, an efficient library for processing XML and HTML. The following lines of code will try to parse and validate a document in the same directory as the terminal window or Python console.

Shortcut with the trafilatura library

I am currently using this web scraping …

more ...

Two open-source corpus-builders for German and French

Introduction

I already described how to build a basic specialized crawler on this blog. I also wrote about crawling a newspaper website to build a corpus. As I went on work on this issue, I decided to release a few useful scripts under an open-source license.

The crawlers are not just mere link-harvesters, they are designed to be used as corpus-builders. As one cannot republish anything but quotations of the texts, the purpose is to enable others to make their own version of the corpora. Since the newspapers are updated quite often, it is not imaginable to create exact duplicates …

more ...

2nd release of the German Political Speeches Corpus

Last Monday, I released an updated version of both corpus and visualization tool on the occasion of the DGfS-CL Poster-Session in Frankfurt, where I presented a poster (in German).

The first version had been made available last summer and mentioned on this blog, cf this post: Introducing the German Political Speeches Corpus and Visualization Tool.

For stability, the resource is available at this permanent redirect: http://purl.org/corpus/german-speeches

Description

In case you don’t remember it or never heard of it, here is a brief description:

The resource presented here consists of speeches by the last German …

more ...

XML standards for language corpora (review)

Document-driven and data-driven, standoff and inline

First of all, the intention of the encoding can be different. Richard Eckart summarizes two main trends: document-driven XML and data-driven XML. While the first uses an « inline approach » and is « usually easily human-readable and meaningful even without the annotations », the latter is « geared towards machine processing and functions like a database record. […] The order of elements often is meaningless. » (Eckart 2008 p. 3)

In fact, several choices of architecture depend on the goal of an annotation using XML. The main division regards standoff and inline XML (also : stand-off and in-line).

The Paula format …

more ...