XML-TEI with Python packages

This post introduces two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first takes a shortcut through a library I am working on, while the second spells out every step of the operation.

Both rely on LXML, an efficient library for processing XML and HTML. The following lines of code parse and validate a document located in the current working directory, that is, the directory from which the terminal or Python console was started.

Shortcut with the trafilatura library

I am currently using this web scraping library to download web pages, find the main text and the comments while preserving some structure, and convert the output to TXT, XML & TEI-XML. As such, I recently added a way to systematically check if the TEI-XML documents produced by the library are valid.

The library can be installed with pip or pip3 (depending on the system): pip install lxml trafilatura. As this functionality is new, please update trafilatura if you have already installed it: pip install -U trafilatura.

Trafilatura will seamlessly download the schema on the first call and then return True if a document is valid or a message related to the first error impeding validation otherwise:

# load the necessary components
from lxml import etree
from trafilatura.xml import validate_tei
# open a file and parse it
mytree = etree.parse('document-name.xml')
# validate it
validate_tei(mytree)
# returns True or an error message
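
The "download on the first call" behaviour mentioned above can be approximated in your own scripts by keeping a local copy of the schema. The following is only a minimal sketch of that idea, not trafilatura's actual implementation: the function name load_schema, the cache_file argument and the fetch callable are all hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Callable


def load_schema(cache_file: Path, fetch: Callable[[], str]) -> str:
    """Return the schema text, calling fetch() only if no cached copy exists.

    Both cache_file and fetch are hypothetical parameters chosen for this sketch.
    """
    # reuse the local copy if a previous call already stored one
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # first call: obtain the schema and store it for later runs
    text = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

The fetch argument could for instance be a small function wrapping requests.get(...).text, so that the network is only contacted when the cached file is missing.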

Exhaustive code using LXML

To perform your own validation, a few more steps are needed: load the required components, fetch the RelaxNG schema published by the TEI consortium, load it into LXML and use it to validate the document. First, make sure you have installed the lxml and requests libraries (pip install lxml requests).

# load the necessary components
import requests
from io import StringIO
from lxml import etree
# download the TEI-XML schema
schema = requests.get('https://tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng').text

If no error message is shown, the RelaxNG TEI schema is now stored as text in the schema variable. It still has to be parsed and loaded into the validator. To do so, it seems necessary to go through this small workaround:

schema = schema.replace('<?xml version="1.0" encoding="utf-8"?>', '<?xml version="1.0"?>', 1)

This line removes the encoding declaration from the XML prolog, which avoids an error when LXML parses a Python string rather than bytes. We can then proceed to the rest of the process:

# load the schema into LXML
relaxng_doc = etree.parse(StringIO(schema))
tei_relaxng = etree.RelaxNG(relaxng_doc)
# open a file and parse it
mytree = etree.parse('document-name.xml')
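
The string replacement used in the workaround only matches one exact spelling of the XML declaration. If the schema ever ships with a different case or quoting (for example encoding='UTF-8'), a regular expression is a little more tolerant. This is a sketch under that assumption; the helper name strip_encoding_declaration is hypothetical, and it expects the declaration at the very start of the text.

```python
import re

# matches the encoding pseudo-attribute inside a leading XML declaration,
# with single or double quotes and any encoding name
ENCODING_DECL = re.compile(r'^(<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2')


def strip_encoding_declaration(text: str) -> str:
    """Remove the encoding pseudo-attribute from a leading XML declaration.

    Hypothetical helper: text without a leading declaration is returned unchanged.
    """
    return ENCODING_DECL.sub(r'\1', text, count=1)
```

It could stand in for the replace() call, as in schema = strip_encoding_declaration(schema), before the schema is handed to etree.parse.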

I found two solutions to validate the document and process the result:

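# validation alternative 1
result = tei_relaxng.validate(mytree) # boolean: True if valid, False otherwise
if not result:
    print(tei_relaxng.error_log.last_error) # details on the failure
# validation alternative 2 (slightly more pythonic)
try:
    tei_relaxng.assert_(mytree) # returns None, raises AssertionError if invalid
except AssertionError as err:
    print('TEI validation error:', err)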

If a document is not valid, the LXML validator reports an error message containing the line number and the type of error, for example the offending tag.
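
To look at every reported problem rather than a single message, the validator's error_log can be iterated after a failed validation. The sketch below uses a deliberately tiny, made-up RelaxNG schema (TOY_SCHEMA) instead of the full TEI schema so that it runs without a download; the helper collect_errors is a hypothetical name, and the exact wording of the messages depends on the underlying libxml2 version.

```python
from io import BytesIO
from lxml import etree

# made-up miniature schema standing in for tei_all.rng:
# a <doc> element containing one or more <p> elements with text
TOY_SCHEMA = b"""<element name="doc" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="p"><text/></element>
  </oneOrMore>
</element>"""


def collect_errors(validator, tree):
    """Return a list of (line, message) tuples, empty if the tree is valid."""
    if validator.validate(tree):
        return []
    return [(entry.line, entry.message) for entry in validator.error_log]


validator = etree.RelaxNG(etree.parse(BytesIO(TOY_SCHEMA)))
good = etree.parse(BytesIO(b"<doc>\n  <p>Hello</p>\n</doc>"))
bad = etree.parse(BytesIO(b"<doc>\n  <p>Hello</p>\n  <div/>\n</doc>"))

# print each problem found in the invalid document
for line, message in collect_errors(validator, bad):
    print(f"line {line}: {message}")
```

The same helper should work unchanged with the tei_relaxng validator and the mytree document from the steps above.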