This post introduces two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first one takes a shortcut using a library I am working on, while the second one shows an exhaustive way to perform the operation.
Both build on LXML, an efficient library for processing XML and HTML. The following lines of code will try to parse and validate a document located in the current working directory of the terminal or Python console.
Shortcut with the trafilatura library
I am currently using this web scraping library to download web pages, find the main text and the comments while preserving some structure, and convert the output to TXT, XML and TEI-XML. As such, I recently added a way to systematically check whether the TEI-XML documents produced by the library are valid.
The library can be installed with pip or pip3 (depending on the system):
pip install lxml trafilatura
As this functionality is new, please update trafilatura if you have already installed it:
pip install -U trafilatura
Trafilatura will seamlessly download the schema on the first call and then return True if a document is valid, or otherwise a message related to the first error impeding validation:
# load the necessary components
from lxml import etree
from trafilatura.xml import validate_tei
# open a file and parse it
mytree = etree.parse('document-name.xml')
# validate it
validate_tei(mytree)  # returns True or an error message
Exhaustive code using LXML
To perform your own validation, a few more steps are required: load the necessary components, fetch the RelaxNG schema published by the TEI consortium, load it into LXML and use it to validate the document. First, make sure you have installed the lxml and requests libraries (pip install lxml requests).
# load the necessary components
import requests
from io import StringIO
from lxml import etree
# download the TEI-XML schema
schema = requests.get('https://tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng').text
If no error message is shown, the RelaxNG TEI schema is now stored as text in the schema variable. It is now necessary to parse it and load it into the validator. To do so, it seems necessary to go through this small workaround:
schema = schema.replace('<?xml version="1.0" encoding="utf-8"?>', '<?xml version="1.0"?>', 1)
This line strips the encoding attribute from the XML declaration, since LXML can refuse to parse a Unicode string that carries an encoding declaration. We can then proceed to the rest of the process:
# load the schema into LXML
relaxng_doc = etree.parse(StringIO(schema))
tei_relaxng = etree.RelaxNG(relaxng_doc)
# open a file and parse it
mytree = etree.parse('document-name.xml')
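As a possible alternative to the string replacement above, the schema can be handed to LXML as bytes rather than text, in which case the encoding declaration should not need to be touched. The following is only a minimal sketch: the tiny inline schema is a hypothetical stand-in for the real TEI schema (in practice the bytes would presumably come from the .content attribute of the requests response instead of .text), and the variable names are my own.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for the downloaded schema, kept as bytes
# and still carrying its encoding declaration.
schema_bytes = b'''<?xml version="1.0" encoding="utf-8"?>
<element name="TEI" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="text"><text/></element>
</element>'''

# Parsing bytes lets LXML handle the declared encoding itself,
# so no replacement of the XML declaration is performed here.
relaxng_doc = etree.parse(BytesIO(schema_bytes))
toy_relaxng = etree.RelaxNG(relaxng_doc)

# A small document that conforms to the toy schema
toy_tree = etree.parse(BytesIO(b'<TEI><text>Hello</text></TEI>'))
is_valid = toy_relaxng.validate(toy_tree)
print(is_valid)
```

The same pattern should carry over to the full TEI schema, although I have only sketched it here on a toy example.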
I found two solutions to validate the document and process the result:
# validation alternative 1
result = tei_relaxng.validate(mytree)  # True or False, details are in tei_relaxng.error_log
# validation alternative 2 (slightly more pythonic)
try:
    tei_relaxng.assert_(mytree)  # raises AssertionError if the document is invalid
except AssertionError as err:
    print('TEI validation error:', err)
If a document is not valid, the LXML validator will provide an error message containing the line number and the type of error, for example the offending tag.
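To look at these messages more closely, the validator keeps an error log after each validation run. The sketch below uses a toy RelaxNG schema and a deliberately invalid document (both hypothetical and defined inline, so that nothing has to be downloaded) and prints the line number, error type and message of each entry.

```python
from io import BytesIO
from lxml import etree

# Toy schema (hypothetical): a <TEI> element containing a single <text> element
toy_schema = etree.RelaxNG(etree.parse(BytesIO(
    b'''<element name="TEI" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="text"><text/></element>
</element>''')))

# Invalid on purpose: <body> is not allowed by the toy schema
bad_tree = etree.parse(BytesIO(b'<TEI>\n  <body>Hello</body>\n</TEI>'))

outcome = toy_schema.validate(bad_tree)  # False for this document

# Each log entry exposes the position and nature of the problem
for entry in toy_schema.error_log:
    print(entry.line, entry.type_name, entry.message)
```

With the real TEI schema, the same loop can be run on tei_relaxng.error_log after calling validate on the parsed document.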