This post introduces two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first takes a shortcut using a library I am working on, while the second shows an exhaustive way to perform the operation.
Both build on lxml, an efficient library for processing XML and HTML. The following lines of code will try to parse and validate a document located in the current working directory of the terminal window or Python console.
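To make the idea of "parse and validate" concrete before going further, here is a minimal sketch of schema validation with lxml. It assumes a RelaxNG schema, and the tiny schema below is a toy written for illustration only, not the actual TEI schema; the helper name `validate` and the sample documents are likewise hypothetical.

```python
from lxml import etree

# Toy RelaxNG schema (an assumption for illustration, NOT the TEI schema):
# the root must be <TEI> and contain exactly one <text> element holding text.
TOY_SCHEMA = b"""<element name="TEI" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="text">
    <text/>
  </element>
</element>"""


def validate(xml_bytes, schema_bytes=TOY_SCHEMA):
    """Return True if the document is valid, otherwise an error message."""
    # Compile the schema, then parse the document to check
    relaxng = etree.RelaxNG(etree.fromstring(schema_bytes))
    doc = etree.fromstring(xml_bytes)
    if relaxng.validate(doc):
        return True
    # Fall back to a generic message if the log happens to be empty
    return str(relaxng.error_log.last_error or "invalid document")


valid_result = validate(b"<TEI><text>Hello</text></TEI>")
invalid_result = validate(b"<TEI><body>Hello</body></TEI>")
print(valid_result)
print(invalid_result)
```

The first call should be accepted and the second rejected, since `<body>` is not allowed by the toy schema. Validating real TEI-XML works the same way in principle, with the full TEI schema loaded in place of the toy one.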
Shortcut with the trafilatura library
I am currently using this web scraping library to download web pages, find the main text and the comments while preserving some structure, and convert the output to TXT, XML & TEI-XML. As such, I recently added a way to systematically check whether the TEI-XML documents produced by the library are valid.
The library can be installed with pip or pip3 (depending on the system):
pip install lxml trafilatura
As this functionality is new, please update trafilatura if you have already installed it:
pip install -U trafilatura
Trafilatura will seamlessly download the schema on the first call and then return True if a document is valid or …