Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is full of tutorials, some of which explain how to convert to or from XML formats using regular expressions. While this may work for very simple cases, as soon as exhaustive conversions and/or quality control are needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.

Update: I released a Python module that includes all operations described here and more: trafilatura

Installation

A convenient installation is apt-get install python3-lxml on Debian/Ubuntu (python-lxml for Python 2), but the packaged version may be old. The more pythonic way is to make sure all the necessary libraries are installed (something like apt-get install libxml2-dev libxslt1-dev python3-dev) and then to use a package manager such as pip: pip install lxml.
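To check which versions actually ended up installed, the version tuples exposed by lxml.etree can be printed. This is a small sketch; the variable names are only illustrative.

```python
from lxml import etree

# Version tuples of lxml itself and of the C libraries it runs on
lxml_version = etree.LXML_VERSION
libxml_version = etree.LIBXML_VERSION
libxslt_version = etree.LIBXSLT_VERSION

print('lxml:', lxml_version)
print('libxml2:', libxml_version)
print('libxslt:', libxslt_version)
```

An old libxml2 version here is usually the sign that the distribution package was picked up rather than a recent build from pip.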

Parsing HTML

Here are the modules required for basic manipulation:

from __future__ import print_function
from lxml import etree, html
from io import StringIO  # available on Python 2.6+ and Python 3

And here is how to read a file, supposing it is valid UTF-8 (which is not necessarily the case). The StringIO buffering is probably not the most direct way, but I found it more practical to keep this intermediary step for debugging (file not found, Unicode errors, etc.).

filecontent = None
try:
    # read raw bytes and decode explicitly so that encoding errors surface here
    with open('filename.html', 'rb') as inputfh:
        try:
            filecontent = inputfh.read().decode('utf-8')
        except UnicodeDecodeError:
            print('ERROR: unicode')
except IOError as err:
    print('ERROR: {0}'.format(err.strerror))

parser = html.HTMLParser()
try:
    tree = html.parse(StringIO(filecontent), parser)
except etree.XMLSyntaxError as details:
    print('ERROR: parser', details.error_log)

The error log should be detailed enough to correct any bugs. Supposing everything goes fine, the parsed HTML file is now contained in the variable named “tree”, which is to be operated on to extract and print out the desired content.
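On Python 3, the same decode-then-parse sequence can be wrapped in a small function. The following is a minimal sketch and not the only way to do it: the function name parse_html and the sample bytes are made up for illustration, and html.fromstring is used instead of html.parse because the input is already in memory.

```python
from lxml import etree, html

# Hypothetical sample input; a real script would read these bytes from a file
raw = (b'<html><head><title>Example</title></head>'
       b'<body><p class="content">Hello <b>world</b>!</p></body></html>')

def parse_html(raw_bytes):
    """Decode bytes as UTF-8 and parse them; return None if either step fails."""
    try:
        text = raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        print('ERROR: unicode')
        return None
    try:
        return html.fromstring(text)
    except (etree.ParserError, etree.XMLSyntaxError) as details:
        print('ERROR: parser', details)
        return None

doc = parse_html(raw)
print(doc.tag, doc.findtext('.//title'))
```

Returning None keeps the two failure cases (bad encoding, unparsable input) visible to the caller without stopping a batch conversion.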

Extracting elements

There are several methods to walk through the tree (see lxml.etree tutorial and documentation for lxml.html). I am going to focus on XPath, a query format for selecting nodes, which is usually what I need.

try:
    # target and print all <p class="content"> elements and subelements
    for element in tree.xpath('//p[@class="content"]//*'):
        print(element.text_content()) # particular for lxml.html
except etree.XPathEvalError as details:
    print('ERROR: XPath expression', details.error_log)
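To make the effect of the trailing //* more concrete, here is a small self-contained sketch on an invented HTML snippet. It contrasts selecting the paragraphs themselves with selecting only their descendant elements.

```python
from lxml import html

# Invented sample document
doc = html.fromstring(
    '<html><body>'
    '<p class="content">First <b>bold</b> and <i>italic</i> words.</p>'
    '<p class="other">Ignored <b>text</b>.</p>'
    '<p class="content">Second paragraph.</p>'
    '</body></html>'
)

# The matching paragraphs themselves, with all nested text
paragraphs = [p.text_content() for p in doc.xpath('//p[@class="content"]')]

# Only the elements nested inside the matching paragraphs
descendants = [e.tag for e in doc.xpath('//p[@class="content"]//*')]

print(paragraphs)
print(descendants)
```

The second paragraph has no child elements, so it contributes nothing to the second query. This is worth keeping in mind when text seems to go missing.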

Writing to XML TEI

Here is a straightforward way to build an XML document and fill its elements with content extracted from the HTML file:

# prepare TEI document
tei = etree.Element('TEI', xmlns='http://www.tei-c.org/ns/1.0')
tei_header = etree.SubElement(tei, 'teiHeader')
filedesc = etree.SubElement(tei_header, 'fileDesc')
file_titlestmt = etree.SubElement(filedesc, 'titleStmt')
titlemain = etree.SubElement(file_titlestmt, 'title', type='main')
# and so on...

# extract the desired content
titlemain.text = tree.find('//head/title').text_content()
# and so on...

# print/write the result
print(etree.tostring(tei, pretty_print=True, xml_declaration=True, encoding='UTF-8').decode('utf-8'))
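Put together, the steps can be sketched as a self-contained script. The input document is invented, and the resulting header is deliberately incomplete: a valid TEI file needs further elements (for instance publicationStmt and sourceDesc), which are left out here for brevity.

```python
from lxml import etree, html

# Invented input document
doc = html.fromstring(
    '<html><head><title>Sample page</title></head>'
    '<body><p>Some text.</p><p>More text.</p></body></html>'
)

# Prepare the TEI skeleton: header with a main title, then text/body
tei = etree.Element('TEI', xmlns='http://www.tei-c.org/ns/1.0')
header = etree.SubElement(tei, 'teiHeader')
filedesc = etree.SubElement(header, 'fileDesc')
titlestmt = etree.SubElement(filedesc, 'titleStmt')
title = etree.SubElement(titlestmt, 'title', type='main')
text = etree.SubElement(tei, 'text')
body = etree.SubElement(text, 'body')

# Fill the skeleton with content extracted from the HTML tree
title.text = doc.findtext('.//head/title')
for paragraph in doc.xpath('//body//p'):
    tei_p = etree.SubElement(body, 'p')
    tei_p.text = paragraph.text_content()

# tostring() returns bytes when an encoding is given, hence the decode()
result = etree.tostring(tei, pretty_print=True, xml_declaration=True,
                        encoding='UTF-8').decode('utf-8')
print(result)
```

Note that the output is serialized from the new tei element, not from the parsed HTML tree.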

Things to be aware of

lxml uses a .tail property to hold the text that follows an element's closing tag, which is counterintuitive and may lead to text being left out. One has to make sure that these “tails” are caught and written to the output (see tutorial).
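A short sketch on an invented paragraph shows where the text ends up and what is lost if only .text is collected:

```python
from lxml import html

p = html.fromstring('<p>Start <b>bold</b> middle <i>italic</i> end.</p>')

# Text before the first child belongs to the parent's .text,
# text after each child belongs to that child's .tail
first_tail = p[0].tail
second_tail = p[1].tail

# Naive collection: the tails are silently dropped
naive = (p.text or '') + ''.join(child.text or '' for child in p)

# Collecting .text and .tail of each child keeps everything
complete = (p.text or '') + ''.join(
    (child.text or '') + (child.tail or '') for child in p
)

print(repr(naive))
print(repr(complete))
```

For deeper nesting, ''.join(p.itertext()) or, in lxml.html, p.text_content() handles the recursion.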

XPath 2.0 and 3.0 are not implemented in lxml, which only supports XPath 1.0, so expressions that rely on later versions of the standard will raise an error. Yes, it is unfortunate.
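As an illustration, here is a sketch of a case-insensitive attribute match on invented markup. The XPath 2.0 function lower-case() fails, while two workarounds that I believe lxml supports do the job: the XPath 1.0 function translate() and the EXSLT regular expression extension.

```python
from lxml import etree, html

doc = html.fromstring(
    '<div><p class="Content">A</p><p class="content">B</p>'
    '<p class="other">C</p></div>'
)

# XPath 2.0 function: unknown to the XPath 1.0 engine
try:
    doc.xpath('//p[lower-case(@class)="content"]')
    xpath2_supported = True
except etree.XPathEvalError:
    xpath2_supported = False

# XPath 1.0 workaround: map upper case to lower case with translate()
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LOWER = 'abcdefghijklmnopqrstuvwxyz'
with_translate = doc.xpath(
    '//p[translate(@class, $upper, $lower)="content"]',
    upper=UPPER, lower=LOWER,
)

# EXSLT regular expressions, made available through a namespace prefix
regex_ns = {'re': 'http://exslt.org/regular-expressions'}
with_regex = doc.xpath('//p[re:test(@class, "^content$", "i")]',
                       namespaces=regex_ns)

print(xpath2_supported)
print([p.text for p in with_translate])
print([p.text for p in with_regex])
```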

Also recommended

The excellent ftfy module, which fixes broken Unicode in Python 2 and 3.
The HTML-cleaner component of lxml, which accepts a list of tags to strip or delete (see documentation); this is not only faster but also easier to maintain than removing the elements by hand.
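The distinction between deleting a tag and stripping it can be sketched without the cleaner itself. The example below swaps in two lower-level functions from lxml.etree, strip_elements and strip_tags, applied to invented markup. It is not the HTML-cleaner component mentioned above, only an illustration of the same idea.

```python
from lxml import etree, html

doc = html.fromstring(
    '<div><script>var x = 1;</script>'
    '<p>Keep <span>this</span> text.</p>'
    '<style>p {}</style> Tail text.</div>'
)

# Delete: remove the elements and their content,
# but keep the text that follows them (their tail)
etree.strip_elements(doc, 'script', 'style', with_tail=False)

# Strip: remove the tags only, merging their text into the parent
etree.strip_tags(doc, 'span')

cleaned = html.tostring(doc, encoding='unicode')
print(cleaned)
```

Keeping the two tag lists in one place makes it easy to adjust what gets removed without touching the rest of the conversion script.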

All in all, I would really recommend aiming for readability of code and stability of procedures rather than speed or quick fixes. With the implementation of special cases and fine-grained decisions, my conversion scripts reach 500 or 1000 lines faster than I would like them to.