The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple steps, as soon as exhaustive conversions and/or quality control is needed, working on a parsed document is the way to go.
In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.
A confortable installation is
apt-get install python-lxml on Debian/Ubuntu, but the underlying packages may be old. The more pythonic way would be to make sure all the necessary libraries are installed (something like
apt-get install libxml2-dev libxslt1-dev python-dev), and then using a package manager such as pip:
pip install lxml.
Here are the modules required for basic manipulation:
from __future__ import print_function from lxml import etree, html from StringIO import StringIO
And here is how to read a file, supposing it is valid Unicode (it is not necessarily the case). The StringIO buffering is probably not the most direct way, but I found ...more ...