Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple cases, as soon as exhaustive conversion and/or quality control are needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.

Installation

A comfortable way to install it is apt-get install python-lxml on Debian/Ubuntu, but the packaged version may be old. The more pythonic way is to make sure all the necessary libraries are installed (something like apt-get install libxml2-dev libxslt1-dev python-dev) and then to use a package manager such as pip: pip install lxml.

Parsing HTML

Here are the modules required for basic manipulation:

from __future__ import print_function
# etree for generic XML handling, html for the fault-tolerant HTML parser
from lxml import etree, html
# Python 2 module; on Python 3, use "from io import StringIO" instead
from StringIO import StringIO

And here is how to read a file, supposing it is valid Unicode (which is not necessarily the case). The StringIO buffering is probably not the most direct way, but I found ...
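
A minimal sketch of such a parsing step, assuming a placeholder file name and UTF-8-encoded input, could look as follows:

from lxml import etree
from StringIO import StringIO

# hypothetical file name; the content is assumed to be UTF-8-encoded HTML
with open('document.html', 'r') as inputfile:
    htmlstring = inputfile.read()

# buffer the string and feed it to a fault-tolerant HTML parser
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(StringIO(htmlstring), parser)

# the parsed tree can then be explored, for instance by printing the page title
print(tree.findtext('.//title'))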


Rule-based URL cleaning for text collections

I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script named clean_urls.py, which I made available under an open-source license as part of the FLUX-toolchain.

The following Python-based regular expressions show how malformed URLs, URLs leading to irrelevant content, as well as URLs which obviously lead to adult content or spam can be filtered using a rule-based approach.

Avoid recurrent sites and patterns to save bandwidth

First, it can be useful to make sure that a URL was properly parsed before it makes it into the list; the very first step is to check whether it starts with the right protocol (ftp is deemed irrelevant in my case).

protocol = re.compile(r'^http', re.IGNORECASE)
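
As a sketch of how this check could be applied (the candidate URLs below are made up):

import re

protocol = re.compile(r'^http', re.IGNORECASE)

# hypothetical candidate URLs read from a raw list
candidates = ['http://www.example.org/page.html', 'ftp://example.org/archive.tar.gz']
# keep only the lines which start with the expected protocol
urls = [line for line in candidates if protocol.match(line)]
print(urls)  # ['http://www.example.org/page.html']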

Then, it is necessary to find and extract URLs nested inside a URL: referrer URLs, links which were not properly parsed, etc.

match = re.search(r'^http.+?(https?://.+?$)', line)
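
If such a nested address is found, the captured group can replace the original line; here is a short sketch with a made-up example:

import re

# hypothetical line from a raw URL list, with a second URL nested inside
line = 'http://www.example.org/redirect?url=https://www.example.net/page.html'
match = re.search(r'^http.+?(https?://.+?$)', line)
if match:
    line = match.group(1)  # keep only the nested URL
print(line)  # https://www.example.net/page.html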

After that, I get rid of URLs pointing to files which are frequent but obviously not text-based, by looking for known extensions both at the end of the URL and inside it:

# obvious extensions
extensions ...
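
The actual extension list in clean_urls.py is not reproduced here; a shortened, hypothetical variant of such a filter could look like this:

import re

# hypothetical, shortened list of extensions which rarely point to text
extensions = re.compile(r'\.(avi|flv|gif|jpe?g|mp3|mp4|png|zip)\b', re.IGNORECASE)

candidates = ['http://www.example.org/images/photo.jpg', 'http://www.example.org/article.html']
# discard the URLs which most probably do not lead to usable text
urls = [line for line in candidates if not extensions.search(line)]
print(urls)  # ['http://www.example.org/article.html']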

Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. At least for the last two of them, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.

Hypotheses:

  1. These platforms account for a relative diversity of user profiles.
  2. Documents that are most likely to be important are being shared.
  3. It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
  4. Microblogging services are a good alternative to overcome the limitations of seed URL collections (as well as the biases implied by search engine optimization techniques and link classification).

Characteristics so far:

  • The messages themselves are not being stored (links are filtered on the fly using a series of heuristics, as sketched at the end of this post).
  • The URLs that are obviously pointing to media documents are discarded, as the final purpose is to be able to build a text corpus.
  • This approach is ‘static’: it does not rely on long-polling requests, but actively fetches the required pages.
  • Starting from the main public timeline, the scripts aim at ...
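
To illustrate the general principle rather than the actual Microblog Explorer code, here is a minimal sketch showing how outgoing links could be extracted from a fetched page with lxml and filtered on the fly using heuristics of the kind described above (the page snippet and the patterns are made up):

import re
from lxml import html

# hypothetical snippet standing in for a fetched public timeline page
page = '''<html><body>
<a href="http://www.example.org/article.html">an article</a>
<a href="http://www.example.org/photo.jpg">a photo</a>
<a href="ftp://example.org/archive.zip">an archive</a>
</body></html>'''

protocol = re.compile(r'^http', re.IGNORECASE)
media = re.compile(r'\.(avi|flv|gif|jpe?g|mp3|mp4|png|zip)\b', re.IGNORECASE)

tree = html.fromstring(page)
# extract the outgoing links and filter them on the fly
seeds = [link for link in tree.xpath('//a/@href')
         if protocol.search(link) and not media.search(link)]
print(seeds)  # ['http://www.example.org/article.html']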