A module to extract date information from web pages


Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust scraping patterns, there are for example webpages for which neither the URL nor the server response provide a reliable way to date the document, that is find when it was written.


I regularly work on improving the extraction methods for the web collections at my home institutions. They are unique as they combine both the quantity resulting from broad web crawling and the quality obtained by carefully extracting text and metadata as well as rejecting documents that do not match certain criteria. In that sense, I already published work on methods to derive metadata from web documents in order ...

more ...

Indexing text with ElasticSearch

The Lucene-based search engine Elasticsearch is fast and adaptable, so that it suits most demanding configurations, including large text corpora. I use it daily with tweets and began to release the scripts I use to do so. In this post, I give concrete tips for indexation of text and linguistic analysis.


You do not need to define a type for the indexed fields, the database can guess it for you, however it speeds up the process and gives more control to use a mapping. The official documentation is extensive and it is sometimes difficult to get a general idea of how to parametrize indexation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Interesting options which are better specified before indexation include similarity scoring as well as term frequencies and positions.

Linguistic analysis

The string data type allows for the definition of the linguistic analysis to be used (or not) during indexation.

Elasticsearch ships with a series of language analysers which can be used for language-aware tokenization and indexation. Given a “text” field in German, here is where it happens in the mapping:

  "text": {
    "type" : "string",
    "index" : "analyzed",
    "analyzer" : "german",

Beyond that, it is possible to write ...

more ...

Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple steps, as soon as exhaustive conversions and/or quality control is needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.


A confortable installation is apt-get install python-lxml on Debian/Ubuntu, but the underlying packages may be old. The more pythonic way would be to make sure all the necessary libraries are installed (something like apt-get install libxml2-dev libxslt1-dev python-dev), and then using a package manager such as pip: pip install lxml.

Parsing HTML

Here are the modules required for basic manipulation:

from __future__ import print_function
from lxml import etree, html
from StringIO import StringIO

And here is how to read a file, supposing it is valid Unicode (it is not necessarily the case). The StringIO buffering is probably not the most direct way, but I found ...

more ...

Rule-based URL cleaning for text collections

I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script named clean_urls.py which I made available under a open-source license as a part of the FLUX-toolchain.

The following Python-based regular expressions show how malformed URLs, URLs leading to irrelevant content as well as URLs which obviously lead to adult content and spam can be filtered using a rule-based approach.

Avoid recurrent sites and patterns to save bandwidth

First, it can be useful to make sure that the URL was properly parsed before making it into the list, the very first step would be to check whether it starts with the right protocol (ftp is deemed irrelevant in my case).

protocol = re.compile(r'^http', re.IGNORECASE)

Then, it is necessary to find and extract URLs nested inside of a URL: referrer URLs, links which were not properly parsed, etc.

match = re.search(r'^http.+?(https?://.+?$)', line)

After that, I look at the end of the URLset rid of URLs pointing to files which are frequent but obviously not text-based, both at the end and inside the URL:

# obvious extensions
extensions ...
more ...

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since I found relevant but sparse hints on how to do it.

First of all, let’s say that guessing if a website uses WordPress by analysing HTML code is straightforward if nothing was been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterward is an option that should not be too costly if the amount of web pages to analyze is small. However, downloading even a reasonable number of web pages may take a lot of time, that is why other techniques have to be found to address this issue.

The way I chose to do it is twofold, the first filter is URL-based whereas the final selection uses HTTP HEAD requests.

URL Filter

There are webmasters who create a subfolder named “wordpress” which can be seen clearly in the URL, providing a kind of K.O. victory. If the URLs points to a non-text ...

more ...

Batch file conversion to the same encoding on Linux

I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found in order to try to convert automatically all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script in order to detect and correct the encoding, but it was everything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.

I was not disappointed, as file for example gives relevant information when used with this syntax: file -i filename. In fact, there are other tools such as enca, but I was luckier with this one.

input: file -i filename
output: filename: text/plain; charset=utf-8

grep -Po ‘…\K…’

First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. In order to do this grep is an option. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match will be printed and not the whole line, and finally the \K tells the ...

more ...

Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.

Disclaimer: the code lines below work, but I would not suggest that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, there are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the examples are from the Titanic track on Kaggle.

Generalized linear models

titanic.glm <- glm (survived ~ pclass + sex + age + sibsp, data = titanic, family = binomial(link=logit))
glm.pred <- predict.glm(titanic.glm, newdata = titanicnew, na.action = na.pass, type = "response")
  • cat’ actually prints the output
  • One might want to use the na.action switch to be able to deal with incomplete data (as in the Titanic dataset) : na.action=na.pass

Link to glm in R manual.

Mixed GAM (Generalized Additive Models) Computation Vehicle

The commands are a little less obvious:

titanic.gam <- gam (survived ~ pclass ...
more ...

Data analysis and modeling in R: a crash course

Let’s pretend you recently installed R (a software to do statistical computing), you have a text collection you would like to analyze or classify and some time to lose. Here are a few quick commands that could get you a little further. I also write this kind of cheat sheet in order to remember a set of useful tricks and packages I recently gathered and from which I thought they could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each letter from a to z (all lowercase). By the way, it goes as simple as that using Perl and regular expressions (provided you have a $text variable):

my @letters = ("a" .. "z");
foreach my $letter (@letters) {
    my $letter_count = () = $text =~ /$letter/gi;
    printf "%.3f", (($letter_count/length($text))*100);

First tests in R

After having started R (‘R’ command), one usually wants to import data. In this case, my file type is TSV (Tab-Separated Values) and the first row contains only describers (from ‘a’ to ‘z’), which comes at hand later. This is done using the read.table command.

alpha <- read.table("letters_frequency ...
more ...

Find and delete LaTeX temporary files

This morning I was looking for a way to delete the dispensable aux, bbl, blg, log, out and toc files that a pdflatex compilation generates. I wanted it to go through directories so that it would eventually find old files and delete them too. I also wanted to do it from the command-line interface and to integrate it within a bash script.

As I didn’t find this bash snippet as such, i.e. adapted to the LaTeX-generated files, I post it here :

find . -regex ".*\(aux\|bbl\|blg\|log\|nav\|out\|snm\|toc\)$" -exec rm -i {} \;

This works on Unix, probably on Mac OS and perhaps on Windows if you have Cygwin installed.


  • Find goes here through all the directories starting from where you are (.), it could also go through absolutely all directories (/) or search your Desktop for instance (something like \$Home/Desktop/).
  • The regular expression captures files ending with the (expandable) given series of letters, but also files with no extension which end with it (like test-aux).
    If you want it to stick to file extensions you may prefer this variant :
    find . \( -name "*.aux" -or -name "*.bbl" -or -name "*.blg" ... \)
  • The second part really removes the files that ...
more ...

Binary search to find words in a list: Perl tutorial

Given a dictionary, say one of the frequent words lists of the University of Leipzig, given a series of words: How can you check which ones belong to the list ?

Another option would be to use the operator available since Perl 5.10: :::perl if ($word ~~ @list) {…} But this gets very slow if the size of the list increases. I wrote a naive implementation of the binary search algorithm in Perl that I would like to share. It is not that fast though. Basic but it works.

First of all the wordlist gets read:

my $dict = 'leipzig10000';
open (DICTIONARY, $dict) or die "Error...: $!\n";
my @list = ;
close (DICTIONARY) or die "Error...: $!\n";
my $number = scalar @list;

Then you have to initialize the values (given a list @words) and the upper bound:

my $start= scalar @words;
my $log2 = 0;
my $n = $number;
my ($begin, $end, $middle, $i, $word);
my $a = 0;
while ($n > 1){
    $n = $n / 2;
    $log2 = $log2 + 1;

Then the binary search can start:

foreach $mot (@mots) {
    $begin = 0;
    $end = $number - 1;
    $middle = int($number/2);
    $word =~ tr/A-Z/a-z/;
    $i = 0;
    while($i < $log2 + 1){
        if ($word eq lc($list[$middle])){
        elsif ($word ...
more ...