Batch file conversion to the same encoding on Linux

I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found in order to try to convert automatically all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script in order to detect and correct the encoding, but it was everything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.

I was not disappointed, as file for example gives relevant information when used with this syntax: file -i filename. In fact, there are other tools such as enca, but I was luckier with this one.

input: file -i filename
output: filename: text/plain; charset=utf-8

grep -Po ‘…\K…’

First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. In order to do this grep is an option. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match will be printed and not the whole line, and finally the \K tells the ...

more ...

Two open-source corpus-builders for German and French


I already described how to build a basic specialized crawler on this blog. I also wrote about crawling a newspaper website to build a corpus. As I went on work on this issue, I decided to release a few useful scripts under an open-source license.

The crawlers are not just mere link-harvesters, they are designed to be used as corpus-builders. As one cannot republish anything but quotations of the texts, the purpose is to enable others to make their own version of the corpora. Since the newspapers are updated quite often, it is not imaginable to create exact duplicates, that said the majority of the articles will be the same.

Interesting features

The interesting facts are that the crawlers are relatively fast (even if they were not set up for speed) and do not need a lot of computational resources. They may be run on a personal computer.

Due to their specialization, they are able to build a reliable corpus consisting of texts and relevant metadata (e.g. title, author, date and url). Thus, one may gather millions of tokens from home and start exploring the corpus.

The HTML code as well as the superfluous text are stripped in ...

more ...

Using a rule-based tokenizer for German

In order to solve a few tokenization problems and to delimit the sentences properly I decided not to fight with the tokenization anymore and to use an efficient script that would do it for me. There are taggers which integrate a tokenization process of their own, but that’s precisely why I need an independent one, so that I can let the several taggers downstream work on an equal basis.
I found an interesting script written by Stefanie Dipper of the University of Bochum, Germany. It is freely available here : Rule-based Tokenizer for German.


  • It’s written in Perl.
  • It performs a tokenization and a sentence boundary detection.
  • It can output the result in text mode as well as in XML format, including a detailed version where all the space types are qualified.
  • It was created to perform well on German.
    • It comes with an abbreviation list which fits German standards (e.g. the street names like Hauptstr.)
    • It tries to address the problem of the dates in German, which are often written using dots (e.g. 01.01.12), using a “hard-wired list of German date expressions” according to its author.
  • The code is clear and well documented ...
more ...

Parallel work with two taggers

I am working on the part-of-speech-tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS-taggers :

  • on one hand the TreeTagger, a hidden Markov model tagger which uses statistical rules and decision trees,
  • on the other the Stanford POS-Tagger, a bidirectional maximum entropy tagger.

This is easier said than done.

I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German (I don’t know if any of the problems I encountered would be different with a newer or still-to-come version) and the basic version of the TreeTagger with the standard model for German.

A few issues

  • The Stanford-Tagger does not recognize the € symbol, and as in similar cases it adds a word and a tag explaining that the symbol is unknown.
  • There are non-breaking hyphens in my corpus, which (in my opinion) result from a too hasty cleaning of the texts before there where published, or a strange publication software. All the hyphens appear as white spaces, including in the HTML source, but in fact they are a Unicode sign. The TreeTagger treats them as spaces, the Stanford Tagger spits an error, marks ...
more ...

Crawling a newspaper website to build a corpus

Basing on my previous post about specialized crawlers, I will show how I to crawl a French sports newspaper named L’Equipe using scripts written in Perl, which I did lately. For educational purpose, it works by now but it is bound to stop being efficient as soon as the design of the website changes.

Gathering links

First of all, you have to make a list of links so that you have something to start from. Here is the beginning of the script:

#!/usr/bin/perl #assuming you're using a UNIX-based system...
use strict; #because it gets messy without, and because Perl is faster that way
use Encode; #you have to get the correct encoding settings of the pages
use LWP::Simple; #to get the webpages
use Digest::MD5 qw(md5_hex);

Just an explanation on the last line : we are going to use a hash function to shorten the links and make sure we fetch a single page just once.

my $url = ""; #the starting point

$page = get $url; #the variables ought to be defined somewhere before $page = encode(“iso-8859-1”, $page); #because the pages are not in Unicode format push (@done_md5, substr(md5_hex($url), 0, 8 ...

more ...

Building a basic specialized crawler

As I went on crawling again in the last few days I thought it could be helpful to describe the way I do.

Note that it is for educational purpose only (I am not assuming that I built the fastest and most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. That implies I know which links I want to follow just by regular expressions, because I observe how a given website is organized.

I see two (or eventually three) steps in the process, which I will go through giving a few hints in pseudocode.

A shell script

You might want to write a shell script to fire the two main phases automatically and/or to save your results on a regular basis (if something goes wrong after a reasonable amount of explored pages you don’t want to lose all the work, even if it’s mainly CPU time and electricity).

A list of links

If the website has an archive, a sitemap or a general list of its contents you can spare time by picking the interesting links once and for all.

going through a shortlist of archives DO {      fetch page      find ...

more ...

A fast bash pipe for TreeTagger

I have been working with the part-of-speech tagger developed at the IMS Stuttgart TreeTagger since my master thesis. It performs well on german texts as one could easily suppose, since it was one of its primary purposes. One major problem is that it’s poorly documented, so I would like to share the way that I found to pass things to TreeTagger through a pipe.

The first thing is that TreeTagger doesn’t take Unicode strings, as it dates back to the nineties. So you have to convert whatever you pass to ISO-8859-1, which the iconv software with the translit option set does very well. It means here “find an equivalent if the character cannot be exactly translated”.

Then you have to define the options that you want to use. I put the most frequent ones in the example.


The advantage of a pipe is that you can clean the text while passing it to the tagger. Here is one way of doing it, by using the text editor sed to : 1. remove the trailing white lines 2. replace everything that’s more than one space by one space and 3. replacing spaces by new lines.

This way ...

more ...