Rule-based URL cleaning for text collections

I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script named clean_urls.py which I made available under a open-source license as a part of the FLUX-toolchain.

The following Python-based regular expressions show how malformed URLs, URLs leading to irrelevant content as well as URLs which obviously lead to adult content and spam can be filtered using a rule-based approach.

Avoid recurrent sites and patterns to save bandwidth

First, it can be useful to make sure that the URL was properly …

more ...

Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.

Disclaimer: the code lines below work, but I would not suggest that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, there are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the …

more ...

Data analysis and modeling in R: a crash course

Let’s pretend you recently installed R (a software to do statistical computing), you have a text collection you would like to analyze or classify and some time to lose. Here are a few quick commands that could get you a little further. I also write this kind of cheat sheet in order to remember a set of useful tricks and packages I recently gathered and from which I thought they could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each …

more ...

Completing web pages on the fly with JavaScript

As I am working on a new release of the German Political Speeches Corpus, I looked for a way to make web pages lighter. I have lots of repetitive information, so that a little JavaScript is helpful when it comes to save on file size. Provided that the DOM structure is available, there are elements that may be completed on load.

For example, there are span elements which include specific text. By catching them and testing them against a regular expression the script is able to add attributes (like a class) to the right ones. Without activating JavaScript one still …

more ...

Display long texts with CSS, tutorial and example

Last week, I improved the CSS file that displays the (mostly long) texts of the German Political Speeches Corpus, which I introduced in my last post (“Introducing the German Political Speeches Corpus and Visualization Tool”). The texts should be easier to read now (though I do not study this kind of readability), you can see an example here (BP text 536).

I looked for ideas to design a clean and simple layout, but I did not find what I needed. So I will outline in this post the main features of my new CSS file:

  • First of all, margins, font-size …

more ...

Crawling a newspaper website to build a corpus

Basing on my previous post about specialized crawlers, I will show how I to crawl a French sports newspaper named L’Equipe using scripts written in Perl, which I did lately. For educational purpose, it works by now but it is bound to stop being efficient as soon as the design of the website changes.

Gathering links

First of all, you have to make a list of links so that you have something to start from. Here is the beginning of the script:

#!/usr/bin/perl #assuming you're using a UNIX-based system...
use strict; #because it gets messy without …
more ...

Building a basic specialized crawler

As I went on crawling again in the last few days I thought it could be helpful to describe the way I do.

Note that it is for educational purpose only (I am not assuming that I built the fastest and most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. That implies I know which links I want to follow just by regular expressions, because I observe how a given website is organized.

I see two (or eventually three) steps in the process, which I will go through giving a few hints in …

more ...

Binary search to find words in a list: Perl tutorial

Given a dictionary, say one of the frequent words lists of the University of Leipzig, given a series of words: How can you check which ones belong to the list ?

Another option would be to use the operator available since Perl 5.10: :::perl if ($word ~~ @list) {…} But this gets very slow if the size of the list increases. I wrote a naive implementation of the binary search algorithm in Perl that I would like to share. It is not that fast though. Basic but it works.

First of all the wordlist gets read:

my $dict = 'leipzig10000';
open (DICTIONARY, $dict …
more ...

A fast bash pipe for TreeTagger

I have been working with the part-of-speech tagger developed at the IMS Stuttgart TreeTagger since my master thesis. It performs well on german texts as one could easily suppose, since it was one of its primary purposes. One major problem is that it’s poorly documented, so I would like to share the way that I found to pass things to TreeTagger through a pipe.

The first thing is that TreeTagger doesn’t take Unicode strings, as it dates back to the nineties. So you have to convert whatever you pass to ISO-8859-1, which the iconv software with the …

more ...