A one-pass valency-oriented chunker for German

I recently introduced at the LTC‘13 conference a tool I developed to help performing fast text analysis on web corpora: a one-pass valency-oriented chunker for German.

Motivation

It turns out that topological fields together with chunked phrases provide a solid basis for a robust analysis of German sentence structure.”
E. W. Hinrichs, “Finite-State Parsing of German”, in Inquiries into Words, Constraints and Contexts, A. Arppe and et al. (eds.), Stanford: CSLI Publications, pp. 35–44, 2005.

Abstract

Non-finite state parsers provide fine-grained information but they are computationally demanding, so that it can be interesting to see how far a shallow parsing approach is able to go.

The transducer described here consists in a pattern-based matching operation of POS-tags using regular expressions that takes advantage of the characteristics of German grammar. The process aims at finding linguistically relevant phrases with a good precision, which enables in turn an estimation of the actual valency of a given verb.

The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency.

This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher: possible applications include ...

more ...

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since I found relevant but sparse hints on how to do it.

First of all, let’s say that guessing if a website uses WordPress by analysing HTML code is straightforward if nothing was been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterward is an option that should not be too costly if the amount of web pages to analyze is small. However, downloading even a reasonable number of web pages may take a lot of time, that is why other techniques have to be found to address this issue.

The way I chose to do it is twofold, the first filter is URL-based whereas the final selection uses HTTP HEAD requests.

URL Filter

There are webmasters who create a subfolder named “wordpress” which can be seen clearly in the URL, providing a kind of K.O. victory. If the URLs points to a non-text ...

more ...

Review of the Czech internet corpus

Web for “old school” balanced corpus

The Czech internet corpus (Spoustová and Spousta 2012) is a good example of focused web corpora built in order to gather an “old school” balanced corpus encompassing different genres and several text types.

The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience on “web site popularity”. The whole process as well as the target websites are described as follows:

We have chosen to begin with manually selecting, crawling and cleaning particular web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, young mothers discussion fora etc.).” (p. 311)

Boilerplate removal

The boilerplate removal part is specially crafted for each target, the authors speak of “manually written scripts”. Texts are picked within each website according to their knowledge. Still, as the number of documents remains too high to allow for a completely manual selection, the authors use natural language processing methods to avoid duplicates.

Workflow

Their workflow includes:

  1. download of the pages,
  2. HTML and boilerplate removal,
  3. near-duplicate removal,
  4. and finally a language detection, which does not deal with English text but ...
more ...

Overview of URL analysis and classification methods

The analysis of URLs using natural language processing methods has recently become a research topic by itself, all the more since large URL lists are considered as being part of the big data paradigm. Due to the quantity of available web pages and the costs of processing large amounts of data, it is now an Information Retrieval task to try to classify web pages merely by taking their URLs into account and without fetching the documents they link to.

Why is that so and what can be taken away from these methods ?

Interest and objectives

Obviously, the URLs contain clues regarding the ressource they point to. The URL analysis is about getting as much information as possible to try to predict several characteristics of a web page. The results may influence the way the URL is processed: prioritization, delay, building of focused URL groups, etc.

The main goal seems to be to save crawling time, bandwidth and disk space, which are issues everyone confronted to web-scale crawling has to deal with.

However, one could also argue that it is sometimes hard to figure out what hides behind a URL. Kan & Thi (2005) tackle this issue under the assumption that there ...

more ...

Batch file conversion to the same encoding on Linux

I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found in order to try to convert automatically all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script in order to detect and correct the encoding, but it was everything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.

I was not disappointed, as file for example gives relevant information when used with this syntax: file -i filename. In fact, there are other tools such as enca, but I was luckier with this one.

input: file -i filename
output: filename: text/plain; charset=utf-8

grep -Po ‘…\K…’

First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. In order to do this grep is an option. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match will be printed and not the whole line, and finally the \K tells the ...

more ...

Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. At least by the last two of them a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case concerning Twitter for example.

Hypotheses:

  1. These platforms account for a relative diversity of user profiles.
  2. Documents that are most likely to be important are being shared.
  3. It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
  4. Microblogging services are a good alternative to overcome the limitations of seed URL collections (as well as the biases implied by search engine optimization techniques and link classification).

Characteristics so far:

  • The messages themselves are not being stored (links are filtered on the fly using a series of heuristics).
  • The URLs that are obviously pointing to media documents are discarded, as the final purpose is to be able to build a text corpus.
  • This approach is ‘static’, as it does not rely on any long poll requests, it actively fetches the required pages.
  • Starting from the main public timeline, the scripts aim at ...
more ...

What is good enough to become part of a web corpus?

I recently worked at the FU Berlin with Roland Schäfer and Felix Bildhauer on issues related to web corpora. One of them deals with corpus construction: as a matter of fact, web documents can be very different, and even after a proper cleaning it is not rare to see things that could hardly be qualified as texts. While there are indubitably clear cases such as lists of addresses or tag clouds, it is not always obvious to define how extensive the notions of text and corpus are. What’s more, a certain amount of documents just end up too close to call. Nonetheless, this issue has to be addressed, since even a no-decision policy would have consequences, as certain linguistic phenomena become more or less accidentally over- or underrepresented in the final corpus. That is why we believe that linguists and “end users” in general should be aware of this kind of technicalities.

The Good, the Bad, and the Hazy

In an article to be published in the proceedings of the 8th Web as Corpus Workshop, The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction, we show that text quality is not always easy to assess ...

more ...

Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.

Disclaimer : the code lines below work, but I would not suggest that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, there are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the examples are from the Titanic track on Kaggle.

Generalized linear models

titanic.glm <- glm (survived ~ pclass + sex + age + sibsp, data = titanic, family = binomial(link=logit))
glm.pred <- predict.glm(titanic.glm, newdata = titanicnew, na.action = na.pass, type = "response")
cat(glm.pred)`
  • cat’ actually prints the output
  • One might want to use the na.action switch to be able to deal with incomplete data (as in the Titanic dataset) : na.action=na.pass

Link to glm in R manual.

Mixed GAM (Generalized Additive Models) Computation Vehicle

The commands are a little less obvious:

library(mgcv)
titanic.gam <- gam (survived ~ pclass ...
more ...

Data analysis and modeling in R: a crash course

Let’s pretend you recently installed R (a software to do statistical computing), you have a text collection you would like to analyze or classify and some time to lose. Here are a few quick commands that could get you a little further. I also write this kind of cheat sheet in order to remember a set of useful tricks and packages I recently gathered and from which I thought they could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each letter from a to z (all lowercase). By the way, it goes as simple as that using Perl and regular expressions (provided you have a $text variable):

my @letters = ("a" .. "z");
foreach my $letter (@letters) {
    my $letter_count = () = $text =~ /$letter/gi;
    printf "%.3f", (($letter_count/length($text))*100);
}

First tests in R

After having started R (‘R’ command), one usually wants to import data. In this case, my file type is TSV (Tab-Separated Values) and the first row contains only describers (from ‘a’ to ‘z’), which comes at hand later. This is done using the read.table command.

alpha <- read.table("letters_frequency ...
more ...

Blind reason, Leibniz and the age of cybernetics

I would like to introduce an article resulting from a talk I recently held. I had the chance to speak at a conference for young researchers in philosophy held at the Université Paris-Est Créteil. The global frame was the criticism of ratio and rationalism during the 20th century. In order to illustrate such a criticism, I tackled the idea of blind reason in an article entitled ‘La Raison aveugle ? L’époque cybernétique et ses dispositifs‘, which I made available online (PDF file).

Brief summary

In a late interview, Martin Heidegger states that philosophy is bound to be replaced by cybernetics. Starting from this contestable point of view, I try to describe the value of the cybernetics paradigm for philosophy of technology.

I already mentioned the work of Gilbert Hottois on this blog (see the philosophy of technology category). In this article, I shed light on the relationship between what Hottois calls ‘operative techno-logy’ (in a functional sense) and the origins of this notion, dating back, according to him, to the calculability of signs by Leibniz, who writes about this particular type of combinatorial way to gain knowledge that it is ‘blind’ (cognitio caeca vel symbolica).

On one hand, that which ...

more ...