Batch file conversion to the same encoding on Linux

I recently had to deal with a series of files in different encodings within the same corpus, and I would like to share the solution I found to automatically convert all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script to detect and correct the encoding myself, but it was anything but easy, so I decided to rely on standard UNIX tools instead, assuming they would be adequate and robust enough.

I was not disappointed, as file, for example, gives relevant information when used …

more ...
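
Since two other posts on this page use R, here is what the approach can look like when driven from R. This is a minimal sketch under stated assumptions, not the script from the post: the UNIX file utility does the detection, R’s built-in iconv() does the conversion, and the directory name "corpus" is a placeholder.

    # Detect each file's encoding with the UNIX 'file' utility, then convert
    # the file to UTF-8 in place with R's built-in iconv().
    # "corpus" is a placeholder directory; files reported as "binary" or
    # "unknown-8bit" would need manual inspection instead.
    files <- list.files("corpus", full.names = TRUE)
    for (f in files) {
        # 'file -bi' prints something like "text/plain; charset=iso-8859-1"
        mime <- system2("file", args = c("-bi", shQuote(f)), stdout = TRUE)
        charset <- sub(".*charset=", "", mime)
        if (charset %in% c("utf-8", "us-ascii")) next  # nothing to do
        txt <- readLines(f, warn = FALSE)
        writeLines(iconv(txt, from = charset, to = "UTF-8"), f)
    }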

Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. For at least the last two of these, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.

Hypotheses:

  1. These platforms account for a relative diversity of user profiles.
  2. Documents that are most likely to be important are being shared.
  3. It becomes possible to cover languages that are more rarely seen on the Internet, below the radar of English-speaking spammers.
  4. Microblogging services are …
more ...
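
To make the harvesting idea concrete, here is an illustrative sketch, not the Microblog Explorer’s actual code: it fetches one page of a public Reddit listing and keeps the outbound links as candidate crawl seeds. It assumes the httr and jsonlite packages; the subreddit and the User-Agent string are placeholders (Reddit expects a descriptive one).

    # Illustrative sketch: harvest outbound links from one page of a public
    # Reddit listing, to be used as web crawling seeds.
    library(httr)     # assumed to be installed
    library(jsonlite) # assumed to be installed

    # Reddit serves listings as JSON and expects a descriptive User-Agent.
    res <- GET("https://www.reddit.com/r/linguistics/.json?limit=100",
               user_agent("seed-harvesting-sketch/0.1"))
    listing <- fromJSON(content(res, as = "text", encoding = "UTF-8"))

    # Keep external URLs only, discarding self-posts hosted on reddit.com.
    urls <- listing$data$children$data$url
    seeds <- urls[!grepl("reddit\\.com", urls)]
    head(seeds)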

What is good enough to become part of a web corpus?

I recently worked at the FU Berlin with Roland Schäfer and Felix Bildhauer on issues related to web corpora. One of them deals with corpus construction: as a matter of fact, web documents can be very different from one another, and even after proper cleaning it is not rare to find material that could hardly qualify as text. While there are indubitably clear cases, such as lists of addresses or tag clouds, it is not always obvious how far the notions of text and corpus extend. What’s more, a certain number of documents just end up too close …

more ...

Recipes for several model fitting techniques in R

As I recently tried several modeling techniques in R, I would like to share some of these, with a focus on linear regression.

Disclaimer: the lines of code below work, but I would not claim that they are the most efficient way to deal with this kind of data (as a matter of fact, all of them score slightly below 80% accuracy on the Kaggle datasets). Moreover, they are not always the most efficient way to implement a given model.

I see it as a way to quickly test several frameworks without going into details.

The column names used in the …

more ...
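
To give an idea of what such recipes look like, here is a minimal sketch in base R of two of the techniques, linear and logistic regression. The data frame and column names are placeholders (the post’s actual column names are truncated above), and the toy data is random, so the scores are meaningless.

    # Placeholder data: a binary outcome and two numeric features.
    train <- data.frame(label = rbinom(100, 1, 0.5),
                        f1 = rnorm(100), f2 = rnorm(100))

    # Linear regression
    fit.lm <- lm(label ~ f1 + f2, data = train)
    summary(fit.lm)

    # Logistic regression (a GLM with a binomial family), usually a more
    # natural choice for a binary outcome
    fit.glm <- glm(label ~ f1 + f2, data = train, family = binomial)
    predicted <- ifelse(predict(fit.glm, type = "response") > 0.5, 1, 0)
    mean(predicted == train$label)  # in-sample accuracy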

Data analysis and modeling in R: a crash course

Let’s say you have recently installed R (a piece of software for statistical computing), you have a text collection you would like to analyze or classify, and some time to spare. Here are a few quick commands that could take you a little further. I also write this kind of cheat sheet in order to remember a set of useful tricks and packages I recently gathered, and which I thought could help others too.

Letter frequencies

In this example I will use a series of characteristics (or features) extracted from a text collection, more precisely the frequency of each …

more ...
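
As a taste of that feature extraction step, here is a minimal sketch computing relative letter frequencies for a single document; the file name is a placeholder.

    # Relative frequency of each letter in one text file ("text01.txt" is
    # a placeholder name). 'letters' is R's built-in vector of a-z.
    txt <- tolower(paste(readLines("text01.txt", warn = FALSE), collapse = " "))
    chars <- strsplit(txt, "")[[1]]
    chars <- chars[chars %in% letters]     # drop digits, punctuation, spaces
    freqs <- table(chars) / length(chars)  # relative frequencies
    round(sort(freqs, decreasing = TRUE), 3)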

Blind reason, Leibniz and the age of cybernetics

I would like to introduce an article resulting from a talk I recently gave. I had the chance to speak at a conference for young researchers in philosophy held at the Université Paris-Est Créteil. The general theme was the criticism of ratio and rationalism during the 20th century. To illustrate this criticism, I tackled the idea of blind reason in an article entitled ‘La Raison aveugle ? L’époque cybernétique et ses dispositifs’, which I made available online (PDF file).

Brief summary

In a late interview, Martin Heidegger states that philosophy is bound to be replaced by cybernetics …

more ...

A note on Computational Models of Psycholinguistics

I would like to sum up a clear synthesis and state of the art of the scientific traditions and ways of dealing with language features as a whole. In a chapter entitled ‘Computational Models of Psycholinguistics’, published in the Cambridge Handbook of Psycholinguistics, Nick Chater and Morten H. Christiansen distinguish three main traditions in psycholinguistic language modeling:

  • a symbolic (Chomskyan) tradition
  • connectionist psycholinguistics
  • probabilistic models

They state that until recently the Chomskyan approach (as well as nativist theories of language in general) far outweighed any other, laying the groundwork for cognitive science:

Chomsky’s arguments concerning the formal …

more ...

Feeding the COW at the FU Berlin

I am now part of the COW project (COrpora on the Web), which has been carried out at the FU Berlin for about two years by (amongst others) Roland Schäfer and Felix Bildhauer. A lot of work has already been done, especially concerning long-haul crawls in several languages.

Resources

A few resources have already been made available: software, n-gram models, as well as web-crawled corpora, which for copyright reasons cannot be downloaded as a whole. They may be accessed through a special interface (COLiBrI – COW’s Light Browsing Interface) or downloaded upon request in a scrambled form (all sentences randomly reordered).

This is …

more ...
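
To make the ‘scrambled form’ concrete, here is a two-line sketch of what randomly reordering the sentences of a corpus amounts to, assuming one sentence per line and a placeholder file name; the COW project’s actual procedure may of course differ.

    # The sentences survive, but their order (and thus the running text,
    # hence the copyright concern) does not.
    sentences <- readLines("corpus_sentences.txt", warn = FALSE)
    writeLines(sample(sentences), "corpus_scrambled.txt")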

Ludovic Tanguy on Visual Analysis of Linguistic Data

In his professorial thesis (or habilitation thesis), which is about to be made public (the defence takes place next week), Ludovic Tanguy explains why and under what conditions data visualization could help linguists. In a previous post, I showed a few examples of visualization applied to the field of readability assessment. Tanguy’s questioning is more general: it has to do with what should be included in the disciplinary field of linguistics.

He gives a few reasons to use methods from the emerging field of visual analytics and mentions some of its proponents (like Daniel Keim or Jean-Daniel Fekete …

more ...

Review of the readability checker DeLite

Continuing a series of reviews on readability assessment, I would like to describe a tool that is close to what I intend to do. It is called DeLite and is described as a ‘readability checker’. It was developed at the IICS research center of the FernUniversität in Hagen.

From my point of view, its main characteristic is that it has not been made publicly available: it is based on software one has to buy, and I did not manage to find even a demo version, although the project claims to have been publicly (i.e. EU-)funded. Thus, my description is based …

more ...