Bits of Language: corpus linguistics, NLP and text analytics

Binary search to find words in a list: Perl tutorial

Given a dictionary, say one of the frequent-word lists of the University of Leipzig, and a series of words: how can you check which ones belong to the list?

Another option would be to use the smart match operator available since Perl 5.10: if ($word ~~ @list) {…} But this gets very slow as the list grows. I wrote a naive implementation of the binary search algorithm in Perl that I would like to share. It is not that fast, though. Basic, but it works.
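To illustrate the idea, here is a minimal sketch of a binary search over a sorted word list. It is not the implementation from the post, which is cut off in this excerpt. The subroutine name binary_search and the sample words are hypothetical, and the list is assumed to be sorted with Perl's default string order.

```perl
use strict;
use warnings;

# Return 1 if $word occurs in the sorted array referenced by $list, else 0.
# Assumption: the list was sorted with Perl's default string order (sort / cmp).
sub binary_search {
    my ($list, $word) = @_;
    my ($low, $high) = (0, $#{$list});
    while ($low <= $high) {
        my $mid = int(($low + $high) / 2);
        my $cmp = $word cmp $list->[$mid];
        if    ($cmp == 0) { return 1; }          # exact match
        elsif ($cmp < 0)  { $high = $mid - 1; }  # continue in the lower half
        else              { $low  = $mid + 1; }  # continue in the upper half
    }
    return 0;
}

# Hypothetical sample data standing in for the real word list.
my @list = sort qw(der die und in den von zu das mit sich);
for my $word (qw(und haus)) {
    printf "%s: %s\n", $word,
        binary_search(\@list, $word) ? 'found' : 'not found';
}
```

Each lookup halves the remaining range, so it needs roughly log2(n) comparisons instead of scanning the whole list, provided the list is sorted once beforehand.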

First of all the wordlist gets read:

my $dict = 'leipzig10000';
open (DICTIONARY, $dict …

more ...

Resource links update

I recently updated the blogroll section and I would also like to share a few links:

a huge selection of links on pattern matching: Pattern Matching Pointers
the comp.text Frequently Asked Questions (Usenet archive)
links on annotation (page in French, links in English) collected by Karën Fort (Paris 13)

As I will be teaching LaTeX soon, the LaTeX links section of the blog has expanded.

Last but not least, here is an e-book, Mining of Massive Datasets, by A. Rajaraman and J. D. Ullman. It grew out of classes taught at Stanford and is now free to use (available chapter …

more ...

Quick review of the Falko Project

The Falko Project is an error-annotated corpus of German as a foreign language, maintained by the Humboldt Universität Berlin, which made it publicly accessible.

Recently a new search engine was made available, practically replacing the old CQP interface. This tool is named ANNIS2 and can handle complex queries on the corpus.

Corpus

There are several subcorpora, and apparently more to come. The texts were written by advanced learners of German. There are most notably summaries (with the original texts and a comparable corpus of summaries written by native speakers), essays that come from different locations (with the same type of comparable …

more ...

Having fun and making money doing research

What do people look for? A few years ago it would have been difficult to gather information on a large scale and to grasp it with a powerful, yet more or less objective tool. Nowadays a single company is able to know what you want, what you buy or what you just did. And sometimes it shares a little bit of the data.

So, the end of the year gives me an occasion to try and discover changes in mentalities using the ready-to-use Google Trends. Just for fun…

How does research compare with other interests?

First of all, research is …

more ...

Three series of recorded lectures

Here is my selection of introductory courses given by well-known specialists in Computer Science or Natural Language Processing and recorded so that they can be followed at home.

1. Artificial Intelligence | Natural Language Processing, Christopher D. Manning, Stanford University.
More than 20 hours, 18 lectures.
Introduction to the key topics of NLP, summary of existing models.
Lecture 12: Dan Jurafsky as a guest lecturer.
Requires the Silverlight plugin (no comment). Transcripts available.

2. Bits, Harry R. Lewis, Harvard University.
A general overview of information as quantity and quantitative methods.
Very comprehensive lecture (data theories, internet protocols, encryption, copyright issues, laws …

more ...

On Text Linguistics

Talking about text complexity in my last post, I did not realize how important it is to take the framework of text linguistics into account. This branch of linguistics is well-known in Germany but is not really regarded as a topic in its own right elsewhere. Most of the time, no one makes a distinction between text linguistics and discourse analysis, although the background is not necessarily the same.

I saw a presentation by Jean-Michel Adam last week, who describes himself as the “last of the Mohicans” to use this framework in French research. He drew a comprehensive picture of its origin …

more ...

E. Castello, Text Complexity and Reading Comprehension Tests - Reading Notes

Here is what I retain from my reading of this book: E. Castello, Text Complexity and Reading Comprehension Tests, Bern: Peter Lang, 2008.

Notional framework

To begin with, Castello identifies two types of complexity, and states that research in this field attempts to quantify inherent complexity and receiver-oriented complexity, i.e. complexity or difficulty per se on one side and in terms of reader and text on the other.

He cites C.J. Alderson and L. Merlini Barbaresi (strangely enough, we are not related, as far as I know) for their definition of linguistic complexity, M. Halliday and T. Gibson …

more ...

Using and parsing the hCard microformat, an introduction

Recently, as I decided to get involved in the design of my personal page, I learned how to represent semantic markup on a web page. I would like to share a few things about writing and parsing semantic information in this format. I have the intuition that it is only the beginning and that there will be more and more formats to describe who you are, what you do, who you are related to, where you link to, and engines that gather this information.

First of all, the hCard microformat points to this standard, hCard 1.0.1. For …

more ...

Commented bibliography on readability assessment

I have selected a few papers on readability published in recent years, all available online (for instance using a specialized search engine, see previous post):

First of all, I reviewed this one last week; it is a very up-to-date article. L. Feng, M. Jansche, M. Huenerfauth, and N. Elhadad, “A Comparison of Features for Automatic Readability Assessment”, 2010, pp. 276-284.
The seminal paper to which Feng et al. often refer, as they combine several approaches, especially statistical language models, support vector machines and more traditional criteria. A comprehensive bibliography. S. E. Schwarm and M. Ostendorf, “Reading level assessment using …

more ...

Comparison of Features for Automatic Readability Assessment: review

I read an interesting article, “featuring” an up-to-date comparison of what is being done in the field of readability assessment:

“A Comparison of Features for Automatic Readability Assessment”, Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad, 23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284.

I am interested in the features they use. Here is a quick summary:

Corpus and tools

Corpus: a sample from the Weekly Reader
OpenNLP to extract named entities and resolve co-references
the Weka learning toolkit for machine learning

Features

Four subsets of discourse features:

more ...