Binary search to find words in a list: Perl tutorial

Given a dictionary, say one of the frequent words lists of the University of Leipzig, given a series of words: How can you check which ones belong to the list ?

Another option would be to use the operator available since Perl 5.10: :::perl if ($word ~~ @list) {…} But this gets very slow if the size of the list increases. I wrote a naive implementation of the binary search algorithm in Perl that I would like to share. It is not that fast though. Basic but it works.

First of all the wordlist gets read:

my $dict = 'leipzig10000';
open (DICTIONARY, $dict) or die "Error...: $!\n";
my @list = ;
close (DICTIONARY) or die "Error...: $!\n";
my $number = scalar @list;

Then you have to initialize the values (given a list @words) and the upper bound:

my $start= scalar @words;
my $log2 = 0;
my $n = $number;
my ($begin, $end, $middle, $i, $word);
my $a = 0;
while ($n > 1){
    $n = $n / 2;
    $log2 = $log2 + 1;

Then the binary search can start:

foreach $mot (@mots) {
    $begin = 0;
    $end = $number - 1;
    $middle = int($number/2);
    $word =~ tr/A-Z/a-z/;
    $i = 0;
    while($i < $log2 + 1){
        if ($word eq lc($list[$middle])){
        elsif ($word …
more ...

Resource links update

I recently updated the blogroll section and I also would like to share a few links:

As I will be teaching LaTeX soon the LaTeX links section of the blog has expanded.

Last but not least, here is an E-Book, Mining of Massive Datasets, by A. Rajaraman and J. D. Ullmann. It was made of classes taught at Stanford and is now free to use (available chapter by chapter or as a whole), very up-to-date and informative on this hot topic. It seems to be a good introduction as well. That said I cannot really review it since I am not an expert of this research field.

Here is the reference:

  • A. Rajaraman and J. D. Ullmann, Mining of Massive Datasets, Stanford, Palo Alto, CA: e-book, 2010.
more ...

Quick review of the Falko Project

The Falko Project is an error-annotated corpus of German as a foreign language, maintained by the Humboldt Universität Berlin who made it publicly accessible.

Recently a new search engine was made available, practically replacing the old CQP interface. This tool is named ANNIS2 and can handle complex queries on the corpus.


There are several subcorpora, and apparently more to come. The texts were written by advanced learners of German. There are most notably summaries (with the original texts and a comparable corpus of summaries written by native-speakers), essays who come from different locations (with the same type of comparable corpus) and a ‘longitudinal’ corpus coming from students of the Georgetown-University of Washington.

The corpora are annotated by a part-of-speech tagger (the TreeTagger) so that word types and lemmas are known but most of all the mistakes can be found, with several hypotheses at different levels (mainly what the correct sentence would be and what might be the reason of the mistake).


The engine (ANNIS2) has a good tutorial (in English by the way) so that it is not that difficult to search for complex patterns across the subcorpora. It seems also efficient in terms of speed. You may …

more ...

Having fun and making money doing research

What do people look for ? A few years ago it would have been difficult to gather information at a large scale and grab it with a powerful, yet more or less objective tool. Nowadays a single company is able to know what you want, what you buy or what you just did. And sometimes it shares a little bit of the data.

So, the end of the year gives me an occasion to try and discover changes in the mentalities using the ready-to-use Google Trends. Just for fun…

How does research compare with other interests ?

First of all, research is no fun, it was more requested than money and was at the level of work, but things have changed. It still outnumbers fun in the news though.

A few trends regarding research

A few trends regarding research, “Research is no fun”… Source: Google), worldwide trends.

People seem to look for money more often than a few years ago, it’s the only thing which becomes more popular, even work just remains stable.

A remark: I think the search volume is much more bigger now than it was back in 2004, there are also more languages available, and probably more search terms (since the users may …

more ...

Three series of recorded lectures

Here is my selection of introductory courses given by well-known specialists in Computer Science or Natural Language Processing and recorded so that they can be followed at home.

1. Artificial Intelligence | Natural Language Processing, Christopher D. Manning, Stanford University.
More than 20 hours, 18 lectures.
Introduction to the key topics of NLP, summary of existing models.
Lecture 12 : Dan Jurafsky as a guest lecturer.
Requires the Silverlight plugin (no comment). Transcripts available.

2. Bits, Harry R. Lewis, Harvard University.
A general overview of information as quantity and quantitative methods.
Very comprehensive lecture (data theories, internet protocols, encryption, copyright issues, laws…), cut in small pieces for you to pick a focused topic.
Several formats available, links to blog posts.

3. Search Engines: Technology, Society, and Business, various lecturers, UC Berkeley.
Fall 2007, 13 lectures.
Overview of the topic.
Requires iTunes (no comment).

more ...

On Text Linguistics

Talking about text complexity in my last post, I did not realize how important it is to take the framework of text linguistics into account. This branch of linguistics is well-known in Germany but is not really meant as a topic by itself elsewhere. Most of the time, no one makes a distinction between text linguistics and discourse analysis, although the background is not necessarily the same.

I saw a presentation by Jean-Michel Adam last week, who describes himself as the “last of the Mohicans” to use this framework in French research. He drew a comprehensive picture of its origin and its developments which I am going to try and sum up.

This field started to become popular in the ‘70s with books by Eugenio Coseriu, Harald Weinrich (in Germany), Frantisek Danek (and the Functional Sentence Perspective Framework) or MAK Halliday who was a lot more read in English-speaking countries. Text linguistics is not a grammatical description of language, nor is it bound to a particular language. It is a science of the texts, a theory which comes on top of several levels such as semantics or structure analysis. It enables to distinguish several classes of texts at a global …

more ...

E. Castello, Text Complexity and Reading Comprehension Tests - Reading Notes

Here is what I retain from my reading of this book: * E. Castello, Text Complexity and Reading Comprehension Tests, Bern: Peter Lang, 2008.

Notional framework

To begin with, Castello identifies two types of complexity, and states that research in this field attempts to quantify inherent complexity and receiver-oriented complexity, i.e. complexity or difficulty per se on one side and in terms of reader and text on the other.

He cites C.J. Alderson and L. Merlini Barbaresi (strangely enough, we are not related, as far as I know) for their definition of linguistic complexity, M. Halliday and T. Gibson regarding lexical information, S. Urquhart and C. Weir for their work on different types of reading.


Erik Castello uses a series of measures, most notably:

  1. word-related:
  2. - type/token ratio - word frequency lists - lexical density - lexical variation - lexical density: difference between lexical and grammatical words, multi-word units, research on word families
  3. clause-related:
  4. - clause type ratio, meaning for instance the ratio between hypotactic clauses and clause complexes - grammatical intricacy
  5. sentence-related:
  6. - readability formulas

He mentions an interesting idea: to try and capture the intention of the writer according to a given situation, which can be compared with measures on discourse level (see …

more ...

Using and parsing the hCard microformat, an introduction

Recently, as I decided to get involved in the design of my personal page, I learned how to represent semantic markup on a web page. I would like to share a few things about writing and parsing semantic information in this format. I have the intuition that it is only the beginning and that there will be more and more formats to describe who you are, what do you do, who your are related to, where you link to, and engines that gather these informations.

First of all, the hCard microformat points to this standard, hCard 1.0.1.  For an explanation of what it is, see here on, for a global article on microformats see also Wikipedia.

The information displayed is useful as it is a way to markup semantic relations, so that named entities are correctly identified. By search engines for instance : Google supports several formats, including hCard, and there are more specific search engines which aim at gathering informations such as a contact or a product list starting from this kind of markup. For a comprehensive list see here.

Now, if you are interested in parsing microformats, there are several tools. Among them, my pick …

more ...

Commented bibliography on readability assessment

I have selected a few papers on readability published in the last years, all available online (for instance using a specialized search engine, see previous post):

  1. First of all, I reviewed this one last week, it is a very up-to-date article. L. Feng, M. Jansche, M. Huenerfauth, and N. Elhadad, “A Comparison of Features for Automatic Readability Assessment”, 2010, pp. 276-284.
  2. The seminal paper to which Feng et al. often refers, as they combine several approaches, especially statistical language models, support vector machines and more traditional criteria. A comprehensive bibliography. S. E. Schwarm and M. Ostendorf, “Reading level assessment using support vector machines and statistical language models”, in Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, 2005, pp. 523-530.
  3. A complementary approach, also a combination of features, this time mainly of lexical and grammatical ones, with a focus on the latter, as the authors use parse trees and subtrees (i.e. «relative frequencies of partial syntactic derivations») at three different levels. I found this convincing. A comparison of three statistical models: Linear Regression, Proportional Odds Model and Multi-class Logistic Regression. M. Heilman, K. Collins-Thompson, and M. Eskenazi, “An analysis of statistical models and features for reading difficulty …
more ...

Comparison of Features for Automatic Readability Assessment: review

I read an interesting article, “featuring” an up-to-date comparison of what is being done in the field of readability assessment:

A Comparison of Features for Automatic Readability Assessment”, Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad, 23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284.

I am interested in the features they use. Let’s summarize, I am going to do a quick recension:

Corpus and tools

  • Corpus: a sample from the Weekly Reader
  • OpenNLP to extract named entities and resolve co-references
  • the Weka learning toolkit for machine learning


  • Four subsets of discourse features:
  • 1. entity-density features 2. lexical-chain features (chains rely on semantic relations as they are automatically detected) 3. co-reference inference features (a research novelty) 4. entity grid features (transition patterns according to the grammatical roles of the words)
  • Language Modeling Features, i.e. train language models
  • Parsed Syntactic Features, such as parse tree height
  • POS-based Features
  • Shallow Features, i.e. traditional readability metrics
  • Other features, mainly “perplexity features” according to Schwarm and Ostendorf (2005), see below


  • Combining discourse features doesn’t significantly improve accuracy, discourse features do not seem to be useful.
  • Language models trained with information gain outperform those trained …
more ...