Parallel work with two taggers

I am working on the part-of-speech tagging of the German political speeches corpus, and I would like to get tags from two different kinds of POS-taggers:

  • on the one hand the TreeTagger, a hidden Markov model tagger which uses decision trees to estimate its transition probabilities,
  • on the other the Stanford POS-Tagger, a bidirectional maximum entropy tagger.

This is easier said than done.

I am using the 2011-05-18 version of the Stanford Tagger with its standard models for German (I don’t know if any of the problems I encountered would be different with a newer or still-to-come version) and the basic version of the TreeTagger with the standard model for German.
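For reference, here is a sketch of how both taggers could be driven from one Python script; all paths and the model name ("german-fast.tagger") are assumptions about a standard local installation and have to be adjusted:

```python
# Sketch: run TreeTagger and the Stanford Tagger on the same input file.
import subprocess

INPUT = "speech.txt"  # placeholder: one German text

# TreeTagger ships wrapper scripts; the German one chains tokenizer,
# tagger and lexicon lookup.
treetagger_out = subprocess.run(
    ["cmd/tree-tagger-german", INPUT],
    capture_output=True, text=True, check=True,
).stdout

# The Stanford Tagger (2011-05-18) runs on the JVM; "german-fast.tagger"
# is one of the German models distributed with that release.
stanford_out = subprocess.run(
    ["java", "-mx1g", "-classpath", "stanford-postagger.jar",
     "edu.stanford.nlp.tagger.maxent.MaxentTagger",
     "-model", "models/german-fast.tagger",
     "-textFile", INPUT],
    capture_output=True, text=True, check=True,
).stdout

print(treetagger_out.splitlines()[:5])
print(stanford_out.splitlines()[:5])
```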

A few issues

  • The Stanford Tagger does not recognize the € symbol; as in similar cases, it adds a word and a tag indicating that the symbol is unknown.
  • There are non-breaking hyphens in my corpus, which (in my opinion) result from a too hasty cleaning of the texts before they were published, or from a strange publication software. All these hyphens display as white space, even in the HTML source, but they are in fact a Unicode character (U+2011; see the normalization sketch after this list). The TreeTagger treats them as spaces, the Stanford Tagger spits out an error, marks …
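Whatever each tagger ends up doing with these characters, a small normalization pass before tagging sidesteps both problems. Here is a minimal sketch; the replacement choices (a plain hyphen, the word "Euro") are mine, not something either tool prescribes:

```python
# Sketch: normalize characters that trip up the taggers before feeding them text.
# U+2011 is the non-breaking hyphen; the replacements are arbitrary choices.
def normalize(text: str) -> str:
    text = text.replace("\u2011", "-")      # non-breaking hyphen -> plain hyphen
    text = text.replace("\u20ac", "Euro")   # € sign -> a word the taggers know
    return text

sample = "Ein 3\u2011Punkte\u2011Plan kostet 5 \u20ac."
print(normalize(sample))  # Ein 3-Punkte-Plan kostet 5 Euro.
```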

Building a topic-specific corpus out of two different corpora

I have two corpora (say, from crawling two websites) which sometimes focus on the same topics. I would like to try and merge them in order to build a balanced and coherent corpus. As this is a much-discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research and don't know how far I am going to go with either corpus, I want to keep it simple.

One of the appropriate techniques (if not the best)

I could do it using LSA (in this particular case Latent Semantic Analysis, not lysergic acid amide!) or, to be more precise, Latent Semantic Indexing (LSI).

As this technical report shows, it can perform well in this kind of case: B. Pincombe, Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus, Australian Department of Defence, 2004 (full text available online or through any good search engine; see previous post).
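For illustration, here is what such an LSI pipeline could look like with scikit-learn; the toy documents and the number of dimensions are placeholders, and Pincombe's report of course used different data and settings:

```python
# Sketch: LSI-style pairwise document similarity across two corpora.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus_a = ["Rede über Steuerpolitik und Haushalt", "Rede über Schulen und Bildung"]
corpus_b = ["Artikel über Steuern und Haushalt", "Artikel über Bildungspolitik"]
docs = corpus_a + corpus_b

# Term-document matrix, then truncated SVD: this is Latent Semantic Indexing.
tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)  # few docs, few dimensions

# Cosine similarity in the reduced space; high values across the two corpora
# point to documents that could be paired in a merged, balanced corpus.
sim = cosine_similarity(lsi[:len(corpus_a)], lsi[len(corpus_a):])
print(sim)
```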

This could be a topic for later research.

The approach that I am working on (not quick and dirty but simpler and hopefully robust)

For exercise …


Collecting academic papers

I would like to build a corpus from a variety of scientific papers of a given field in a given language (German).

The problems of crawling aside, I wonder whether there is a way to do this automatically. All the papers I have read deal with hand-collected corpora.

The Open Archives format might be a good workaround (see The Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH). As far as I know it is widespread, and there are search engines that look for academic papers and use this metadata.
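As an illustration, here is a minimal harvesting sketch; the endpoint URL is a placeholder, since every repository publishes its own, but the request and the Dublin Core response format are defined by the protocol:

```python
# Sketch: fetch one page of records from an OAI-PMH repository.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = "https://example.org/oai"  # placeholder; use a real repository URL
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

with urllib.request.urlopen(ENDPOINT + "?" + urllib.parse.urlencode(params)) as resp:
    tree = ET.parse(resp)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

# Each record carries Dublin Core metadata: title, creator, language, etc.
for record in tree.iterfind(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    lang = record.find(".//dc:language", ns)
    if title is not None:
        print(lang.text if lang is not None else "?", "-", title.text)

# Responses are paged: follow the resumptionToken element to get the rest.
```

Filtering on dc:language would be a first step towards keeping only the German papers.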

Among the most popular ones (Google Scholar, Scirus, OAIster), a few seem to deal with a lot of German texts: Scientific Commons (St. Gallen, CH) and BASE (Bielefeld).

I read an interesting article today about search engines in this particular field: Dirk Pieper and Sebastian Wolf (University Library of Bielefeld), "Wissenschaftliche Dokumente in Suchmaschinen" ["Scientific documents in search engines"], in Handbuch Internet-Suchmaschinen, D. Lewandowski (ed.), Heidelberg, 2009. PDF version here.

I could crawl the result pages of a given website and see what I get. I’ll see what I can do.
