A one-pass valency-oriented chunker for German

At the LTC'13 conference, I recently presented a tool I developed to support fast text analysis of web corpora: a one-pass valency-oriented chunker for German.


“It turns out that topological fields together with chunked phrases provide a solid basis for a robust analysis of German sentence structure.” E. W. Hinrichs, “Finite-State Parsing of German”, in Inquiries into Words, Constraints and Contexts, A. Arppe et al. (eds.), Stanford: CSLI Publications, pp. 35–44, 2005.


Parsers that are not finite-state provide fine-grained information, but they are computationally demanding, so it is worth seeing how far a shallow parsing approach can go.

The transducer described here performs pattern-based matching on POS tags, using regular expressions that take advantage of the characteristics of German grammar. The process aims to find linguistically relevant phrases with good precision, which in turn makes it possible to estimate the actual valency of a given verb.

The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency.
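As a rough illustration of this kind of single-pass matching over POS tags, here is a minimal Python sketch. It is not the chunker presented at the conference: the STTS tag names, the `CHUNK` pattern, the `chunk` function and the example sentence are all assumptions made for the example.

```python
import re

# Hypothetical pattern over STTS-style tags (an assumption, not the tool's rules).
# Every tag in the input string is followed by exactly one space.
CHUNK = re.compile(
    r"\b(?:(?P<prep>APPR(?:ART)?) )?"   # optional preposition: marks a PP
    r"(?:(?:ART|PPOSAT) )?"             # optional article or possessive
    r"(?:ADJA )*"                       # any number of attributive adjectives
    r"(?:N[NE] )+"                      # one or more common or proper nouns
)

def chunk(tagged):
    """Return (label, words) chunks found in a single scan of the tag string."""
    tags = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for m in CHUNK.finditer(tags):
        # The number of spaces before an offset equals the number of tags before it.
        first = tags.count(" ", 0, m.start())
        last = tags.count(" ", 0, m.end())
        label = "PP" if m.group("prep") else "NP"
        chunks.append((label, [word for word, _ in tagged[first:last]]))
    return chunks

# Invented example sentence with hand-assigned tags.
SENTENCE = [
    ("Der", "ART"), ("alte", "ADJA"), ("Mann", "NN"), ("gibt", "VVFIN"),
    ("dem", "ART"), ("Kind", "NN"), ("ein", "ART"), ("Buch", "NN"),
    ("in", "APPR"), ("der", "ART"), ("Schule", "NN"), (".", "$."),
]

for label, words in chunk(SENTENCE):
    print(label, " ".join(words))
```

On this sentence the sketch finds three noun phrases and one prepositional phrase around the finite verb "gibt". Counting such chunks is one crude way to approximate how many arguments occur with a verb.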

This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher: possible applications include …


Building a topic-specific corpus out of two different corpora

I have two corpora (say, I crawled two websites) that sometimes cover the same topics. I would like to merge them in order to build a balanced and coherent corpus. As this is a much-discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research and don’t know how far I am going to go with both corpora, I want to keep it simple.

One of the appropriate techniques (if not the best)

I could do it using LSA (in this case latent semantic analysis, not lysergic acid amide!) or, to be more precise, latent semantic indexing.

As this technical report shows, it can perform well in this kind of setting: Comparison of Human and Latent Semantic Analysis (LSA) Judgements of Pairwise Document Similarities for a News Corpus, B. Pincombe, Australian Department of Defence, 2004. (full text available here or through any good search engine, see previous post)

This could be an issue for later research.
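To make the idea concrete, here is a minimal sketch of LSA-based document similarity in Python with NumPy. It only illustrates the general technique: the toy documents, the raw term counts (no weighting), the choice of k = 2 and the function names `lsa_vectors` and `cosine` are assumptions made for the example.

```python
import numpy as np

def lsa_vectors(docs, k=2):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-document matrix of raw counts (rows: terms, columns: documents).
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            counts[index[w], j] += 1
    _, s, vt = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    # One row per document: right singular vectors scaled by singular values.
    return vt[:k].T * s[:k]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy documents standing in for texts from the two corpora (invented).
docs = [
    "bank loan interest money",   # corpus A, finance
    "money bank credit loan",     # corpus B, finance
    "forest tree leaf wood",      # corpus A, nature
    "tree wood forest bark",      # corpus B, nature
]
vectors = lsa_vectors(docs, k=2)
print(round(cosine(vectors[0], vectors[1]), 2))  # same topic, different corpora
print(round(cosine(vectors[0], vectors[2]), 2))  # different topics
```

Documents from both corpora whose similarity exceeds some threshold could then be grouped under a common topic. Where to put that threshold is an open choice that this sketch does not address.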

The approach that I am working on (not quick and dirty but simpler and hopefully robust)

For exercise …
