Bits of Language: corpus linguistics, NLP and text analytics

A one-pass valency-oriented chunker for German

At the LTC’13 conference, I recently introduced a tool I developed to help perform fast text analysis on web corpora: a one-pass valency-oriented chunker for German.

Motivation

“It turns out that topological fields together with chunked phrases provide a solid basis for a robust analysis of German sentence structure.” E. W. Hinrichs, “Finite-State Parsing of German”, in Inquiries into Words, Constraints and Contexts, A. Arppe et al. (eds.), Stanford: CSLI Publications, pp. 35–44, 2005.

Abstract

Non-finite-state parsers provide fine-grained information, but they are computationally demanding, so it is interesting to see how far a …


Building a topic-specific corpus out of two different corpora

I have two corpora (say, I crawled two websites) which sometimes focus on the same topics. I would like to try to merge them in order to build a balanced and coherent corpus. As this is a much-discussed research topic, there are plenty of subtle ways to do it.

Still, as I am only at the beginning of my research and don’t know how far I am going to go with both corpora, I want to keep it simple.

One of the appropriate techniques (if not the best)

I could do …
