At the LTC'13 conference, I recently introduced a tool I developed to help perform fast text analysis on web corpora: a one-pass valency-oriented chunker for German.


“It turns out that topological fields together with chunked phrases provide a solid basis for a robust analysis of German sentence structure.” E. W. Hinrichs, “Finite-State Parsing of German”, in Inquiries into Words, Constraints and Contexts, A. Arppe et al. (eds.), Stanford: CSLI Publications, pp. 35–44, 2005.


Non-finite-state parsers provide fine-grained information, but they are computationally demanding, so it is worth seeing how far a shallow parsing approach can go.

The transducer described here matches patterns over sequences of POS tags using regular expressions, taking advantage of the characteristics of German grammar. The process aims at finding linguistically relevant phrases with good precision, which in turn makes it possible to estimate the actual valency of a given verb.
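To make the idea of regular expressions over POS tags more concrete, here is a minimal sketch in Python. It is not the pattern used in the actual chunker: the tag names follow the STTS tagset (ART, ADJA, NN, NE, APPR), and the `NP` and `CHUNK` patterns as well as the `chunk` function are hypothetical simplifications chosen for illustration.

```python
import re

# Hypothetical noun-phrase pattern over STTS tags:
# optional article, any number of attributive adjectives, then a noun.
NP = r"(?:ART )?(?:ADJA )*(?:NN|NE) "

# A prepositional phrase is a preposition followed by a noun phrase.
# The lookbehind makes sure a match starts at the beginning of a tag.
CHUNK = re.compile(rf"(?<!\S)(?:(?P<PP>APPR {NP})|(?P<NP>{NP}))")


def chunk(tagged):
    """Return (label, words) pairs for a list of (word, tag) tuples."""
    # Encode the sentence as a string of tags, each followed by one space.
    tags = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for m in CHUNK.finditer(tags):
        # Spaces before the match give the index of the first token.
        start = tags.count(" ", 0, m.start())
        # Spaces inside the match give the number of tokens it covers.
        length = m.group().count(" ")
        words = [word for word, _ in tagged[start:start + length]]
        chunks.append((m.lastgroup, words))
    return chunks


sentence = [
    ("Der", "ART"), ("Hund", "NN"), ("sieht", "VVFIN"),
    ("die", "ART"), ("kleine", "ADJA"), ("Katze", "NN"),
    ("in", "APPR"), ("dem", "ART"), ("Garten", "NN"),
]
print(chunk(sentence))
```

On this toy sentence the sketch finds two noun phrases and one prepositional phrase. Real web text would need a far richer pattern (pronouns, coordination, participles and so on).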

The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency.
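The single-pass idea can be sketched as a small state machine that looks at each tag exactly once, where a cascade would run several stages over the same input. This is an illustration under stated assumptions, not the published implementation: the function names `one_pass_chunks` and `crude_valency`, the reduced STTS tag inventory, and the rule of counting noun phrases as candidate verb arguments are all hypothetical.

```python
def one_pass_chunks(tags):
    """Read a list of STTS tags once, left to right, and return chunk labels."""
    chunks = []
    current = None  # label of the chunk currently open, or None
    for tag in tags:
        if tag == "APPR":
            current = "PP"  # a preposition opens a prepositional phrase
        elif tag in ("ART", "ADJA"):
            if current is None:
                current = "NP"  # an article or adjective opens a noun phrase
        elif tag in ("NN", "NE"):
            chunks.append(current or "NP")  # the head noun closes the chunk
            current = None
        else:
            current = None  # any other tag abandons an open chunk
    return chunks


def crude_valency(tags):
    """Assumption: count noun phrases outside PPs as candidate arguments."""
    return sum(1 for label in one_pass_chunks(tags) if label == "NP")


tags = ["ART", "NN", "VVFIN", "ART", "ADJA", "NN", "APPR", "ART", "NN"]
print(one_pass_chunks(tags))
print(crude_valency(tags))
```

Since every tag is inspected once and only a single state variable is kept, the running time grows linearly with the length of the input.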

This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher: possible applications include simulation of text comprehension on the syntactical level, creation of selective benchmarks and failure analysis.


This figure shows a simplified version of the pattern used, for illustration purposes:

Slide from the talk

For more information

A. Barbaresi, “A one-pass valency-oriented chunker for German”, in Human Language Technologies as a Challenge for Computer Science and Linguistics, Proceedings of the 6th Language & Technology Conference, Zygmunt Vetulani and Hans Uszkoreit (eds.), pp. 157-161, Poznan, 2013.

Article and slides are available here: http://halshs.archives-ouvertes.fr/halshs-00919397

A proof of concept is available on GitHub: https://github.com/adbar/valency-oriented-chunker

Selected references on shallow parsing

  • Abney, S. P., 1991. “Parsing by chunks”. Principle-based parsing, 44:257–278.
  • Barbaresi, A., 2011. “Approximation de la complexité perçue, méthode d’analyse”. In Actes TALN’2011/RECITAL, pp. 229-234.
  • Hobbs, J. R., Appelt, D., Bear, J., Israel, D., Kameyama, M., Stickel, M., and Tyson, M., 1997. “FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text”. In Finite-State Language Processing, pp. 383–406.
  • Kermes, H. and Evert, S., 2002. “YAC – A Recursive Chunker for Unrestricted German Text”. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, vol. 5.
  • Neumann, G., Backofen, R., Baur, J., Becker, M., and Braun, C., 1997. “An Information Extraction Core System for Real World German Text Processing”. In Proceedings of the Fifth Conference on Applied Natural Language Processing. Association for Computational Linguistics.
  • Pereira, F., 1990. “Finite-state approximations of grammars”. In Proceedings of the Annual Meeting of the ACL.
  • Riloff, E. and Phillips, W., 2004. “An Introduction to the Sundance and AutoSlog Systems”. Technical report, School of Computing, University of Utah.
  • Schiehlen, M., 2003. “A Cascaded Finite-State Parser for German”. In Proceedings of the 10th conference of the EACL, vol. 2.
  • Voss, M. J., 2005. “Determining syntactic complexity using very shallow parsing”. Master’s thesis, CASPR, Artificial Intelligence Center, University of Georgia.