Ad hoc and general-purpose corpus construction from web sources

Digital communication is pervasive, and the numerous traces left by users, who are also customers, are collected and used for commercial purposes. The creation of digital research objects should give the scientific community ways to access and analyze these traces. In linguistics in particular, the diversity and quantity of texts on the internet need to be better assessed in order to make up-to-date text corpora available, allowing for the description of the variety of language uses and ongoing changes. In addition, transferring the field of analysis from traditional written text corpora to texts taken from the web results in the creation …

more ...

Bibliography

Work in progress towards a page listing (web) corpus linguistics references and resources.

Summary

Corpus Linguistics and Corpus Building

  • The Routledge Handbook of Corpus Linguistics, 1 ed., O’Keeffe, A. and McCarthy, M., Eds., London, New York: Routledge, 2010.
  • N. Bubenhofer, Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge, Zürich, 2009.
  • S. Loiseau, “Corpus, quantification et typologie textuelle”, Syntaxe et sémantique, vol. 9, pp. 73-85, 2008.
  • C. Draxler, Korpusbasierte Sprachverarbeitung, Günter Narr, 2008.
  • M. Cori, “Des méthodes de traitement automatique aux linguistiques fondées sur les corpus”, Langages, vol. 171, iss. 3, pp. 95-110 …
more ...

Collection and indexing of tweets with a geographical focus

This paper introduces a geographically focused Twitter corpus in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.
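
To make the idea of selecting tweets for a given region more concrete, here is a minimal sketch in Python. It is not the method used in the paper: the bounding box values, the record layout (a dict with an optional "coordinates" pair of longitude and latitude) and the function names are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical bounding box (roughly Austria), chosen for illustration only.
BBOX = {"west": 9.5, "south": 46.3, "east": 17.2, "north": 49.1}


def in_bbox(lon: float, lat: float, bbox: Dict[str, float] = BBOX) -> bool:
    """Return True if the point lies inside the rectangular region."""
    return (bbox["west"] <= lon <= bbox["east"]
            and bbox["south"] <= lat <= bbox["north"])


def filter_by_region(tweets: Iterable[dict],
                     bbox: Dict[str, float] = BBOX) -> Iterator[dict]:
    """Yield only the records whose coordinates fall inside the region.

    Records without coordinates are skipped, since they cannot be
    placed geographically with this simple approach.
    """
    for tweet in tweets:
        coords = tweet.get("coordinates")
        if coords is None:
            continue
        lon, lat = coords
        if in_bbox(lon, lat, bbox):
            yield tweet


# Made-up sample records (assumed layout, not an actual API response).
sample = [
    {"id": 1, "text": "Servus aus Wien", "coordinates": (16.37, 48.21)},
    {"id": 2, "text": "Hello from London", "coordinates": (-0.13, 51.51)},
    {"id": 3, "text": "no location"},
]
kept_ids = [t["id"] for t in filter_by_region(sample)]
```

A rectangular box is only a rough approximation of a region; a real collection process would probably need finer boundaries and a way to handle messages without explicit coordinates.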

Why Twitter?

“To do linguistics on texts is to do botany on a herbarium, and zoology on the remains of more or less well-preserved animals.”

Faire de la linguistique sur des textes, c’est faire de la …

more ...

Analysis of the German Reddit corpus

I would like to present work on the major social bookmarking and microblogging platform Reddit, which I recently introduced at the NLP4CMC workshop 2015. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.

Basic idea

The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on the platform archive.org at the beginning of July 2015 and claimed that it contains every publicly available comment.

Corpus construction

In order to focus on …

more ...

Review of the Czech internet corpus

Web for “old school” balanced corpus

The Czech internet corpus (Spoustová and Spousta 2012) is a good example of focused web corpora built in order to gather an “old school” balanced corpus encompassing different genres and several text types.

The crawled websites are not selected automatically or at random but according to the linguists’ expert knowledge: the authors mention their “knowledge of the Czech Internet” and their experience with “web site popularity”. The whole process as well as the target websites are described as follows:

We have chosen to begin with manually selecting, crawling and cleaning particular web sites with …

more ...

2nd release of the German Political Speeches Corpus

Last Monday, I released an updated version of both corpus and visualization tool on the occasion of the DGfS-CL Poster-Session in Frankfurt, where I presented a poster (in German).

The first version was made available last summer and mentioned on this blog, cf. this post: Introducing the German Political Speeches Corpus and Visualization Tool.

For stability, the resource is available at this permanent redirect: http://purl.org/corpus/german-speeches

Description

In case you don’t remember it or never heard of it, here is a brief description:

The resource presented here consists of speeches by the last German …

more ...

Introducing the German Political Speeches Corpus and Visualization Tool

I am currently working on a resource I would like to introduce: the German Political Speeches Corpus (no acronym apart from GPS). It consists of speeches by the last German Presidents and Chancellors as well as a few ministers, all gathered from official sources.

As far as I know, no such corpus has been publicly available for German so far. Until today, most of the speeches could not be found on Google (which is bound to change). The corpus can be freely republished.

The two main corpora (Presidency and Chancellery) are released in XML format, based on raw text and metadata.
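
As a rough illustration of what combining raw text and metadata into XML can look like, here is a minimal Python sketch. It does not reproduce the actual schema of the corpus: the element names ("text", "p"), the metadata keys and the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def speech_to_xml(raw_text: str, metadata: Dict[str, str]) -> str:
    """Wrap a raw speech and its metadata in a small XML document.

    Metadata become attributes of the root element; paragraphs are
    assumed to be separated by blank lines in the raw text.
    """
    root = ET.Element("text", attrib=metadata)
    for block in raw_text.split("\n\n"):
        block = block.strip()
        if block:
            paragraph = ET.SubElement(root, "p")
            paragraph.text = block
    # encoding="unicode" makes tostring() return a str instead of bytes.
    return ET.tostring(root, encoding="unicode")


# Made-up example values, for illustration only.
example_xml = speech_to_xml(
    "Erster Absatz.\n\nZweiter Absatz.",
    {"speaker": "N.N.", "date": "2011-01-01", "title": "Beispielrede"},
)
parsed = ET.fromstring(example_xml)
```

Storing metadata as attributes keeps each speech self-describing, so that a query or visualization tool can filter by speaker or date without a separate index. Whether the released corpus does this in the same way is not stated here.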

There is a series of …

more ...

On Text Linguistics

When I discussed text complexity in my last post, I did not realize how important it is to take the framework of text linguistics into account. This branch of linguistics is well known in Germany but is not really treated as a field in its own right elsewhere. Most of the time, no distinction is made between text linguistics and discourse analysis, although the background is not necessarily the same.

I saw a presentation by Jean-Michel Adam last week, who describes himself as the “last of the Mohicans” to use this framework in French research. He drew a comprehensive picture of its origin …

more ...