On the interest of social media corpora

Introduction

The need to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, and technological issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the research community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, season 20, episode 6), a schoolgirl performs an “emoji analysis” to gather information about the author of troll messages. Based on the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult.
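As an aside, here is what such an “emoji analysis” could look like in a few lines of Python: build emoji frequency profiles for two sets of messages and compare them with cosine similarity. Everything below (the emoji regex, the sample messages) is purely illustrative and not taken from the episode or from any actual forensic tool.

```python
import re
from collections import Counter
from math import sqrt

# Rough emoji ranges; real emoji detection is more involved (assumption)
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def emoji_profile(messages):
    """Count emoji occurrences over a set of messages."""
    return Counter(e for msg in messages for e in EMOJI.findall(msg))

def cosine(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[k] * q[k] for k in set(p) | set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

troll = emoji_profile(["😈😈🔥", "🔥💀"])
pupil = emoji_profile(["😺🌈", "🌈✨✨"])
print(cosine(troll, pupil))  # low value: the profiles barely overlap
```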

Workshop

I recently attended a workshop organized by the H2020 project CLARIN-PLUS on this topic and wrote about it on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”

Ethical remark

In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or which enfold ...

more ...

Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I gave a talk about a paper of mine, Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources, and presented another paper written with Felix Bildhauer and Roland Schäfer, Focused Web Corpus Crawling.

Summary

The comparison started from web crawling experiments I performed at the FU Berlin. The conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easy to gather that way until search engine companies restricted this kind of access, meaning that one now has to pay and/or wait longer to send queries.

I tried to evaluate the leading approach and to find decent substitutes using social networks as well as the Open Directory Project and Wikipedia. I took four different languages (Dutch, French, Indonesian and Swedish) as examples in order to compare several web spaces with different if not opposed characteristics.
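To give an idea of what the scouting step can look like, here is a minimal Python sketch, not the pipeline from the paper: it harvests outbound links from a given page (for instance a Wikipedia article in the target language) and keeps at most one URL per host, since host diversity matters for the quality of the crawl frontier.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("http"):
                self.links.append(href)

def candidate_seeds(page_url, max_per_host=1):
    """Return outbound links, capped per host to favor host diversity."""
    req = Request(page_url, headers={"User-Agent": "seed-scout/0.1"})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkCollector()
    parser.feed(html)
    per_host, seeds = {}, []
    for url in parser.links:
        host = urlparse(url).netloc
        if host and per_host.get(host, 0) < max_per_host:
            per_host[host] = per_host.get(host, 0) + 1
            seeds.append(url)
    return seeds

# Example: external links from a Dutch Wikipedia article (hypothetical entry point)
print(candidate_seeds("https://nl.wikipedia.org/wiki/Taal")[:10])
```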

My results show no clear winner; complementary approaches are called for, and it seems possible to replace or at least to complement the existing BootCaT approach. I think that crawling problems such as link/host diversity have not ...

more ...

A one-pass valency-oriented chunker for German

I recently introduced at the LTC‘13 conference a tool I developed to help perform fast text analysis on web corpora: a one-pass valency-oriented chunker for German.

Motivation

“It turns out that topological fields together with chunked phrases provide a solid basis for a robust analysis of German sentence structure.”
E. W. Hinrichs, “Finite-State Parsing of German”, in Inquiries into Words, Constraints and Contexts, A. Arppe et al. (eds.), Stanford: CSLI Publications, pp. 35–44, 2005.

Abstract

Non-finite-state parsers provide fine-grained information, but they are computationally demanding, so it is interesting to see how far a shallow parsing approach can go.

The transducer described here consists of a pattern-based matching operation over POS tags using regular expressions, which takes advantage of the characteristics of German grammar. The process aims at finding linguistically relevant phrases with good precision, which in turn enables an estimation of the actual valency of a given verb.

The chunker reads its input exactly once instead of using cascades, which greatly benefits computational efficiency.
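To make the idea more concrete, here is a toy illustration of regex matching over POS tags (my own sketch, much simpler than the actual chunker, and assuming the STTS tagset): the tag sequence is serialized into a single string, so one pass of a noun chunk pattern is enough.

```python
import re

# (token, STTS tag) pairs such as a tagger would output
tagged = [("die", "ART"), ("kleine", "ADJA"), ("Katze", "NN"),
          ("schläft", "VVFIN"), ("auf", "APPR"), ("dem", "ART"),
          ("Sofa", "NN")]

# Serialize the tag sequence so a single regex pass over one string suffices
tags = " ".join(tag for _, tag in tagged)

# Noun chunk: optional article, any number of attributive adjectives, a noun
np_pattern = re.compile(r"\b(?:ART )?(?:ADJA )*NN\b")

for m in np_pattern.finditer(tags):
    start = tags[:m.start()].count(" ")    # token index where the match begins
    length = m.group().count(" ") + 1      # number of tags matched
    print(" ".join(tok for tok, _ in tagged[start:start + length]))
    # prints "die kleine Katze" and "dem Sofa"
```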

This finite-state chunking approach does not return a tree structure, but rather yields various kinds of linguistic information useful to the language researcher: possible applications include ...

more ...

What is good enough to become part of a web corpus?

I recently worked at the FU Berlin with Roland Schäfer and Felix Bildhauer on issues related to web corpora. One of them deals with corpus construction: as a matter of fact, web documents can be very different from one another, and even after proper cleaning it is not rare to find content that could hardly be qualified as text. While there are indubitably clear cases, such as lists of addresses or tag clouds, it is not always obvious how extensive the notions of text and corpus should be. What’s more, a certain number of documents are simply too close to call. Nonetheless, this issue has to be addressed, since even a no-decision policy has consequences: certain linguistic phenomena become, more or less accidentally, over- or underrepresented in the final corpus. That is why we believe that linguists and “end users” in general should be aware of these technicalities.
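To give a feel for the problem, here is a deliberately simplistic Python sketch, unrelated to the measures discussed in the article below: a few surface signals that tend to separate running text from link lists or tag clouds. The set of function words is a made-up sample, and punctuation attached to tokens makes the alpha ratio a crude signal at best.

```python
import re

def text_signals(text):
    """Crude surface signals: running text vs. link lists or tag clouds."""
    tokens = re.findall(r"\S+", text)
    if not tokens:
        return {}
    # Assumption: a tiny sample of frequent German function words
    function_words = {"der", "die", "das", "und", "in", "zu", "mit", "auf"}
    sentences = max(1, len(re.findall(r"[.!?]", text)))
    return {
        "alpha_ratio": sum(t.isalpha() for t in tokens) / len(tokens),
        "function_word_ratio": sum(t.lower() in function_words for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / sentences,
    }

print(text_signals("Die Katze schläft auf dem Sofa. Sie träumt von Vögeln."))
print(text_signals("Home | Login | Kontakt | Impressum | AGB"))
```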

The Good, the Bad, and the Hazy

In an article to be published in the proceedings of the 8th Web as Corpus Workshop, The Good, the Bad, and the Hazy: Design Decisions in Web Corpus Construction, we show that text quality is not always easy to assess ...

more ...