Introduction

The study of language use in computer-mediated communication (CMC) is of general interest, as online communication is ubiquitous and raises a range of ethical, sociological, technological and technoscientific issues for the general public. The importance of linguistic studies of CMC is acknowledged beyond the research community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, season 20, episode 6), a schoolgirl performs “emoji analysis” to get information on the author of troll messages. Based on the distribution of emojis, she concludes that the author cannot be the suspected primary school student and has to be an adult.

Workshop

I recently attended a workshop on this topic organized by the H2020 project CLARIN-PLUS, and wrote a post about it on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”

Ethical remark

In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns about the human lives and interactions that are captured by, hidden in, or unfold beyond the data. The debate within the research community (especially among publicly funded institutions) is all the more necessary since corporations whose business model rests on the ongoing collection and storage of information made publicly available or propagated through a social network are not likely to voice concerns about these digital aggregates.

Relevant work published so far

I have so far been involved with computer-mediated communication in several ways: data gathering from social networks (defunct ones like identi.ca or FriendFeed, but also Twitter and Reddit), indexing (selection of relevant sources, conversion to XML TEI, dashboard for queries and visualizations), and linguistic analysis (particular or standard style, comparison to reference corpora, out-of-vocabulary tokens).

Research work

On this blog (in chronological order)

Code released under open-source licenses