On the interest of social media corpora

Introduction

The study of language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the research community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, season 20, episode 6), a schoolgirl performs “emoji analysis” to gather information about the author of troll messages. Based on the distribution of emojis, she concludes that the author cannot be the suspected primary school student and has to be an adult.

Workshop

I recently attended a workshop on this topic organized by the H2020 project CLARIN-PLUS, and I wrote a post about it on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”

Ethical remark

In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example through indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or which unfold beyond the data. This debate within the research community (especially among publicly funded institutions) is all the more necessary since corporations whose business model relies on the ongoing collection and storage of information made publicly available or propagated through a social network are unlikely to voice concerns about these digital aggregates.

Relevant work published so far

I have so far been involved with computer-based communication in several ways: data gathering from social networks (defunct ones like identi.ca or FriendFeed but also Twitter and Reddit), indexing (selection of relevant sources, conversion to XML TEI, dashboard for queries and visualizations), and linguistic analysis (particular or standard style, comparison to reference corpora, out-of-vocabulary tokens).
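To make the last point more concrete, here is a minimal sketch of how out-of-vocabulary tokens could be measured against a reference corpus. It is only an illustration and not the method used in the papers listed below: the function names (`tokenize`, `oov_profile`), the naive regex tokenizer and the toy sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# Naive word tokenizer (an assumption for this sketch; real CMC data
# would need a tokenizer aware of emoticons, hashtags, URLs, etc.)
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def oov_profile(cmc_texts, reference_texts):
    """Return the OOV rate of the CMC texts and a count of each OOV token.

    A token is out-of-vocabulary if it never occurs in the reference texts.
    """
    reference_vocab = {tok for text in reference_texts for tok in tokenize(text)}
    oov_counts = Counter()
    total = 0
    for text in cmc_texts:
        for tok in tokenize(text):
            total += 1
            if tok not in reference_vocab:
                oov_counts[tok] += 1
    rate = sum(oov_counts.values()) / total if total else 0.0
    return rate, oov_counts


# Toy data standing in for a reference corpus and a CMC sample
reference = ["The weather is nice today.", "I will see you tomorrow."]
cmc = ["omg the weather is sooo nice today", "c u tomorrow lol"]

rate, oov = oov_profile(cmc, reference)
print(f"OOV rate: {rate:.2f}")
print(oov.most_common())
```

On real data the reference vocabulary would come from a large standard-language corpus, and the most frequent out-of-vocabulary tokens would then be inspected by hand to tell apart typos, creative spellings and genuinely new words.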

Research work

Barbaresi A. Crawling microblogging services to gather language-classified URLs. Workflow and case study, Proceedings of ACL SRW, pp. 9-15, 2013.
Barbaresi A. Finding viable seed URLs for web corpora: A scouting approach and comparative study of available sources, Proceedings of 9th Web as Corpus Workshop (WaC-9), pp. 1-8, 2014.
Barbaresi A., Würzner K.-M. For a fistful of blogs: Discovery and comparative benchmarking of republishable German content, KONVENS 2014, NLP4CMC workshop proceedings, Hildesheim University Press, pp. 2-10, 2014.
Barbaresi A. Collection, Description, and Visualization of the German Reddit Corpus, Proceedings of 2nd Workshop on Natural Language Processing for Computer-Mediated Communication, pp. 7-11, 2015.
Barbaresi A. Collection and Indexing of Tweets with a Geographical Focus, Proceedings of the 4th Workshop on Challenges in the Management of Large Corpora (CMLC), ELRA, pp. 24-27, 2016.

On this blog (in chronological order)

Code released under open-source licenses
