The study of language use in computer-mediated communication (CMC) is of broad interest: online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues for the general public. The importance of linguistic studies on CMC is acknowledged beyond the research community, for example in forensic analysis, since evidence can be found online and traced back to its author.
In a South Park episode (“Fort Collins”, season 20, episode 6), a schoolgirl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult.
I recently attended a workshop on this topic organized by the H2020 project CLARIN-PLUS, and wrote about it on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”
In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or which unfold beyond the data. The debate within the research community (especially in publicly funded institutions) is all the more necessary because corporations whose business model relies on the ongoing collection and storage of information made publicly available or propagated through a social network are not likely to voice concerns about these digital aggregates.
Relevant work published so far
I have so far been involved with computer-based communication in several ways: data gathering from social networks (defunct ones like identi.ca or FriendFeed but also Twitter and Reddit), indexing (selection of relevant sources, conversion to XML TEI, dashboard for queries and visualizations), and linguistic analysis (particular or standard style, comparison to reference corpora, out-of-vocabulary tokens).
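As a rough illustration of what counting out-of-vocabulary tokens against a reference corpus can look like, here is a minimal sketch. It is not the method used in the publications listed here: the tokenizer is deliberately naive, and the names `tokenize`, `oov_rate` and `reference_vocab` as well as the sample sentences are assumptions made up for this example.

```python
import re


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def oov_rate(cmc_text, reference_vocab):
    """Return the share of tokens absent from the reference vocabulary,
    together with the list of those tokens in order of appearance."""
    tokens = tokenize(cmc_text)
    if not tokens:
        return 0.0, []
    oov = [t for t in tokens if t not in reference_vocab]
    return len(oov) / len(tokens), oov


# Toy reference vocabulary standing in for a real reference corpus.
reference_vocab = set(tokenize("this is a very good day and i am happy"))

# Toy CMC message with non-standard forms.
rate, oov = oov_rate("omg this is sooo good lol", reference_vocab)
print(rate, oov)
```

In practice the reference vocabulary would be derived from a large standard-language corpus, and tokenization would have to handle emoticons, hashtags and URLs, which this sketch ignores.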
- Barbaresi A. Crawling microblogging services to gather language-classified URLs. Workflow and case study, Proceedings of ACL SRW, pp. 9-15, 2013.
- Barbaresi A. Finding viable seed URLs for web corpora: A scouting approach and comparative study of available sources, Proceedings of 9th Web as Corpus Workshop (WaC-9), pp. 1-8, 2014.
- Barbaresi A., Würzner K.-M. For a fistful of blogs: Discovery and comparative benchmarking of republishable German content, KONVENS 2014, NLP4CMC workshop proceedings, Hildesheim University Press, pp. 2-10, 2014.
- Barbaresi A. Collection, Description, and Visualization of the German Reddit Corpus, Proceedings of 2nd Workshop on Natural Language Processing for Computer-Mediated Communication, pp. 7-11, 2015.
- Barbaresi A. Collection and Indexing of Tweets with a Geographical Focus, Proceedings of the 4th Workshop on Challenges in the Management of Large Corpora (CMLC), ELRA, pp. 24-27, 2016.
On this blog (in chronological order)
- Introducing the Microblog Explorer
- Challenges in web corpus construction for low-resource languages
- Guessing if a URL points to a WordPress blog
- Finding viable seed URLs for web corpora
- Analysis of the German Reddit corpus
- Collecting and indexing tweets with a geographical focus
- Indexing text with ElasticSearch