On the creation and use of social media resources

Emoji analysis

The necessity to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the researcher community, for example in forensic science, as evidence can be found online and traced back to its author. In a South Park episode (“Fort Collins”, episode 6 season 20), a school girl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult. Although the background story seems somewhat far-fetched, as often with South Park, the logic of the analysis is sound.
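
For the record, this kind of analysis is easy to sketch: it essentially amounts to counting which emojis a given author uses and how often, and then comparing the resulting distributions. The toy example below is mine; it uses the Unicode category 'So' as a rough approximation of what counts as an emoji:

from collections import Counter
import unicodedata

def emoji_distribution(messages):
    """Count emoji-like characters (Unicode category 'So') across a list of messages."""
    counts = Counter()
    for message in messages:
        for char in message:
            # rough heuristic: most emojis fall under 'Symbol, other'
            if unicodedata.category(char) == 'So':
                counts[char] += 1
    return counts

# toy comparison between two (invented) authors
author_a = ['Nice try 😂😂', 'ok 😂👍']
author_b = ['See you at the meeting ☕', 'Done ✔']
print(emoji_distribution(author_a))
print(emoji_distribution(author_b))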

General impressions on research trends

I recently went to a workshop on computer-mediated communication and social media. I am impressed by the preponderant role of Twitter data, which is the focus of a significant number of researchers. This is an open field, with still much research to be done: there seems to be no clear or widely acknowledged methodology, and there are diverging approaches concerning ...

more ...

A module to extract date information from web pages

Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. To obtain this information, it is often necessary to fully parse the document or to apply robust scraping patterns: there are, for example, web pages for which neither the URL nor the server response provides a reliable way to date the document, that is, to find out when it was written.
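
By way of illustration (this is not the module's actual code), a first heuristic pass could look for date-bearing meta elements in the HTML head and fall back to a date pattern in the URL; the attribute names and regular expressions below are merely plausible assumptions:

import re
from lxml import html

# meta attributes which often carry a document date (assumed, non-exhaustive list)
DATE_KEYS = ('article:published_time', 'og:updated_time', 'date', 'dcterms.date')

def guess_date(html_string, url=''):
    """Heuristic sketch: look for a date in meta elements, then in the URL."""
    tree = html.fromstring(html_string)
    # 1. meta elements whose name/property hints at a publication date
    for meta in tree.iter('meta'):
        key = (meta.get('name') or meta.get('property') or '').lower()
        content = meta.get('content') or ''
        if key in DATE_KEYS:
            match = re.search(r'\d{4}-\d{2}-\d{2}', content)
            if match:
                return match.group(0)
    # 2. fallback: a YYYY/MM/DD pattern in the URL
    match = re.search(r'/(\d{4})/(\d{2})/(\d{2})/', url)
    if match:
        return '-'.join(match.groups())
    return None

print(guess_date('<html><head><meta name="date" content="2016-12-09"/></head></html>'))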

Context

I regularly work on improving the extraction methods for the web collections at my home institutions. They are unique as they combine both the quantity resulting from broad web crawling and the quality obtained by carefully extracting text and metadata as well as rejecting documents that do not match certain criteria. In that sense, I already published work on methods to derive metadata from web documents in order ...

more ...

On the interest of social media corpora

Introduction

The necessity to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the researcher community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, episode 6 season 20), a school girl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult.

Workshop

I recently attended a workshop on this topic, organized by the H2020 project CLARIN-PLUS. I wrote a post about it on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”.

Ethical remark

In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or which enfold ...

more ...

Ad hoc and general-purpose corpus construction from web sources

While the pervasiveness of digital communication is undeniable, the numerous traces left by users and customers are collected and used for commercial purposes. The creation of digital research objects should provide the scientific community with ways to access and analyze them. Particularly in linguistics, the diversity and quantity of texts present on the internet have to be better assessed in order to make current text corpora available, allowing for the description of the variety of language uses and ongoing changes. In addition, transferring the field of analysis from traditional written text corpora to texts taken from the web results in the creation of new tools and new observables. We must therefore provide the necessary theoretical and practical background to establish scientific criteria for research on these texts.

This is the subject of my PhD work, which was carried out under the supervision of Benoît Habert and led to a thesis entitled Ad hoc and general-purpose corpus construction from web sources, defended on June 19th, 2015 at the École Normale Supérieure de Lyon to obtain the degree of doctor of philosophy in linguistics.

Methodological considerations

At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics ...

more ...

Indexing text with ElasticSearch

The Lucene-based search engine Elasticsearch is fast and adaptable, so that it suits even the most demanding configurations, including large text corpora. I use it daily with tweets and have begun to release the scripts I use to do so. In this post, I give concrete tips for the indexation of text and for linguistic analysis.

Mapping

You do not need to define a type for the indexed fields, as the database can guess it for you; however, using a mapping speeds up the process and gives you more control. The official documentation is extensive, and it is sometimes difficult to get a general idea of how to parametrize indexation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html

Interesting options which are better specified before indexation include similarity scoring as well as term frequencies and positions.

Linguistic analysis

The string data type allows for the definition of the linguistic analysis to be used (or not) during indexation.

Elasticsearch ships with a series of language analysers which can be used for language-aware tokenization and indexation. Given a “text” field in German, here is where it happens in the mapping:

{
  "text": {
    "type" : "string",
    "index" : "analyzed",
    "analyzer" : "german"
  }
}
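
For the sake of completeness, here is a minimal sketch of how such a mapping could be sent with the official elasticsearch Python client; the index and type names are placeholders, the term_vector line shows one of the options mentioned above, and the call assumes a client matching the 2.x mapping syntax used here:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local instance on the default port

mapping = {
    "mappings": {
        "tweet": {  # document type, placeholder name
            "properties": {
                "text": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "german",
                    # store term positions and offsets for faster highlighting
                    "term_vector": "with_positions_offsets",
                },
            },
        },
    },
}

es.indices.create(index="tweets-de", body=mapping)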

Beyond that, it is possible to write ...

more ...

Collection and indexing of tweets with a geographical focus

This paper introduces a geographically focused Twitter corpus in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.
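
To make the selection step more concrete, here is a simplified sketch (a toy example of mine, not the code described in the paper) which filters a line-delimited JSON dump of tweets through a geographic bounding box, relying on the point coordinates attached to geolocated tweets:

import json

# rough bounding box as (lon_min, lat_min, lon_max, lat_max), placeholder values
BBOX = (5.9, 45.8, 10.5, 47.8)

def in_bbox(lon, lat, bbox=BBOX):
    return bbox[0] <= lon <= bbox[2] and bbox[1] <= lat <= bbox[3]

def filter_geolocated(path):
    """Yield tweets from a JSONL dump whose point coordinates fall inside the box."""
    with open(path) as infile:
        for line in infile:
            tweet = json.loads(line)
            coords = tweet.get('coordinates')
            # the Twitter API encodes points as [longitude, latitude]
            if coords and coords.get('type') == 'Point':
                lon, lat = coords['coordinates']
                if in_bbox(lon, lat):
                    yield tweet

# usage sketch: for tweet in filter_geolocated('tweets.jsonl'): index or count it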

Why Twitter?

“To do linguistics on texts is to do botany on a herbarium, and zoology on the remains of more or less well-preserved animals.”

“Faire de la linguistique sur des textes, c’est faire de la botanique sur un herbier, de la zoologie sur des dépouilles d’animaux plus ou moins conservées.”
Inaugural speech by Charles de Tourtoulon at the Académie des sciences, agriculture, arts et belles lettres, Aix-en-Provence, 1897. (For a detailed interpretation, see the introduction of my PhD thesis.)

Practical reasons

  • (Lui & Baldwin 2014)

    • A frontier area due to their dissimilarity with existing corpora
  • (Krishnamurthy et al. 2008)

    • Availability and ease of use
    • Immediacy of the information presented
    • Volume and variability of the data contained
    • Presence of geolocated messages

My 2013 study concerning other social networks (Crawling microblogging services to gather language-classified URLs ...

more ...

Distant reading and text visualization

A new paradigm in “digital humanities” (you know, that Silicon Valley of textual studies geared towards a neoliberal narrowing of research; a highly provocative but nonetheless interesting read) resides in the belief that understanding language (e.g. literature) is not accomplished by studying individual texts, but by aggregating and analyzing massive amounts of data (Jockers 2013). Because it is impossible for individuals to “read” everything in a large corpus, advocates of distant reading employ computational techniques to “mine” the texts for significant patterns and then use statistical analysis to make statements about those patterns (Wulfman 2014).

One of the first attempts to apply visualization techniques to texts was the “shape of Shakespeare” by Rohrer (1998). Clustering methods were used to let sets emerge from textual data as well as metadata, not only in the humanities but also in the case of Web genres (Bretan, Dewe, Hallberg, Wolkert, & Karlgren, 1998). It may seem rudimentary by today’s standards, or far from being a sophisticated “view” on literature, but the “distant reading” approach is precisely about seeing the texts from another perspective and exploring the corpus interactively. Other examples of text mining approaches enriching visualization techniques include the document atlas of ...

more ...

Foucault and the spatial turn

I would like to share a crucial text by Michel Foucault which I discovered through a recent article by Marko Juvan on geographical information systems (GIS) and literary analysis:

  • Juvan, Marko (2015). From Spatial Turn to GIS-Mapping of Literary Cultures. European Review, 23(1), pp. 81-96.
  • Foucault, Michel (1984). Des espaces autres. Hétérotopies. Architecture, Mouvement, Continuité, 5, pp. 46-49. Originally a lecture given at the Cercle d’études architecturales, 14 March 1967.

The full text, including the translation I am quoting from, is available on foucault.info; in paper form, it can be found somewhere in Dits et écrits. If I understand correctly, the translation is by Jay Miskowiec (see this website). The text is an absolute bootleg, since it originated as a lecture and was never officially intended for publication. Still, Foucault’s prose is as dense as usual and there is much to learn from it. In the course of time, it has become a central text of the so-called “spatial turn”, which is generally acknowledged to have been introduced by Foucault and Lefebvre in the 1960s and 70s.

In the opening of the text, comparing the 20th with the 19th century, Foucault comes to the idea that our time is one of ...

more ...

Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple steps, as soon as exhaustive conversions and/or quality control is needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.
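
To give an idea of the target format before going into the details, here is a deliberately simplified sketch (not the code discussed below) which parses an HTML snippet and rebuilds a bare-bones TEI skeleton around its paragraphs:

from lxml import etree, html

TEI_NS = 'http://www.tei-c.org/ns/1.0'

def html_to_tei(html_string, title='Document'):
    """Parse an HTML string and wrap its paragraphs in a minimal TEI skeleton."""
    doc = html.fromstring(html_string)
    # root element in the TEI namespace
    tei = etree.Element('{%s}TEI' % TEI_NS, nsmap={None: TEI_NS})
    header = etree.SubElement(tei, '{%s}teiHeader' % TEI_NS)
    filedesc = etree.SubElement(header, '{%s}fileDesc' % TEI_NS)
    titlestmt = etree.SubElement(filedesc, '{%s}titleStmt' % TEI_NS)
    etree.SubElement(titlestmt, '{%s}title' % TEI_NS).text = title
    body = etree.SubElement(etree.SubElement(tei, '{%s}text' % TEI_NS), '{%s}body' % TEI_NS)
    # keep the paragraphs only, dropping the markup inside them
    for par in doc.iter('p'):
        etree.SubElement(body, '{%s}p' % TEI_NS).text = par.text_content()
    return etree.tostring(tei, pretty_print=True, encoding='unicode')

print(html_to_tei('<html><body><p>Hello <b>world</b></p></body></html>'))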

Installation

A comfortable installation is apt-get install python-lxml on Debian/Ubuntu, but the underlying packages may be old. The more pythonic way would be to make sure all the necessary libraries are installed (something like apt-get install libxml2-dev libxslt1-dev python-dev), and then to use a package manager such as pip: pip install lxml.

Parsing HTML

Here are the modules required for basic manipulation:

from __future__ import print_function
from lxml import etree, html
from StringIO import StringIO  # Python 2; on Python 3 this would be io.StringIO

And here is how to read a file, supposing it contains valid Unicode (which is not necessarily the case). The StringIO buffering is probably not the most direct way, but I found ...

more ...

Analysis of the German Reddit corpus

I would like to present work on the major social bookmarking and microblogging platform Reddit, which I recently introduced at the NLP4CMC workshop 2015. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.

Basic idea

The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on the platform archive.org at the beginning of July 2015 and claimed that it contained every publicly available comment.

Corpus construction

In order to focus on German comments, I use a two-tiered filter designed to strike a hopefully good balance between speed and accuracy. The first filter uses a spell-checking algorithm (provided by the enchant library), and the second resides in my language identification tool of choice, langid.py.
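
A simplified sketch of such a two-tiered filter is shown below; the dictionary name, the threshold and the confirmation step are illustrative assumptions, not the exact settings used for the corpus:

import enchant   # spell-checking backend (requires a German dictionary to be installed)
import langid    # language identification

GERMAN = enchant.Dict('de_DE')

def looks_german(text, threshold=0.5):
    """Tier 1: cheap spell-checking pass, keep texts with enough German words."""
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return False
    hits = sum(1 for w in words if GERMAN.check(w))
    return float(hits) / len(words) >= threshold

def is_german(text):
    """Tier 2: confirm the remaining candidates with langid.py."""
    if not looks_german(text):
        return False
    lang, score = langid.classify(text)
    return lang == 'de'

comments = ['Das ist ein interessanter Kommentar.', 'This one is clearly English.']
german_comments = [c for c in comments if is_german(c)]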

The corpus is comparatively small (566,362 tokens), due to the fact that Reddit is almost exclusively an English-speaking platform. The number of tokens tagged as proper nouns (NE) is particularly high (14.4%), which illustrates the perplexity of the tagging tool itself, for example because redditors refer to trending and possibly short-lived notions and celebrities ...

more ...