Collecting and indexing tweets with a geographical focus

This paper introduces a Twitter corpus focused geographically in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.

Why Twitter?

To do linguistics on texts is to do botanics on a herbarium, and zoology on remains of more or less well-preserved animals.”

Faire de la linguistique sur des textes, c’est faire de la botanique sur un herbier, de la zoologie sur des dépouilles d’animaux plus ou moins conservées.”
Inaugural speech from Charles de Tourtoulon at the Académie des sciences, agriculture, arts et belles lettres, Aix-en-Provence, 1897. (For a detailed interpretation see the introduction of my PhD thesis)

Practical reasons

  • (Lui & Baldwin 2014)

    • A frontier area due to their dissimilarity with existing corpora
  • (Krishnamurthy et al. 2008)

    • Availability and ease of use
    • Immediacy of the information presented
    • Volume and variability of the data contained
    • Presence of geolocated messages

My study of 2013 concerning other social networks (Crawling microblogging services to gather language-classified URLs ...

more ...

Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed,, and Reddit) to use them as web crawling seeds. At least by the last two of them a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case concerning Twitter for example.


  1. These platforms account for a relative diversity of user profiles.
  2. Documents that are most likely to be important are being shared.
  3. It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
  4. Microblogging services are a good alternative to overcome the limitations of seed URL collections (as well as the biases implied by search engine optimization techniques and link classification).

Characteristics so far:

  • The messages themselves are not being stored (links are filtered on the fly using a series of heuristics).
  • The URLs that are obviously pointing to media documents are discarded, as the final purpose is to be able to build a text corpus.
  • This approach is ‘static’, as it does not rely on any long poll requests, it actively fetches the required pages.
  • Starting from the main public timeline, the scripts aim at ...
more ...

Microsoft to analyze social networks to determine comprehension level

I recently read that Microsoft was planning to analyze several social networks in order to know more about users, so that the search engine could deliver more appropriate results. See this article on : Microsoft idea: Analyze social networks posts to deduce mood, interests, education.

Among the variables that are considered, the ‘sophistication and education level’ of the posts is mentionned. This is highly interesting, because it assumes a double readability assessment, on the reader’s side and on the side of the search engine. More precisely, this could refer to a classification task.

Here is an extract of a patent describing how this is supposed to work.

[0117] In addition to skewing the search results to the user’s inferred interests, the user-following engine 112 may further tailor the search results to a user’s comprehension level. For example, an intelligent processing module 156 may be directed to discerning the sophistication and education level of the posts of a user 102. Based on that inference, the customization engine may vary the sophistication level of the customized search result 510. The user-following engine 112 is able to make determinations about comprehension level several ways, including from a user’s ...

more ...