Bits of Language: corpus linguistics, NLP and text analytics

On the creation and use of social media resources

Reflections after a workshop on computer-mediated communication and social media: besides the consensus on tweet IDs as the exchange currency for replication studies, open questions remain concerning data re-use for existing linguistic archives.


Collection and indexing of tweets with a geographical focus

This paper introduces a geographically focused Twitter corpus in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.
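To make the filtering step in (2) more concrete, here is a minimal sketch of a geographical filter. It is not the method used in the paper: the tweet fields (`id`, `lon`, `lat`), the bounding box values, and the function names are all assumptions made for illustration, and a real region would probably need a polygon rather than a box.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical bounding box (min_lon, min_lat, max_lon, max_lat).
# The values are illustrative only and do not come from the paper.
BBOX: Tuple[float, float, float, float] = (10.0, 45.0, 15.0, 50.0)


def in_bbox(lon: float, lat: float, bbox: Tuple[float, float, float, float] = BBOX) -> bool:
    """Return True if the point lies inside the bounding box (borders included)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def filter_by_region(tweets: Iterable[Dict], bbox: Tuple[float, float, float, float] = BBOX) -> Iterator[Dict]:
    """Yield only the tweets whose coordinates fall inside the box.

    Each tweet is assumed to be a dict with optional 'lon' and 'lat' keys;
    tweets without coordinates are skipped.
    """
    for tweet in tweets:
        lon, lat = tweet.get("lon"), tweet.get("lat")
        if lon is None or lat is None:
            continue
        if in_bbox(lon, lat, bbox):
            yield tweet


sample = [
    {"id": 1, "lon": 12.0, "lat": 47.0},  # inside the box
    {"id": 2, "lon": 2.0, "lat": 48.0},   # outside the box
    {"id": 3},                            # no coordinates
]
selected_ids = [t["id"] for t in filter_by_region(sample)]
```

Only a fraction of tweets usually carry coordinates, so a filter of this kind would probably have to be combined with other signals such as profile locations.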

Barbaresi, A. (2016). Collection and Indexing of Tweets with a Geographical Focus, in Proceedings of the 4th Workshop on Challenges in the Management of Large Corpora (LREC 2016), pp. 24-27.

Why Twitter?

“To do linguistics on texts is to do botany on a herbarium, and zoology on remains of more or less well-preserved animals.”

“Faire de la linguistique sur des textes, c’est faire de la …


Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. For at least the last two of them, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.
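The idea of turning shared links into crawling seeds can be sketched as follows. This is not the Microblog Explorer code: the regular expression, the function name `extract_seeds`, and the deduplication key are assumptions chosen for illustration, and the posts are expected to be plain strings already fetched from a platform.

```python
import re
from typing import Iterable, List
from urllib.parse import urlsplit

# Simple pattern for http(s) URLs; real posts may need more careful handling.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_seeds(posts: Iterable[str]) -> List[str]:
    """Collect unique URLs from post texts, in order of first appearance."""
    seen = set()
    seeds = []
    for text in posts:
        for match in URL_PATTERN.findall(text):
            # Drop punctuation that commonly trails a URL in running text.
            url = match.rstrip(".,;:!?)")
            parts = urlsplit(url)
            if not parts.netloc:
                continue
            # Deduplicate on host (case-insensitive), path and query,
            # ignoring the scheme and any fragment.
            key = (parts.netloc.lower(), parts.path, parts.query)
            if key not in seen:
                seen.add(key)
                seeds.append(url)
    return seeds


posts = [
    "Read http://example.org/a, it's good",
    "again http://EXAMPLE.org/a and https://example.net/b?x=1.",
]
seeds = extract_seeds(posts)
```

The resulting list could then be handed to a crawler as its starting frontier; filtering by language or domain would be a separate step.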

Hypotheses:

These platforms account for a relative diversity of user profiles.
Documents that are most likely to be important are being shared.
It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
Microblogging services are …


Microsoft to analyze social networks to determine comprehension level

I recently read that Microsoft was planning to analyze several social networks in order to learn more about users, so that the search engine could deliver more appropriate results. See this article on geekwire.com: Microsoft idea: Analyze social networks posts to deduce mood, interests, education.

Among the variables that are considered, the ‘sophistication and education level’ of the posts is mentioned. This is highly interesting, because it assumes a double readability assessment, on the reader’s side and on the side of the search engine. More precisely, this could refer to a classification task.

Here is an extract of …
