I would like to present work on the major social news aggregation and discussion platform Reddit, which I recently introduced at the NLP4CMC 2015 workshop. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.
Basic idea
The work described in the article follows directly from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on archive.org at the beginning of July 2015 and claimed that it contained every publicly available comment.
Corpus construction
In order to single out German comments, I use a two-tiered filter designed to strike a reasonable balance between speed and accuracy. The first tier uses a spell-checking algorithm (provided by the enchant library), and the second relies on my language identification tool of choice, langid.py.
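To give an idea of how such a two-tiered filter can be put together, here is a minimal sketch using pyenchant and langid.py. It assumes the comments come as a JSON-lines file with a `body` field, as in the published dump; the word-ratio threshold is an illustrative assumption, not the exact setting used for the corpus.

```python
import json

import enchant  # pyenchant, interface to the enchant spell-checking library
import langid   # off-the-shelf language identifier

GERMAN_DICT = enchant.Dict("de_DE")

def looks_german(text, min_ratio=0.5):
    """First tier: fast heuristic based on dictionary lookups.

    Accepts a comment if at least `min_ratio` of its alphabetic tokens
    pass the German spell-checker (the threshold is an assumption).
    """
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return False
    hits = sum(1 for w in words if GERMAN_DICT.check(w))
    return hits / len(words) >= min_ratio

def is_german(text):
    """Second tier: confirm remaining candidates with langid.py."""
    lang, _score = langid.classify(text)
    return lang == "de"

def filter_comments(path):
    """Yield comment bodies judged to be German from a JSON-lines dump."""
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            body = json.loads(line).get("body", "")
            if looks_german(body) and is_german(body):
                yield body
```

The point of the first tier is that dictionary lookups are much cheaper than full language identification, so most English comments can be discarded before langid.py is ever called.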
The corpus is comparatively small (566,362 tokens), since Reddit is an almost exclusively English-speaking platform. The proportion of tokens tagged as proper nouns (NE) is particularly high (14.4%), which illustrates how perplexing the data is for the tagger: redditors refer to trending and possibly short-lived notions and celebrities, and a high proportion of short, elliptic comments fails to provide enough morpho-syntactic context. The comments appear to be relatively evenly distributed across channels and user names.
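For the record, figures such as the NE proportion can be computed from a tagged corpus with a few lines of code. The sketch below assumes a one-token-per-line, tab-separated token/tag format (as produced for instance by TreeTagger with the STTS tagset); the file name is a placeholder.

```python
from collections import Counter

def tag_distribution(path):
    """Count POS tags in a one-token-per-line, tab-separated file."""
    counts = Counter()
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                counts[parts[1]] += 1
    return counts

counts = tag_distribution("german_reddit.tagged")  # placeholder file name
total = sum(counts.values())
print(f"proper nouns (NE): {100 * counts['NE'] / total:.1f}%")
```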
A more detailed analysis of linguistic features is available in the paper.
Visualization of place names
Geographical information about the place names has been compiled from the Geonames database: the tokenized corpus was filtered and matched against the database. The maps below were generated and customized with TileMill.
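A minimal sketch of the matching step could look as follows, assuming a Geonames country extract in the standard tab-separated format (name in the second column, latitude and longitude in the fifth and sixth) and a whitespace-tokenized corpus file. The resulting CSV can then be loaded into TileMill as a layer; file names and the frequency threshold are assumptions.

```python
import csv
from collections import Counter

def load_gazetteer(path):
    """Map place names to (lat, lon) from a Geonames dump (tab-separated)."""
    places = {}
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 5:
                places[fields[1]] = (float(fields[4]), float(fields[5]))
    return places

def match_places(corpus_path, places, min_freq=5):
    """Count tokens that match a gazetteer entry (threshold is an assumption)."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as infile:
        for line in infile:
            for token in line.split():
                if token in places:
                    counts[token] += 1
    return {name: n for name, n in counts.items() if n >= min_freq}

# Write a CSV (name, lat, lon, frequency) usable as a TileMill layer.
places = load_gazetteer("DE.txt")             # placeholder Geonames extract
matches = match_places("corpus.txt", places)  # placeholder corpus file
with open("places.csv", "w", encoding="utf-8", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["name", "lat", "lon", "frequency"])
    for name, freq in matches.items():
        lat, lon = places[name]
        writer.writerow([name, lat, lon, freq])
```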
Maps of extracted place names (Europe on the left, Central Europe with names of frequent places on the right).
The visualizations show that the places mentioned in the corpus are mostly located in German-speaking countries. Beyond the language cues detected, the actual content thus corroborates the hypothesis that the selection process is effective.
There is one notable error on the map: the city of Reus (close to Barcelona) figures prominently, but only because of the German soccer player Marco Reus. There are ways to disambiguate place names from person names, but it can be tricky; I may publish a post on this topic in the future.
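One simple heuristic, sketched below as an assumption rather than the method used for the maps, is to discard a place-name match when the preceding token is a known first name, which would catch the "Marco Reus" case; the first-name list is a placeholder.

```python
FIRST_NAMES = {"Marco", "Thomas", "Mats"}  # placeholder list of common first names

def likely_person(tokens, i):
    """Heuristic: a candidate preceded by a first name is probably a surname."""
    return i > 0 and tokens[i - 1] in FIRST_NAMES

tokens = "Das Tor von Marco Reus war sehenswert".split()
for i, token in enumerate(tokens):
    if token == "Reus" and likely_person(tokens, i):
        print("skipping person name:", token)
```

Such a filter obviously misses many cases (surnames at the start of a sentence, unlisted first names), which is why proper disambiguation remains tricky.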
DIY Reddit Corpora
Since the license restrictions concerning the dataset are unclear, the corpus is only available upon request. However, as I believe that reproducibility of research is a must for such topics, I have put online a Python script (on GitHub) which yields the basis of this work by selecting the German comments, and which can easily be adapted to target other languages, as sketched below.
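For instance, adapting the filtering approach sketched above to French would essentially amount to swapping the language-specific components; the dictionary and language codes below are illustrative assumptions.

```python
import enchant
import langid

# Swapping the language-specific components is enough to retarget the filter.
TARGET_LANG = "fr"
DICTIONARY = enchant.Dict("fr_FR")

def is_target_language(text, min_ratio=0.5):
    """Same two-tiered idea as above, with the target language as a parameter."""
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return False
    ratio = sum(DICTIONARY.check(w) for w in words) / len(words)
    return ratio >= min_ratio and langid.classify(text)[0] == TARGET_LANG
```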
Conclusion
I have shown how a corpus focusing on German can be built from the publicly available Reddit comment dataset. In order to get a first impression of the corpus, I collected quantitative information and offered a visualization of structured data, more precisely of place names, which have to be extracted from the comments since the latter are not geotagged.
Jason Baumgartner announced that he would keep the data up to date; let's see what the future brings.