I would like to present work on the major social bookmarking and microblogging platform Reddit, which I recently introduced at the NLP4CMC workshop 2015. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.
The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on the platform archive.org at the beginning of July 2015 and claimed to have any publicly available comment.
In order to focus on German comments, I use a two-tiered filter in order to deliver a hopefully well-balanced performance between speed and accuracy. The first filter uses a spell-checking algorithm (delivered by the enchant library), and the second resides in my language identification tool of choice, langid.py.
The corpus is comparatively small (566,362 tokens), due to the fact that Reddit is almost exclusively an English-speaking platform. The number of tokens tagged as proper nouns (NE) is particularly high (14.4\%), which exemplifies the perplexity of the tool itself, for example because the redditors refer to trending and possibly short-lived notions and celebrities ...more ...