The Microblog Explorer project gathers URLs from social networks (FriendFeed, identi.ca, and Reddit) in order to use them as web crawling seeds. For at least the last two, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.
- These platforms account for a relative diversity of user profiles.
- Documents that are most likely to be important are being shared.
- It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
- Microblogging services are a good alternative to overcome the limitations of seed URL collections (as well as the biases implied by search engine optimization techniques and link classification).
Characteristics so far:
- The messages themselves are not being stored (links are filtered on the fly using a series of heuristics).
- The URLs that are obviously pointing to media documents are discarded, as the final purpose is to be able to build a text corpus.
- This approach is ‘static’: it does not rely on any long poll requests but actively fetches the required pages.
- Starting from the main public timeline, the scripts aim at finding interesting users or friends of users.
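As an illustration of the on-the-fly filtering described above, here is a minimal sketch that discards URLs obviously pointing to media documents. The extension list and the function names `is_media_url` and `filter_links` are assumptions made for this example; the heuristics actually used by the scripts may differ.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions treated as non-text media documents.
MEDIA_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".bmp",
    ".mp3", ".ogg", ".wav",
    ".mp4", ".avi", ".mov", ".flv",
)


def is_media_url(url):
    """Return True if the URL path ends with a known media extension."""
    path = urlparse(url).path.lower()
    return path.endswith(MEDIA_EXTENSIONS)


def filter_links(urls):
    """Yield candidate text URLs one by one; the messages are never stored."""
    seen = set()
    for url in urls:
        if url in seen or is_media_url(url):
            continue
        seen.add(url)
        yield url
```

Working with a generator keeps the filter streaming: each link is examined once as it comes in, and only the surviving URLs are passed on.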
Regarding the first three, the scripts are just a few tweaks away from delivering this kind of content. Feel free to contact me if you want them to suit your needs. Other interests include microtext corpus building and analysis, social network sampling or network visualization, but they are not my priority right now.
FriendFeed seems to be the most active of the three microblogging services considered. It works as an aggregator, which makes it interesting.
No explicit API limits are enforced, but sending too many requests leads to unresponsive servers.
Among the options I developed, I would like to highlight a so-called ‘smart deep crawl’, which targets the interesting users and friends, i.e. the ones through whom a significant number of relevant URLs were found or are expected to be found.
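The selection step behind such a deep crawl could look roughly like the sketch below. It is not the project's actual code: the function name `select_interesting_users`, the `(user_id, url)` record format and the `min_urls` threshold are all assumptions chosen for illustration.

```python
def select_interesting_users(url_records, min_urls=3):
    """Rank users by the number of distinct relevant URLs found through them.

    `url_records` is an iterable of (user_id, url) pairs that survived
    filtering. Users below the (hypothetical) `min_urls` threshold are dropped.
    """
    per_user = {}
    for user_id, url in url_records:
        per_user.setdefault(user_id, set()).add(url)
    # Most productive users first; they are the first candidates for a deep crawl.
    ranked = sorted(per_user, key=lambda user: len(per_user[user]), reverse=True)
    return [user for user in ranked if len(per_user[user]) >= min_urls]
```

The returned list can then serve as the queue of users (and, in a second pass, their friends) to visit in depth.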
identi.ca (public timeline closed in Feb. 2013)
identi.ca is built on open source tools and open standards, which is why I chose to crawl it first. The Microblog Explorer made it possible to gather external and internal links. Advantages included the CC license of the messages, the absence of limitations (to my knowledge) and the relatively small number of messages (which can also be a problem).
The Microblog Explorer featured an hourly crawl, which scanned a few pages of the public timeline, and a long-distance miner, which fetched a given list of users and analyzed them one by one.
On Reddit, there are 15 target languages available so far: Croatian, Czech, Danish, Finnish, French, German, Hindi, Italian, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish and Swedish.
Target languages are defined using subreddits (via so-called ‘multi-reddit expressions’). Here is an example to target possibly Norwegian users: http://www.reddit.com/r/norge+oslo+norskenyheter
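A multi-reddit expression is simply a list of subreddit names joined with `+`. The short sketch below builds such a URL; the helper name `multireddit_url` and the `TARGETS` mapping are hypothetical, and only the Norwegian example from the text is included.

```python
def multireddit_url(subreddits, base="http://www.reddit.com/r/"):
    """Join subreddit names with '+' to form a multi-reddit URL."""
    names = [name.strip() for name in subreddits if name.strip()]
    if not names:
        raise ValueError("at least one subreddit name is required")
    return base + "+".join(names)


# Hypothetical mapping from target language to subreddits.
TARGETS = {
    "Norwegian": ["norge", "oslo", "norskenyheter"],
}

for language, subreddits in TARGETS.items():
    print(language, multireddit_url(subreddits))
```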
Sadly, it is currently not possible to go back in time further than the 500th most recent post due to API limitations. Experience shows that user traversals as well as weekly crawls help to address this issue by accumulating a small but significant number of URLs.
The two main problems I tried to address are spam and the large number of URLs that link to web pages in English. My take is that the networks analyzed here tend to be dominated by English-speaking users and spammers.
The URL harvesting works as follows: during a social network traversal, obvious spam and URLs leading to non-text documents are filtered out. In some cases, the short message is then analyzed by a spell checker in order to see if it could be English text. Optionally, user IDs are recorded for later crawls.
Using a spell checker (enchant and its library for Python), the scripts apply thresholds (expressed as a percentage of tokens which do not pass the spell check) in order to discriminate between links whose titles are mostly English and others, which are thus expected to be in another language. This operation often cuts the number of microtexts in half and makes it possible to select particular users. Tests show that the probability of finding URLs that lead to English text is indeed much higher for the lists considered ‘suspicious’. This option can be deactivated.
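The thresholding idea can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names, the tokenizer and the 0.5 threshold are made up for the example, and a toy word set stands in for a real dictionary so that the snippet runs without extra dependencies. With pyenchant and an English dictionary installed, `enchant.Dict("en_US").check` could be passed as `is_known_word` instead.

```python
import re

# Sequences of letters only (digits and underscores are excluded).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def unknown_ratio(text, is_known_word):
    """Return the share of tokens failing the spell check, or None if no tokens."""
    tokens = TOKEN_RE.findall(text)
    if not tokens:
        return None
    unknown = sum(1 for token in tokens if not is_known_word(token))
    return unknown / len(tokens)


def looks_english(text, is_known_word, threshold=0.5):
    """Flag a text as probably English if few tokens fail the spell check.

    The threshold value is a hypothetical default, not the one used in the scripts.
    """
    ratio = unknown_ratio(text, is_known_word)
    return ratio is not None and ratio < threshold


# Toy stand-in for a real English dictionary (assumption for this example).
TOY_ENGLISH = {"the", "new", "is", "a", "of", "and", "best", "video", "ever"}


def toy_check(word):
    return word.lower() in TOY_ENGLISH


print(looks_english("The best video ever", toy_check))
print(looks_english("Det beste fra Oslo og Bergen", toy_check))
```

Links whose titles are flagged as probably English end up in the ‘suspicious’ list, while the others are kept as candidates for the target language.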
This approach can be used with other languages as well, but I have not tried it so far. There is no language filtering on identi.ca, as the number of URLs remaining after the spam filter stays small enough to gather them all.
The technology-prone users account for numerous short messages which over-represent their own interests and hobbies, and there is nothing to do (or to filter) about it…
The code is available on GitHub: https://github.com/adbar/microblog-explorer
- A. Barbaresi, “Crawling microblogging services to gather language-classified URLs. Workflow and case study”, in 51st Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, Sofia, Bulgaria, 2013, pp. 9-15.