The Microblog Explorer project gathers URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. For at least the last two, a crawl appears manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.
- These platforms account for a relative diversity of user profiles.
- Documents that are most likely to be important are being shared.
- It becomes possible to cover languages that are rarely seen on the Internet, below the English-speaking spammer’s radar.
- Microblogging services are a good alternative to overcome the limitations of seed URL collections (as well as the biases implied by search engine optimization techniques and link classification).
Characteristics so far:
- The messages themselves are not being stored (links are filtered on the fly using a series of heuristics).
- URLs that obviously point to media documents are discarded, since the final goal is to build a text corpus.
- This approach is ‘static’: it does not rely on any long-polling requests but actively fetches the required pages.
- Starting from the main public timeline, the scripts aim at ...
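
The on-the-fly filtering described above can be illustrated with a minimal Python sketch. It is not the project's actual code: the extension list, the URL pattern, the duplicate check, and the names `is_media_url` and `extract_seed_urls` are all assumptions made for illustration, and the real heuristics may differ.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of extensions treated as "obviously media".
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".mp3", ".mp4", ".avi")

# Deliberately simple URL pattern (an assumption, not the project's own).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def is_media_url(url):
    """Return True if the URL path ends with a known media extension."""
    path = urlparse(url).path.lower()
    return path.endswith(MEDIA_EXTENSIONS)


def extract_seed_urls(messages):
    """Yield candidate seed URLs from an iterable of message texts.

    The messages are only read, never stored: each one is scanned for
    links and then dropped. Skipping duplicates is an extra assumption.
    """
    seen = set()
    for text in messages:
        for url in URL_PATTERN.findall(text):
            # Strip punctuation that often trails a link in running text.
            url = url.rstrip(".,;:!?)")
            if is_media_url(url) or url in seen:
                continue
            seen.add(url)
            yield url
```

Because `extract_seed_urls` is a generator, it can consume a stream of fetched messages and pass surviving links on to the crawler without keeping the messages in memory.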