Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) in order to use them as web crawling seeds. For at least the last two of them, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.

Hypotheses:

  1. These platforms reflect a relatively diverse range of user profiles.
  2. The documents being shared are those most likely to be important.
  3. It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
  4. Microblogging services are a good way to overcome the limitations of seed URL collections (as well as the biases implied by search engine optimization techniques and link classification).

Characteristics so far:

  • The messages themselves are not being stored (links are filtered on the fly using a series of heuristics).
  • URLs that obviously point to media documents are discarded, as the final purpose is to build a text corpus.
  • This approach is ‘static’: it does not rely on any long-poll requests but actively fetches the required pages.
  • Starting from the main public timeline, the scripts aim at …
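
To make the on-the-fly filtering more concrete, here is a minimal sketch in Python. It is not the project's actual code: the URL pattern, the list of media extensions, and the function names `is_media_url` and `extract_seed_urls` are all assumptions chosen for illustration. It only shows the general idea of extracting links from messages, dropping obvious media files, and never keeping the messages themselves.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of "obviously media" extensions (an assumption,
# not the list used by the project).
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif",
                    ".mp3", ".mp4", ".avi", ".mov")

# Deliberately simple pattern for http(s) links (also an assumption).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def is_media_url(url):
    """Return True if the URL path ends with a known media extension."""
    path = urlparse(url).path.lower()
    return path.endswith(MEDIA_EXTENSIONS)


def extract_seed_urls(messages):
    """Yield unique non-media URLs; the messages are never stored."""
    seen = set()
    for message in messages:
        for url in URL_PATTERN.findall(message):
            # Strip punctuation that commonly trails a link in running text.
            url = url.rstrip(".,;:!?)")
            if is_media_url(url) or url in seen:
                continue
            seen.add(url)
            yield url


messages = [
    "Read this http://example.org/article.html, great!",
    "pic http://example.org/photo.JPG",
    "again http://example.org/article.html",
]
seeds = list(extract_seed_urls(messages))
```

Because `extract_seed_urls` is a generator, each message can be discarded as soon as its links have been examined. Only the set of URLs already seen is kept in memory.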