Using RSS and Atom feeds to collect web pages with Python

Feeds are a convenient way to get hold of a website’s most recent publications. Used alone or together with text extraction, they allow for regular updates to databases of web documents, so that recent documents can be collected and stored for further use. Furthermore, it is often useful to download only portions of a website programmatically in order to save time and resources, which makes a feed-based approach both light-weight and easy to maintain.

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to …

more ...
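As a rough illustration of the feed-based approach outlined above, the following sketch combines the feedparser library with trafilatura for downloading and text extraction; the feed URL and the structure of the returned records are placeholders for this example, not the code snippets from the post itself.

```python
import feedparser
import trafilatura

# Placeholder feed address; any RSS or Atom feed URL can be used here
FEED_URL = "https://example.org/feed.xml"

def collect_recent_documents(feed_url):
    """Fetch a feed, then download and extract the text of each linked page."""
    parsed = feedparser.parse(feed_url)
    documents = []
    for entry in parsed.entries:
        # Each feed entry points to a recently published page
        html = trafilatura.fetch_url(entry.link)
        if html is None:
            continue
        text = trafilatura.extract(html)
        if text:
            documents.append({
                "url": entry.link,
                "title": entry.get("title", ""),
                "text": text,
            })
    return documents

if __name__ == "__main__":
    for doc in collect_recent_documents(FEED_URL):
        print(doc["url"], "->", len(doc["text"]), "characters")
```

Since only the pages listed in the feed are fetched, repeated runs of such a script stay cheap compared to crawling the whole site.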

Ad hoc and general-purpose corpus construction from web sources

While the pervasiveness of digital communication is undeniable, the numerous traces left by users and customers are mostly collected and used for commercial purposes. The creation of digital research objects should provide the scientific community with ways to access and analyze them. Particularly in linguistics, the diversity and quantity of texts present on the internet have to be better assessed in order to make current text corpora available, allowing for the description of the variety of language uses and of ongoing changes. In addition, transferring the field of analysis from traditional written text corpora to texts taken from the web results in the creation …

more ...