Using RSS and Atom feeds to collect web pages with Python

Feeds are a convenient way to get hold of a website’s most recent publications. Used alone or together with text extraction, they allow for regular updates to databases of web documents, so that recent documents can be collected and stored for further use. Since downloading only the relevant portions of a website programmatically saves time and resources, a feed-based approach is light-weight and easy to maintain.

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to …
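As a minimal sketch of the idea, the recent URLs advertised by an Atom feed can be collected with Python’s standard library alone (the feed string and URLs below are made-up placeholders; the post itself may rely on a dedicated feed library):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def extract_entry_links(feed_xml):
    """Return the URLs of all entries found in an Atom feed string."""
    root = ET.fromstring(feed_xml)
    links = []
    for entry in root.iter(ATOM_NS + "entry"):
        link = entry.find(ATOM_NS + "link")
        if link is not None:
            links.append(link.get("href"))
    return links

# hypothetical feed content for illustration
sample = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>Post 1</title><link href="https://example.org/post-1"/></entry>
  <entry><title>Post 2</title><link href="https://example.org/post-2"/></entry>
</feed>"""

print(extract_entry_links(sample))
```

In practice the feed would be fetched over HTTP first; the extracted links can then be handed to a text-extraction step and stored.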

more ...

Using sitemaps to crawl websites on the command-line

This post describes practical ways to crawl websites by working with sitemaps on the command-line. Sitemaps are particularly useful since they are designed so that machines can crawl a site more intelligently. The post contains all necessary code snippets to optimize link discovery and filtering.
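The core of the approach can be sketched with standard Unix tools: pull the `<loc>` entries out of a sitemap file and strip the tags (the sitemap content below is a made-up placeholder; the post may use different tooling):

```shell
# write a hypothetical sitemap to disk for illustration
cat <<'EOF' > sitemap.xml
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/page-1</loc></url>
  <url><loc>https://example.org/page-2</loc></url>
</urlset>
EOF

# extract the URLs between <loc> tags, one per line
grep -o '<loc>[^<]*</loc>' sitemap.xml | sed -e 's/<loc>//' -e 's|</loc>||'
```

A real run would fetch the sitemap first (e.g. with `curl` or `wget`) and pipe the result through the same filter before link filtering and download.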

more ...