Using RSS and Atom feeds to collect web pages with Python

Feeds are a convenient way to get hold of a website’s most recent publications. Used alone or together with text extraction, they allow for regular updates to databases of web documents, so that recent documents can be collected and stored for further use. Since downloading only the relevant portions of a website programmatically saves time and resources, a feed-based approach is light-weight and easy to maintain.

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to …
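As a minimal sketch of the idea, the recent URLs advertised by an Atom feed can be collected with Python’s standard library alone (the feed string and URLs below are made-up placeholders; the post itself may rely on a dedicated feed library):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def extract_entry_links(feed_xml):
    """Return the URLs of all entries found in an Atom feed string."""
    root = ET.fromstring(feed_xml)
    links = []
    for entry in root.iter(ATOM_NS + "entry"):
        link = entry.find(ATOM_NS + "link")
        if link is not None:
            links.append(link.get("href"))
    return links

# hypothetical feed content for illustration
sample = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>Post 1</title><link href="https://example.org/post-1"/></entry>
  <entry><title>Post 2</title><link href="https://example.org/post-2"/></entry>
</feed>"""

print(extract_entry_links(sample))
```

In practice the feed would be fetched over HTTP first; the extracted links can then be handed to a text-extraction step and stored.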

more ...

Using sitemaps to crawl websites on the command-line

This post describes practical ways to crawl websites by working with sitemaps on the command-line. Sitemaps are particularly useful since they are designed so that machines can crawl a site more intelligently. The post contains all necessary code snippets to optimize link discovery and filtering.
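The core of the approach can be sketched with standard Unix tools: pull the `<loc>` entries out of a sitemap file and strip the tags (the sitemap content below is a made-up placeholder; the post may use different tooling):

```shell
# write a hypothetical sitemap to disk for illustration
cat <<'EOF' > sitemap.xml
<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/page-1</loc></url>
  <url><loc>https://example.org/page-2</loc></url>
</urlset>
EOF

# extract the URLs between <loc> tags, one per line
grep -o '<loc>[^<]*</loc>' sitemap.xml | sed -e 's/<loc>//' -e 's|</loc>||'
```

A real run would fetch the sitemap first (e.g. with `curl` or `wget`) and pipe the result through the same filter before link filtering and download.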

more ...