Bits of Language: corpus linguistics, NLP and text analytics

Using RSS and Atom feeds to collect web pages with Python

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to optimize link discovery and document filtering.

more ...

Ad hoc and general-purpose corpus construction from web sources

The diversity and quantity of texts present on the Internet have to be better assessed to allow for the description of language with its diversity and change. Focusing on actual construction processes leads to better corpus design, beyond simple collections or heterogeneous resources.

more ...