Bits of Language: corpus linguistics, NLP and text analytics

Web scraping with R: Text and metadata extraction

Why choose between R and Python when you can have both? This tutorial shows how to install a Python scraper and use it for content discovery and text extraction, all straight from R.

Filtering links to gather texts on the web

Courlan is a command-line tool and Python library designed to clean, filter, normalize, and sample URLs. Its primary purpose is to optimize web crawling by focusing on web pages containing primarily spam-free text in a target language.
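
To make the cleaning, filtering, and normalization steps more concrete, here is a minimal sketch using only the Python standard library. It does not use Courlan's actual API: the function names, the list of tracking parameters, and the list of skipped file extensions are all assumptions chosen for illustration, and language targeting and sampling are left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters to strip (illustrative, not Courlan's list)
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}
# Hypothetical extensions of resources unlikely to contain running text
SKIPPED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip", ".css", ".js")


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and tracking parameters."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def is_text_candidate(url):
    """Keep only http(s) URLs that do not point to obvious non-text resources."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    return not parts.path.lower().endswith(SKIPPED_EXTENSIONS)


def filter_urls(urls):
    """Normalize, filter, and deduplicate URLs while preserving input order."""
    seen, kept = set(), []
    for url in urls:
        normalized = normalize_url(url)
        if is_text_candidate(normalized) and normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept
```

Normalizing before deduplicating means that variants of the same page (different letter case in the host, tracking parameters, fragments) collapse into a single entry, which is one way to avoid crawling the same content twice.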

Evaluation of date extraction tools for Python

htmldate outperforms the other Python solutions and is also noticeably faster. It greatly extends date extraction coverage without sacrificing precision, especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English.

Evaluating scraping and text extraction tools for Python

Python packages are compared with respect to robustness and speed. A raw text extraction test on boilerplate and content segments reveals which web scraping tool is best suited to the html2text challenge.
