Bits of Language: corpus linguistics, NLP and text analytics

Web scraping with Trafilatura just got faster

HTML-to-text extraction with the dedicated software Trafilatura just got faster, as measured on the benchmark available in the repository. The speedup follows from two major changes in the package dependencies charset_normalizer and jusText.

An easy way to save time and resources: content-aware URL filtering

Avoid wasting bandwidth and processing time on web pages that are probably not worth the effort. Skip pages with little text in the target language, or treat them only as a source of links to other pages.
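
As a rough illustration of the idea, here is a minimal sketch that relies only on the Python standard library. The language codes, path hints, file extensions and the function names `url_language` and `classify_url` are all assumptions made up for this example. They are not the filtering rules used by Trafilatura or its companion libraries, and such simple heuristics will misclassify some URLs.

```python
from urllib.parse import urlsplit

# Hypothetical heuristics for illustration only (assumed values)
LANG_CODES = {"de", "en", "es", "fr", "it", "pt"}
NAVIGATION_HINTS = ("/tag/", "/category/", "/page/", "/archive/", "/author/")
LOW_TEXT_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip", ".mp4", ".css", ".js")


def url_language(url):
    """Return a language code found as a whole path segment, or None."""
    for segment in urlsplit(url).path.lower().split("/"):
        if segment in LANG_CODES:
            return segment
    return None


def classify_url(url, target_lang="en"):
    """Sort a URL into 'skip', 'links' (link gathering only) or 'content'."""
    path = urlsplit(url).path.lower()
    # Media and asset files rarely contain usable text
    if path.endswith(LOW_TEXT_EXTENSIONS):
        return "skip"
    # An explicit language marker that differs from the target language
    lang = url_language(url)
    if lang is not None and lang != target_lang:
        return "skip"
    # Tag, category and archive pages are mostly useful for their links
    if any(hint in path + "/" for hint in NAVIGATION_HINTS):
        return "links"
    return "content"


if __name__ == "__main__":
    for candidate in (
        "https://example.org/de/blog/post-1",
        "https://example.org/en/tag/python",
        "https://example.org/blog/2021/feeds-python.html",
    ):
        print(candidate, classify_url(candidate, target_lang="en"))
```

Filtering before download is the point here: the decision uses the URL string alone, so no request is made for pages that end up in the "skip" group.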

Using RSS and Atom feeds to collect web pages with Python

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to optimize link discovery and document filtering.