Bits of Language: corpus linguistics, NLP and text analytics

Evaluation of date extraction tools for Python

htmldate performs better than the other Python solutions and is also noticeably faster. It greatly extends date extraction coverage without sacrificing precision, especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English.

Evaluating scraping and text extraction tools for Python

Python packages are compared with respect to robustness and speed. Extracting raw text from boilerplate and content segments reveals which web scraping tool is better suited to the html2text challenge.
