Bits of Language: corpus linguistics, NLP and text analytics

Using sitemaps to crawl websites on the command-line

Sitemaps are particularly useful for web crawling because they let machines crawl a site more intelligently. The post contains all the code snippets needed to optimize link discovery and filtering and to work with sitemaps on the command-line.
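As a rough illustration of link discovery from a sitemap, here is a minimal sketch that uses only the Python standard library. The function name `extract_sitemap_links` and the sample document are assumptions made for this example and are not taken from the post. The namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sample document, for illustration only
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/post-1</loc></url>
  <url><loc> https://example.org/post-2 </loc></url>
</urlset>"""


def extract_sitemap_links(xml_text):
    """Return (page_urls, nested_sitemap_urls) found in a sitemap document."""
    root = ET.fromstring(xml_text)
    pages, sitemaps = [], []
    for entry in root:
        loc = entry.find(SITEMAP_NS + "loc")
        if loc is None or not loc.text:
            continue
        link = loc.text.strip()
        if entry.tag == SITEMAP_NS + "url":
            # <url> entries point to regular pages
            pages.append(link)
        elif entry.tag == SITEMAP_NS + "sitemap":
            # <sitemap> entries (in a sitemap index) point to further sitemaps
            sitemaps.append(link)
    return pages, sitemaps


pages, sitemaps = extract_sitemap_links(SAMPLE)
print(pages)
```

Separating page links from nested sitemap links makes it possible to follow sitemap indexes recursively before filtering the collected URLs.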

more ...

Filtering links to gather texts on the web

Courlan is a command-line tool and Python library designed to clean, filter, normalize, and sample URLs. Its primary purpose is to optimize web crawling by focusing on web pages containing primarily spam-free text in a target language.
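To make the idea of cleaning, normalizing and filtering URLs more concrete, here is a simplified sketch built on the standard library only. It does not use or reproduce Courlan's actual API. The helper names, the list of tracking parameters and the list of unwanted file extensions are all assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of parameters and extensions to discard (not Courlan's lists)
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}
UNWANTED_EXTENSIONS = (".jpg", ".png", ".gif", ".pdf", ".zip", ".css", ".js")


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and tracking parameters."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path or "/"
    # Note: lowercasing the whole netloc also affects any userinfo part
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def is_candidate(url):
    """Keep HTTP(S) links that do not point to obvious non-text resources."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    return not parts.path.lower().endswith(UNWANTED_EXTENSIONS)


def filter_urls(urls):
    """Normalize, filter and deduplicate URLs while preserving their order."""
    seen, kept = set(), []
    for url in urls:
        normalized = normalize_url(url)
        if is_candidate(normalized) and normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept
```

Calling `filter_urls` on a list of raw links returns the unique, normalized candidates in their original order. A real tool would add further checks, such as language hints in the URL or spam detection.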

more ...

Evaluation of date extraction tools for Python

htmldate performs better than the other Python solutions and is also noticeably faster. Especially for smaller news outlets, websites and blogs, as well as pages written in languages other than English, it greatly extends date extraction coverage without sacrificing precision.

more ...

Evaluating scraping and text extraction tools for Python

Python packages are compared with respect to robustness and speed. Raw text extraction of boilerplate and content segments reveals which web scraping tool is best suited to the html2text challenge.

more ...

Validating TEI-XML documents with Python

Here are two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative. The tutorial shows how to parse and validate XML documents, either by using a shortcut or by detailing each step.

more ...

Two studies on toponyms in literary texts

Two studies centering on the visualization of place names in literary texts are introduced. Research on toponym extraction is discussed from an interdisciplinary perspective: distant reading and digital literary studies are not mere numeric accounts.

more ...

Extracting the main text content from web pages using Python

Trafilatura is a Python library designed to download, parse, and scrape web page data. It also offers tools that can easily help with website navigation and extraction of links from sitemaps and feeds.

more ...

Franco-German workshop series on the historical illustrated press

I wrote a blog post on the Franco-German conference and workshop series I am co-organizing with Claire Aslangul (University Paris-Sorbonne) and Bérénice Zunino (University of Franche-Comté). The three events planned revolve around the same topic: the illustrated press in France and Germany from the end of the 19th to the middle of the 20th century, drawing from disciplinary fields as diverse as visual history and computational linguistics. A first workshop will take place in Besançon in April, then a larger conference will be hosted by the Maison Heinrich Heine in Paris at the end of 2018, and finally a workshop …

more ...

On the creation and use of social media resources

Reflections after a workshop on computer-mediated communication and social media: besides the consensus on tweet IDs as exchange currency for replication studies, open questions remain concerning data re-use for existing linguistic archives.

more ...

A module to extract date information from web pages

The date is a critical component for web archives since it is one of the few metadata fields that are relevant for philologists and information scientists alike. The Python library htmldate provides a way to extract the creation or modification date of web pages.
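One common source of date information is the HTML markup itself. The following sketch, which relies on the standard library only, collects date candidates from `meta` tags and `time` elements. It is not htmldate's implementation. The class and function names, as well as the set of meta keys, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed selection of meta keys that often carry a publication date
DATE_META_KEYS = {"article:published_time", "date", "dc.date"}


class DateMetaParser(HTMLParser):
    """Collect raw date strings from meta tags and time elements."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            key = (attributes.get("property") or attributes.get("name") or "").lower()
            if key in DATE_META_KEYS and attributes.get("content"):
                self.candidates.append(attributes["content"])
        elif tag == "time" and attributes.get("datetime"):
            self.candidates.append(attributes["datetime"])


def find_date_candidates(html):
    """Return date strings in document order, without parsing or validation."""
    parser = DateMetaParser()
    parser.feed(html)
    return parser.candidates
```

The function only gathers raw strings. A dedicated library additionally has to parse them, validate them and fall back to other cues, such as the URL or the visible text, when the markup provides nothing.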

more ...