Bits of Language: corpus linguistics, NLP and text analytics

How to download web pages in parallel and follow politeness rules in Python

Optimizing downloads is crucial when gathering data from a series of websites. However, one should respect “politeness” rules. Here is a simple way to keep an eye on all these constraints at once.
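As a rough illustration of the idea, the sketch below groups URLs by host and releases at most one URL per host once a minimum delay has elapsed. The function names (`group_by_host`, `next_batch`) and the 5-second default are assumptions made for this example, not the code from the post; it uses only the standard library and performs no network requests.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


def group_by_host(urls):
    """Bucket URLs by host name so that each host can be throttled separately."""
    buckets = defaultdict(deque)
    for url in urls:
        buckets[urlsplit(url).netloc].append(url)
    return buckets


def next_batch(buckets, last_seen, delay=5.0, now=None):
    """Return at most one URL per host whose last request is at least `delay` seconds old."""
    now = time.monotonic() if now is None else now
    batch = []
    for host, queue in buckets.items():
        # Hosts that were never contacted are always eligible
        if queue and now - last_seen.get(host, float("-inf")) >= delay:
            batch.append(queue.popleft())
            last_seen[host] = now
    return batch
```

Each batch only contains URLs from distinct hosts, so it could be handed to a thread pool (for instance `concurrent.futures.ThreadPoolExecutor`) without sending simultaneous requests to the same server.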

more ...

An easy way to save time and resources: content-aware URL filtering

Avoid wasting bandwidth and processing time on web pages that are probably not worth the effort. Stay away from pages with little text in the target language, or focus on other pages to gather links.

more ...

Web scraping with R: Text and metadata extraction

Why choose between R and Python when you can have both? This tutorial shows how to install a Python scraper and use it for content discovery and text extraction, all straight from R.

more ...

Using RSS and Atom feeds to collect web pages with Python

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to optimize link discovery and document filtering.
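To give a sense of what link discovery in feeds involves, here is a minimal sketch that collects entry URLs from an RSS 2.0 or Atom document using only the standard library. The function name `extract_feed_links` is made up for this example, and the approach is a simplified stand-in rather than the code from the post; real-world feeds are often messier than this handles.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def extract_feed_links(feed_xml):
    """Collect entry URLs from an RSS 2.0 or Atom feed given as a string."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom: <entry><link href="URL"/></entry>, where rel defaults to "alternate"
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                links.append(link.get("href"))
    return links
```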

more ...

Validating TEI-XML documents with Python

Here are two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative. The tutorial shows how to parse and validate XML documents, either using a shortcut or detailing each step.

more ...

Extracting the main text content from web pages using Python

Trafilatura is a Python library designed to download, parse, and scrape web page data. It also offers tools that can easily help with website navigation and extraction of links from sitemaps and feeds.

more ...

A module to extract date information from web pages

The date is a critical component of web archives, since it is one of the few pieces of metadata that are relevant for philologists and information scientists alike. The Python library htmldate provides a way to extract the creation or modification date of web pages.

more ...

Indexing text with Elasticsearch

The Lucene-based search engine Elasticsearch is fast and adaptable, so it suits even the most demanding configurations, including large text corpora. I use it daily with tweets and have begun to release the scripts I use to do so. In this post, I give concrete tips for text indexing and linguistic analysis.

Mapping

You do not need to define a type for the indexed fields, as the database can guess it for you. However, using a mapping speeds up the process and gives more control. The official documentation is extensive and it is sometimes difficult to get a general idea …

more ...

Parsing and converting HTML documents to XML format using Python’s lxml

The Internet is vast and full of different things. There are even tutorials explaining how to convert to or from XML formats using regular expressions. While this may work for very simple steps, as soon as exhaustive conversions and/or quality control is needed, working on a parsed document is the way to go.

In this post, I describe how I work using Python’s lxml module. I take the example of HTML to XML conversion, more specifically XML complying with the guidelines of the Text Encoding Initiative, also known as XML TEI.

Update: I released a Python module that …

more ...

Rule-based URL cleaning for text collections

I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script named clean_urls.py, which I made available under an open-source license as part of the FLUX-toolchain.

The following Python-based regular expressions show how malformed URLs, URLs leading to irrelevant content as well as URLs which obviously lead to adult content and spam can be filtered using a rule-based approach.
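A minimal sketch of such a rule-based filter could look like the following. The patterns and the function name `keep_url` are hypothetical examples written for illustration; they are not the rules used in clean_urls.py.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns for illustration only, not the rules from clean_urls.py
MEDIA_EXT = re.compile(r"\.(?:jpe?g|png|gif|mp3|mp4|avi|pdf|zip)$", re.IGNORECASE)
SPAM_HINT = re.compile(r"\b(?:casino|viagra|xxx)\b", re.IGNORECASE)


def keep_url(url):
    """Return True if the URL passes all rules, False if any rule rejects it."""
    parts = urlsplit(url.strip())
    # Malformed: missing or unexpected scheme, or no dot in the host name
    if parts.scheme not in ("http", "https") or "." not in parts.netloc:
        return False
    # Irrelevant content: media files and archives rather than text pages
    if MEDIA_EXT.search(parts.path):
        return False
    # Obvious spam or adult keywords anywhere in the URL
    if SPAM_HINT.search(url):
        return False
    return True
```

Keeping each rule as a separate compiled pattern makes it easy to add, remove, or log individual rules when a list of URLs is filtered, e.g. with `[u for u in urls if keep_url(u)]`.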

Avoid recurrent sites and patterns to save bandwidth

First, it can be useful to make sure that the URL was properly …

more ...