Using a rule-based tokenizer for German

Tokenization is a text segmentation process whose objective resides in dividing written text into meaningful units. This post introduces two rule-based methods to perform tokenization on German, English and beyond.

more ...

Using RSS and Atom feeds to collect web pages with Python

Feeds are a convenient way to get hold of a website’s most recent publications. Used alone or together with text extraction, they can allow for regular updates to databases of web documents. That way, recent documents can be collected and stored for further use. Furthermore, it can be useful to download the portions of a website programmatically to save time and resources, so that a feed-based approach is light-weight and easier to maintain.

This post describes practical ways to find recent URLs within a website and to extract text, metadata, and comments. It contains all necessary code snippets to …

more ...

A simple multilingual lemmatizer for Python

Task at hand: lemmatization ≠ stemming

In computer science, canonicalization (also known as standardization or normalization) is a process for converting data that has more than one possible representation into a standard, normal, or canonical form. In morphology and lexicography, a lemma is the canonical form of a set of words. In English, for example, run, runs, ran and running are forms of the same lexeme (run) which can be selected to represent all its possible forms.

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, identified by …

more ...

Using sitemaps to crawl websites on the command-line

This post describes practical ways to crawl websites and by working with sitemaps on the command-line. Sitemaps are particularly useful since they are made so that machines can more intelligently crawl the site. The post entails all necessary code snippets to optimize link discovery and filtering.

more ...

Filtering links to gather texts on the web

The issue with URLs and URIs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI).

Both navigation on the Web and web crawling rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs).” (Berners-Lee et al., 2006) That being said, URLs cannot be expected to be entirely reliable. Especially as part of the Web 2.0 content …

more ...

Evaluation of date extraction tools for Python

Introduction

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult, and an important problem remains as to the most efficient way to gather language data. Metadata extraction is part of data mining and knowledge extraction techniques. Dates are critical components since they are relevant both from a philological standpoint and in the context of information technology.

In most cases, immediately accessible data on retrieved webpages do not carry substantial or accurate information: neither the URL nor the server response provide a reliable way to date a web document, i.e. to find …

more ...

Evaluating scraping and text extraction tools for Python

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult. They come in different shapes and sizes mostly because of the wide variety of platforms and content management systems, and not least because of varying reasons and diverging goals followed during web publication.

This wide variety of contexts and text genres leads to important design decisions during the collection of texts: should the tooling be adapted to particular news outlets or blogs that are targeted (which often amounts to the development of web scraping tools) or should the extraction be as generic as …

more ...

Validating TEI-XML documents with Python

This post introduces two ways to validate XML documents in Python according the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first one takes a shortcut using a library I am working on, while the second one shows an exhaustive way to perform the operation.

Both ground on LXML, an efficient library for processing XML and HTML. The following lines of code will try to parse and validate a document in the same directory as the terminal window or Python console.

Shortcut with the trafilatura library

I am currently using this web scraping …

more ...

Two studies on toponyms in literary texts

Context

Because it is impossible for individuals to “read” everything in a large corpus, advocates of distant reading employ computational techniques to “mine” the texts for significant patterns and then use statistical analysis to make statements about those patterns (Wulfman 2014).

Although the attention of linguists is commonly drawn to forms other than proper nouns, the significance of place names in particular exceeds the usual frame of deictic and indexical functions, as they encapsulate more than a mere reference in space. In a recent publication, I present two studies that center on the visualization of place names in literary texts …

more ...