Evaluation of date extraction tools for Python

Introduction

Although text is ubiquitous on the Web, extracting information from web pages can prove difficult, and finding the most efficient way to gather language data remains an open problem. Metadata extraction is part of data mining and knowledge extraction techniques. Dates are critical components, since they are relevant both from a philological standpoint and in the context of information technology.

In most cases, the immediately accessible data on retrieved web pages do not carry substantial or accurate information: neither the URL nor the server response provides a reliable way to date a web document, i.e. to find out when it was written or last modified. In such cases it is necessary to fully parse the document or to apply robust scraping patterns to it.

State of the art

Diverse extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Content extraction mostly draws on Document Object Model (DOM) examination, that is on considering a given HTML document as a tree structure whose nodes represent parts of the document to be operated on. Less thorough and not necessarily faster alternatives use superficial search patterns such as regular expressions in order to capture desirable excerpts …
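To make the contrast concrete, here is a minimal sketch (not taken from the post itself) of the two approaches applied to date extraction; the HTML sample and the meta attribute it queries are illustrative assumptions.

    # Minimal sketch contrasting DOM examination with a superficial search pattern.
    # The HTML sample and the meta attribute name are illustrative assumptions.
    import re
    from lxml import html

    SAMPLE = ('<html><head>'
              '<meta property="article:published_time" content="2019-03-14"/>'
              '</head><body><p>Updated on 14 March 2019.</p></body></html>')

    # DOM examination: parse the document into a tree and query a node directly
    tree = html.fromstring(SAMPLE)
    print(tree.xpath('//meta[@property="article:published_time"]/@content'))  # ['2019-03-14']

    # Superficial search pattern: scan the raw markup with a regular expression
    match = re.search(r'\d{4}-\d{2}-\d{2}', SAMPLE)
    print(match.group(0) if match else None)  # 2019-03-14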

more ...

Evaluating scraping and text extraction tools for Python

Although text is ubiquitous on the Web, extracting information from web pages can prove difficult. Pages come in different shapes and sizes, mostly because of the wide variety of platforms and content management systems, and not least because of the diverging purposes and goals pursued in web publishing.

This wide variety of contexts and text genres leads to important design decisions during the collection of texts: should the tooling be adapted to the particular news outlets or blogs that are targeted (which often amounts to the development of web scraping tools), or should the extraction be as generic as possible, providing opportunistic ways of gathering information? Since time and resources are usually scarce in academia and elsewhere, the second option is often preferable.

Consequently, finding the most efficient way to gather language data remains an important problem. Between CMS idiosyncrasies, bulky pages and malformed HTML, the chosen solution has to be precise, robust and fast at the same time. The purpose of this evaluation is to test currently available alternatives with respect to particular needs for coverage and speed.
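As a rough illustration of what such a comparison involves (this is not the benchmark code from the post, and the two libraries below are only stand-ins, assumed to be installed), one can time a few extractors on the same downloaded page:

    # Minimal comparison sketch; trafilatura and jusText serve as example tools
    # and the URL is a placeholder.
    import time
    import justext
    import trafilatura

    url = "https://www.example.org/article.html"  # placeholder
    downloaded = trafilatura.fetch_url(url)

    if downloaded is not None:
        start = time.time()
        text_trafilatura = trafilatura.extract(downloaded)
        print("trafilatura:", round(time.time() - start, 3), "s")

        start = time.time()
        paragraphs = justext.justext(downloaded, justext.get_stoplist("English"))
        text_justext = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
        print("justext:", round(time.time() - start, 3), "s")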

The current benchmark focuses on Python, reportedly the most popular programming language in academia and one of …

more ...

Using sitemaps to crawl websites

In order to gather web documents, it can be useful to download portions of a website programmatically. This post shows ways to find URLs within a website and to work with URL lists on the command line.

For general information on command-line operations, please refer to Command Prompt (a tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.

Download of sitemaps and extraction of URLs

A sitemap is a file that lists the visible URLs for a given site, the main goal being to reveal where machines can look for content. The retrieval and download of documents within a website is often called crawling. The sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. Sitemaps follow the XML format, so each sitemap is or should be a valid XML file.

Download and filtering

A sitemap.xml file is usually located at the root of a website. If it is present, it is almost always to be found by appending the file name after the domain name and a slash: https://www.sitemaps.org becomes https://www.sitemaps.org/sitemap.xml …
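While the post itself works on the command line, the same two steps (download, then URL extraction) can be sketched in a few lines of Python; the sitemap address below is simply the example given above.

    # Minimal sketch: fetch a sitemap and print the URLs listed in its <loc> elements.
    import urllib.request
    from lxml import etree

    SITEMAP_URL = "https://www.sitemaps.org/sitemap.xml"

    with urllib.request.urlopen(SITEMAP_URL) as response:
        sitemap = response.read()

    # sitemap files use a dedicated XML namespace for their elements
    namespaces = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    tree = etree.fromstring(sitemap)
    for url in tree.xpath("//sm:loc/text()", namespaces=namespaces):
        print(url)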

more ...

Validating TEI-XML documents with Python

This post introduces two ways to validate XML documents in Python according to the guidelines of the Text Encoding Initiative, using a format commonly known as TEI-XML. The first one takes a shortcut using a library I am working on, while the second one shows an exhaustive way to perform the operation.

Both rely on lxml, an efficient library for processing XML and HTML. The following lines of code will try to parse and validate a document located in the same directory as the terminal window or Python console.

Shortcut with the trafilatura library

I am currently using this web scraping library to download web pages, find the main text and the comments while preserving some structure, and convert the output to TXT, XML & TEI-XML. As such, I recently added a way to systematically check whether the TEI-XML documents produced by the library are valid.

The library can be installed with pip or pip3 (depending on the system): pip install lxml trafilatura. As this functionality is new, please update trafilatura if you have already installed it: pip install -U trafilatura.

Trafilatura will seamlessly download the schema on the first call and then return True if a document is valid or …
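For the second, exhaustive route, a minimal sketch with lxml could look as follows; the file names are assumptions, and the TEI RelaxNG schema (e.g. tei_all.rng) has to be available locally.

    # Minimal validation sketch with lxml; file names are placeholders.
    from lxml import etree

    relaxng = etree.RelaxNG(etree.parse("tei_all.rng"))  # TEI schema stored locally
    document = etree.parse("document.xml")               # document in the current directory

    print(relaxng.validate(document))  # True if the document conforms to the schema
    print(relaxng.error_log)           # validation errors, if any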

more ...

Two studies on toponyms in literary texts

Context

Because it is impossible for individuals to “read” everything in a large corpus, advocates of distant reading employ computational techniques to “mine” the texts for significant patterns and then use statistical analysis to make statements about those patterns (Wulfman 2014).

Although the attention of linguists is commonly drawn to forms other than proper nouns, the significance of place names in particular exceeds the usual frame of deictic and indexical functions, as they encapsulate more than a mere reference in space. In a recent publication, I present two studies that center on the visualization of place names in literary texts written in German, with particular emphasis on the concept of visualization, that is on the processes and not on the products (Crampton 2001). I discuss research on toponym extraction and linkage from an interdisciplinary perspective and address questions related to research theory and practice.

Studies

The first case consists of a preliminary study of travel literature based on Richthofen’s Travel Journals from China (Tagebücher aus China, 1907) and relies on manually annotated data. The resulting map retraces the path taken by the author in the Shandong province by combining coordinates, sequences, and a sense of time. In order to …

more ...

Extracting the main text content from web pages using Python

Web corpus construction involves a significant number of design decisions and turning points in data processing. Depending on the purpose of data collection, it may also require substantial filtering and quality assessment. While some large-scale algorithms can be expected to smooth out irregularities, uses requiring a low margin of error and close reading approaches (such as the search for examples in lexicographic research) imply constant refinements and improvements with respect to the building and processing of the dataset.

Interest

Because of the vastly increasing variety of corpora, text types and use cases, it becomes more and more difficult to assess the adequacy and quality of certain web data for given research objectives. A central operation in corpus construction consists in retaining the desired content while discarding the rest, a task which goes by many names referring to particular subtasks or to the whole process: web scraping, boilerplate removal or boilerplate detection, web page template detection, web page cleaning, or web content extraction – for a recent overview see Lejeune & Zhu (2018).
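As a minimal illustration of this operation (the library choice and the URL are merely examples, not the tools evaluated in the post), boilerplate removal can be sketched like this:

    # Minimal boilerplate-removal sketch with jusText; the URL is a placeholder.
    import urllib.request
    import justext

    url = "https://www.example.org/article.html"
    html = urllib.request.urlopen(url).read()

    # classify paragraphs and keep only those not marked as boilerplate
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    main_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
    print(main_text)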

Recently, approaches using the CommonCrawl have flourished, as they allow for faster download and processing by skipping (or more precisely outsourcing) the crawling phase. While I think that finding one’s …

more ...

Franco-German workshop series on the historical illustrated press

I wrote a blog post on the Franco-German conference and workshop series I am co-organizing with Claire Aslangul (University Paris-Sorbonne) and Bérénice Zunino (University of Franche-Comté). The three events planned revolve around the same topic: the illustrated press in France and Germany from the end of the 19th to the middle of the 20th century, drawing from disciplinary fields as diverse as visual history and computational linguistics. A first workshop will take place in Besançon in April, then a larger conference will be hosted by the Maison Heinrich Heine in Paris at the end of 2018, and finally a workshop focusing on methodological issues will take place at the Berlin-Brandenburg Academy of Sciences next year in autumn.

For more information, see this description in German.

more ...

On the creation and use of social media resources

“Emoji analysis”

The necessity to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the researcher community, for example in forensic science, as evidence can be found online and traced back to its author. In a South Park episode (“Fort Collins”, episode 6, season 20), a school girl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult. Although the background story seems somewhat far-fetched, as often with South Park, the logic of the analysis is sound.

General impressions on research trends

I recently went to a workshop on computer-mediated communication and social media. I was impressed by the preponderant role of Twitter data, which is the focus of a significant number of researchers. This is an open field with much research still to be done: there seems to be no clear or widely acknowledged methodology, and there are diverging approaches concerning …

more ...

A module to extract date information from web pages

Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes, ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or to apply robust scraping patterns: there are, for example, web pages for which neither the URL nor the server response provides a reliable way to date the document, that is, to find out when it was written.
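As a minimal usage sketch (assuming the module described here is the htmldate package and using a placeholder URL), extracting such a date can be as short as:

    # Minimal sketch; the module name and the URL are assumptions for illustration.
    from htmldate import find_date

    # returns an ISO-formatted date string or None if no date could be found
    print(find_date("https://www.example.org/post.html"))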

Context

I regularly work on improving the extraction methods for the web collections at my home institutions. They are unique as they combine both the quantity resulting from broad web crawling and the quality obtained by carefully extracting text and metadata as well as rejecting documents that do not match certain criteria. In that sense, I already published work on methods to derive metadata from web documents in order …

more ...

On the interest of social media corpora

Introduction

The necessity to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the researcher community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, episode 6 season 20), a school girl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult.

Workshop

I recently attended a workshop organized by the H2020 project CLARIN-PLUS on this topic. I wrote a blog post on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”.

Ethical remark

In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or which enfold …

more ...