Filtering links to gather texts on the web

The issue with URLs and URIs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI).

Both navigation on the Web and web crawling rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs).” (Berners-Lee et al., 2006) That being said, URLs cannot be expected to be entirely reliable: especially since the advent of Web 2.0, content on the Web changes faster than ever before, can be tailored to a particular geographic or linguistic profile, and is not stable over time.

Although URLs cannot be expected to be perfect predictors of the content which gets downloaded, they are often the only indication according to which crawling strategies are developed. It can be very useful to identify and discard redundant URIs, that is different URIs leading to similar text, also known as DUST (Schonfeld et al. 2006). Refining and filtering steps rely on URL components such as host/domain name, path and parameters/query …
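As a rough illustration of what such filtering can look like in practice, here is a minimal sketch that normalizes URLs and discards duplicates using only the Python standard library; the list of tracking parameters and the normalization rules are illustrative assumptions, not the exact heuristics used in production tools.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Illustrative assumption: query parameters that rarely change the retrieved content
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}

def normalize_url(url):
    """Strip fragments and known tracking parameters to expose DUST candidates."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

def deduplicate(urls):
    """Keep one URL per normalized form, preserving input order."""
    seen, kept = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept

print(deduplicate([
    "https://example.org/post?id=1&utm_source=feed",
    "https://example.org/post?id=1",
]))  # both normalize to the same URL, so only the first one is kept
```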

more ...

Evaluation of date extraction tools for Python

Introduction

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult, and an important problem remains as to the most efficient way to gather language data. Metadata extraction is part of data mining and knowledge extraction techniques. Dates are critical components since they are relevant both from a philological standpoint and in the context of information technology.

In most cases, the immediately accessible data on retrieved web pages do not carry substantial or accurate information: neither the URL nor the server response provides a reliable way to date a web document, i.e. to find out when it was written or modified. In such cases it is necessary to fully parse the document or to apply robust scraping patterns to it.

State of the art

Diverse extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Content extraction mostly draws on Document Object Model (DOM) examination, that is on considering a given HTML document as a tree structure whose nodes represent parts of the document to be operated on. Less thorough and not necessarily faster alternatives use superficial search patterns such as regular expressions in order to capture desirable excerpts …
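To make the contrast concrete, here is a toy comparison of the two families of approaches on a single HTML snippet; the chosen meta attribute is one common convention (the Open Graph article namespace), not the only place where dates can hide.

```python
import re
from lxml import html

DOC = ('<html><head><meta property="article:published_time" '
       'content="2017-05-03T09:00:00"></head>'
       '<body><p>Posted on 3 May 2017</p></body></html>')

# DOM examination: treat the document as a tree and query dedicated metadata nodes
tree = html.fromstring(DOC)
print(tree.xpath('//meta[@property="article:published_time"]/@content'))
# ['2017-05-03T09:00:00']

# Superficial search pattern: a regular expression over the raw markup
print(re.findall(r'\d{4}-\d{2}-\d{2}', DOC))
# ['2017-05-03']
```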

more ...

Evaluating scraping and text extraction tools for Python

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult. They come in different shapes and sizes, mostly because of the wide variety of platforms and content management systems, and not least because of the varying motivations and diverging goals behind web publication.

This wide variety of contexts and text genres leads to important design decisions during the collection of texts: should the tooling be adapted to the particular news outlets or blogs that are targeted (which often amounts to the development of web scraping tools), or should the extraction be as generic as possible in order to provide opportunistic ways of gathering information? Given the limited time and resources available in academia and elsewhere, the second option is often the most practical.

Consequently, an important problem remains as to the most efficient way to gather language data. Between CMS idiosyncrasies, bulky pages and malformed HTML, the chosen solution has to be precise, robust and fast at the same time. The purpose of this evaluation is to test currently available alternatives with respect to particular needs for coverage and speed.
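The core of such a benchmark can be summarized in a few lines; the sketch below assumes a dictionary of extraction functions and a gold standard of manually cleaned texts, and the token-overlap score is a deliberately crude stand-in for the actual evaluation metrics.

```python
import time

def run_benchmark(extractors, documents, gold):
    """Time each extractor and compute a crude token-overlap score against a gold standard.

    extractors: name -> function taking an HTML string and returning text (or None)
    documents:  doc_id -> raw HTML string
    gold:       doc_id -> manually cleaned reference text
    """
    results = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        scores = []
        for doc_id, html_string in documents.items():
            predicted = set((extract(html_string) or "").split())
            expected = set(gold[doc_id].split())
            scores.append(len(predicted & expected) / len(expected) if expected else 0.0)
        results[name] = {
            "mean_overlap": sum(scores) / len(scores),
            "seconds": time.perf_counter() - start,
        }
    return results
```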

The current benchmark focuses on Python, reportedly the most popular programming language in academia and one of …

more ...

Extracting the main text content from web pages using Python

Web corpus construction involves a significant number of design decisions and turning points in data processing. Depending on the purpose of data collection, it may also require substantial filtering and quality assessment. While some large-scale algorithms can be expected to smooth out irregularities, uses requiring a low margin of error and close reading approaches (such as the search for examples in lexicographic research) imply constant refinements and improvements with respect to the building and processing of the dataset.

Interest

Because of the vastly increasing variety of corpora, text types and use cases, it becomes more and more difficult to assess the adequacy and quality of certain web data for given research objectives. A central operation in corpus construction consists in retaining the desired content while discarding the rest, a task which has many names referring to particular subtasks or to the whole: web scraping, boilerplate removal or boilerplate detection, web page template detection, web page cleaning, or web content extraction – for a recent overview see Lejeune & Zhu (2018).
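To give a purely illustrative idea of what this cleaning step involves (and not of the algorithms actually used by state-of-the-art tools), a naive heuristic can keep only the paragraphs that contain enough text and are not dominated by links; the thresholds below are arbitrary assumptions.

```python
from lxml import html

def naive_content_blocks(html_string, min_length=80, max_link_density=0.3):
    """Return paragraph texts that look like main content rather than boilerplate."""
    tree = html.fromstring(html_string)
    kept = []
    for node in tree.xpath('//p'):
        text = node.text_content().strip()
        if len(text) < min_length:
            continue                      # too short: likely a menu item or caption
        link_text = sum(len(a.text_content()) for a in node.xpath('.//a'))
        if link_text / len(text) > max_link_density:
            continue                      # mostly links: likely navigation
        kept.append(text)
    return "\n".join(kept)
```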

Recently, approaches using the Common Crawl have flourished, as they allow for faster download and processing by skipping (or more precisely outsourcing) the crawling phase. While I think that finding one’s …

more ...

A module to extract date information from web pages

Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or to apply robust scraping patterns: there are, for example, web pages for which neither the URL nor the server response provides a reliable way to date the document, that is to find out when it was written.
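A very condensed sketch of what such parsing can look like; the attributes targeted below are common conventions found in the wild and by no means a complete list, and the code is not a description of the module presented here.

```python
from lxml import html

# Frequent hiding places for metadata in the document head (illustrative selection)
META_XPATHS = {
    "author":     '//meta[@name="author"]/@content',
    "date":       '//meta[@property="article:published_time"]/@content',
    "categories": '//meta[@property="article:section"]/@content',
}

def extract_metadata(html_string):
    """Collect whichever of the targeted fields are present in the document."""
    tree = html.fromstring(html_string)
    metadata = {}
    for field, xpath in META_XPATHS.items():
        values = tree.xpath(xpath)
        if values:
            metadata[field] = values[0]
    return metadata
```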

Context

I regularly work on improving the extraction methods for the web collections at my home institutions. They are unique in that they combine the quantity resulting from broad web crawling with the quality obtained by carefully extracting text and metadata and by rejecting documents that do not match certain criteria. In that sense, I have already published work on methods to derive metadata from web documents in order …

more ...

On the interest of social media corpora

Introduction

The necessity to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the researcher community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, episode 6 season 20), a school girl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes that this person cannot be the suspected primary school student but has to be an adult.

Workshop

I recently attended a workshop organized by the H2020-project CLARIN-PLUS on this topic. I wrote a blog post on the CLARIN blog: Reflections on the CLARIN-PLUS workshop “Creation and Use of Social Media Resources”

Ethical remark

In any case, gathering CMC data in one place and making it accessible on a massive scale to scientific apparatuses (for example indexing or user-related metadata) understandably raises concerns related to the human lives and interactions which are captured by, hidden in, or which enfold …

more ...

Collection and indexing of tweets with a geographical focus

This paper introduces a geographically focused Twitter corpus, built in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.
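As a minimal illustration of the selection step, the sketch below filters already collected tweets by a bounding box; the simplified dictionary format and the example coordinates are assumptions made for the sake of the example.

```python
def in_bounding_box(tweet, south, west, north, east):
    """Return True if the tweet carries point coordinates inside the target region.

    Expects a simplified dict with 'coordinates' set to (longitude, latitude) or None.
    """
    coords = tweet.get("coordinates")
    if not coords:
        return False
    lon, lat = coords
    return south <= lat <= north and west <= lon <= east

# Hypothetical target region given as (south, west, north, east)
REGION = (47.2, 9.5, 55.1, 15.0)

def filter_tweets(tweets):
    """Keep only geolocated tweets falling inside the region of interest."""
    return [t for t in tweets if in_bounding_box(t, *REGION)]
```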

Why Twitter?

“To do linguistics on texts is to do botany on a herbarium, and zoology on the remains of more or less well-preserved animals.”

“Faire de la linguistique sur des textes, c’est faire de la botanique sur un herbier, de la zoologie sur des dépouilles d’animaux plus ou moins conservées.”
Inaugural speech from Charles de Tourtoulon at the Académie des sciences, agriculture, arts et belles lettres, Aix-en-Provence, 1897. (For a detailed interpretation see the introduction of my PhD thesis)

Practical reasons

  • (Lui & Baldwin 2014)

    • A frontier area due to their dissimilarity with existing corpora
  • (Krishnamurthy et al. 2008)

    • Availability and ease of use
    • Immediacy of the information presented
    • Volume and variability of the data contained
    • Presence of geolocated messages

My 2013 study concerning other social networks (Crawling microblogging services to gather language-classified URLs …

more ...

Analysis of the German Reddit corpus

I would like to present work on the major social bookmarking and microblogging platform Reddit, which I recently introduced at the NLP4CMC workshop 2015. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.

Basic idea

The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on the platform archive.org at the beginning of July 2015 and claimed that it contained every publicly available comment.

Corpus construction

In order to focus on German comments, I use a two-tiered filter designed to deliver a hopefully well-balanced compromise between speed and accuracy. The first filter uses a spell-checking algorithm (provided by the enchant library), and the second relies on my language identification tool of choice, langid.py.
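The logic of this two-tiered filter can be sketched as follows; the token-ratio threshold and the exact decision rule are simplified assumptions rather than a verbatim copy of the pipeline used for the corpus.

```python
import enchant   # spell-checking backend (pyenchant)
import langid    # language identification (langid.py)

GERMAN_DICT = enchant.Dict("de_DE")

def looks_german(comment, min_known_ratio=0.5):
    """Two-tiered check: a fast dictionary pass, then langid.py on the survivors."""
    tokens = [token for token in comment.split() if token.isalpha()]
    if not tokens:
        return False
    # First tier: share of tokens accepted by the German spell-checker
    known = sum(1 for token in tokens if GERMAN_DICT.check(token))
    if known / len(tokens) < min_known_ratio:
        return False
    # Second tier: confirm the guess with a statistical language identifier
    language, _score = langid.classify(comment)
    return language == "de"
```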

The corpus is comparatively small (566,362 tokens), due to the fact that Reddit is almost exclusively an English-speaking platform. The proportion of tokens tagged as proper nouns (NE) is particularly high (14.4%), which illustrates how much the tagger struggles with this data, for example because redditors refer to trending and possibly short-lived notions and celebrities …

more ...

Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I presented a paper of mine, Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources, and another, co-authored with Felix Bildhauer and Roland Schäfer, Focused Web Corpus Crawling.

Summary

The comparison started from web crawling experiments I performed at the FU Berlin. The fact is that the conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easily gathered that way until search engine companies restricted this kind of access, meaning that one now has to pay and/or wait longer to send queries.

I tried to evaluate the leading approach and to find decent substitutes using social networks as well as the Open Directory Project and Wikipedia. I take four different languages (Dutch, French, Indonesian and Swedish) as examples in order to compare several web spaces with different if not opposed characteristics.

My results show no clear winner: complementary approaches are called for, and it seems possible to replace or at least complement the existing BootCaT approach. I think that crawling problems such as link/host diversity have not …

more ...

Challenges in web corpus construction for low-resource languages

I recently presented a paper at the third LRL Workshop (a joint LTC-ELRA-FLaReNet-META-NET workshop on “Less Resourced Languages, new technologies, new challenges and opportunities”).

Motivation

The state of the art tools of the “web as corpus” framework rely heavily on URLs obtained from search engines. Recently, this querying process became very slow or impossible to perform on a low budget.

Moreover, there are diverse and partly unknown search biases related to search engine optimization tricks and undocumented PageRank adjustments, so that drawing on diverse sources of URL seeds could at least ensure that there is not a single bias, but several. Lastly, the evolving structure of web documents and a shift from “web AS corpus” to “web FOR corpus” (an increasing number of web pages and the necessity to use sampling methods) complete what I call the post-BootCaT world in web corpus construction.

Study: What are viable alternative data sources for lesser-known languages?

Trying to find reliable data sources for Indonesian, a language spoken in a country with a population of 237,424,363, of which 25.90% were internet users (2011, official Indonesian statistics institute), I performed a case study of different kinds of URL sources and crawling strategies.

First, I classified URLs extracted …

more ...