In this post, I would like to come back to a seminal article on corpus linguistics and lexicography: Adam Kilgarriff’s “Googleology is bad science”. I summarize and discuss the points it raises, show which challenges arise, and outline how alternative solutions can be developed.

The article dates back to 2007, but it has not lost its relevance: the questions it raises about research in linguistics and lexicography also concern various other fields across the social sciences and humanities. Kilgarriff sums up issues related to the “low-entry-cost way to use the Web” which commercial search engines represent. To address these concerns, he shows how an alternative can be developed within the academic community.

Issues

This section summarizes points raised in the article (Kilgarriff 2007).

Kilgarriff notes that “if the work is to proceed beyond the anecdotal, a range of issues must be addressed.” Three different kinds of issues can be distinguished:

  1. Queries:
    • There are constraints on numbers of queries and numbers of hits per query
    • The search syntax is limited
    • Lack of linguistic annotation: the commercial search engines do not lemmatise or part-of-speech tag
  2. Results:
    • Search hits are for pages, not for instances
    • Even if it is just to use the URLs, the hits are sorted according to a complex and unknown algorithm
  3. Counts:
    • Search engine counts are arbitrary

To conclude, Kilgarriff notes that investigating these biases in search of workarounds belongs to the realm of “googleology”, not to linguistic research.

The alternative: Work like search engines

The numerous biases and blind spots were (and sometimes still are) unaccounted for in scientific publications. The solution outlined in the article resides in replicating the background work performed by search engines, but in the controlled environment of a research lab.

“An alternative is to work like the search engines, downloading and indexing substantial proportions of the World Wide Web, but to do so transparently, giving reliable figures, and supporting language researchers’ queries.”

General view of the process

The whole process can be called “web corpus construction”. It goes all the way from gathering linguistic data on the Web to making them available through a query interface. A series of steps is necessary to get there; they form the basis and skeleton of a web corpus infrastructure.

“The process involves crawling, downloading, ‘cleaning’ and de-duplicating the data, then linguistically annotating it and loading it into a corpus query tool.”

Some of the steps are more important than others:

  • Crawling and downloading data could be merged as one big step.
  • De-duplication can also be considered part of text cleaning; both are sometimes grouped into a general corpus pre-processing phase.
  • Certain corpus linguistics and NLP tools can load the data and annotate it for you, so that the last two steps are not clearly separated from each other.

Overall, three distinct phases can be distinguished, all quite important in their own right:

  1. Web crawling determines the range and the general contents of a web corpus.
  2. Data pre-processing impacts all the other steps downstream.
  3. Linguistic annotation and query tools give the data its profile: they can make certain features noticeable while blurring others.

Comments worth noting

Web crawling and downloads

A crawler is a computer program that automatically and systematically visits web pages. Crawling means sending robots across the Web to “read” web pages and collect information about them. It is a research field in itself and also the groundwork on which search engines are built.

In this regard, these steps are anything but trivial: they determine how many documents are taken into consideration and how extensive the download phase will be. For example, one can be more opportunistic and gather more data by making a compromise on text quality, or restrict data collection to previously examined web pages. Big data approaches tend to use the former way, traditional corpus linguistics the latter.
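To illustrate what this step involves, here is a minimal sketch of a breadth-first crawl loop using only the Python standard library; the seed list, the page limit and the naive link extraction are placeholders, not a production setup.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
import re

# Hypothetical seed list: starting points chosen by the researcher
SEEDS = ["https://example.org/"]
MAX_PAGES = 100  # stop condition for this sketch

def crawl(seeds, max_pages=MAX_PAGES):
    """Breadth-first crawl: visit pages, collect links, avoid revisits."""
    frontier = deque(seeds)
    seen = set(seeds)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable pages
        pages[url] = html
        # naive link extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```

A real crawler additionally respects robots.txt, spreads requests over time and filters the frontier, which is where the politeness and link-handling questions below come in.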

Massively downloading web pages can be a challenge, for a general overview and practical hints see How to download web pages in parallel and follow politeness rules in Python.
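As a complement to the post linked above, here is a minimal sketch of parallel downloads with a per-host delay; the URL list, the delay value and the worker count are arbitrary, and a production setup would also honour robots.txt.

```python
import time
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
from urllib.request import urlopen

DELAY = 2.0                    # minimum seconds between two requests to the same host
last_hit = defaultdict(float)  # next time slot reserved per host
lock = threading.Lock()

def polite_fetch(url):
    """Download one page, waiting if the host was contacted too recently."""
    host = urlparse(url).netloc
    with lock:
        wait = DELAY - (time.monotonic() - last_hit[host])
        last_hit[host] = time.monotonic() + max(wait, 0)
    if wait > 0:
        time.sleep(wait)
    try:
        return url, urlopen(url, timeout=10).read()
    except OSError:
        return url, None  # record the failure instead of crashing the pool

# hypothetical URL list
urls = ["https://example.org/a", "https://example.org/b"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(polite_fetch, urls))
```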

To better control corpus contents, it is useful to take lists of links (to be downloaded or already in the corpus) into account and operate on them, e.g. by filtering or sampling them. See the posts on filtering links and content-aware URL filtering.
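Here is a rough illustration of such operations on a plain list of URLs; the discarded extensions and the per-domain cap are arbitrary choices made for the example.

```python
import random
from urllib.parse import urlparse

# Extensions unlikely to lead to extractable text (illustrative list)
UNWANTED = (".jpg", ".png", ".gif", ".pdf", ".zip", ".mp3", ".mp4", ".css", ".js")

def filter_links(urls):
    """Keep http(s) links that do not point to media or binary files."""
    kept = []
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue
        if parsed.path.lower().endswith(UNWANTED):
            continue
        kept.append(url)
    return kept

def sample_per_domain(urls, limit=10, seed=42):
    """Cap the number of URLs per domain to keep the corpus balanced."""
    by_domain = {}
    for url in urls:
        by_domain.setdefault(urlparse(url).netloc, []).append(url)
    rng = random.Random(seed)
    sample = []
    for domain_urls in by_domain.values():
        rng.shuffle(domain_urls)
        sample.extend(domain_urls[:limit])
    return sample
```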

Text cleaning and pre-processing

Text cleaning in Web collections is too often overlooked. As Kilgarriff indicates:

“Cleaning is a low-level, unglamorous task, yet crucial: The better it is done, the better the outcomes. All further layers of linguistic processing depend on the cleanliness of the data.”

When confronted with web pages, the main issues affecting the content can be summarized as follows:

  • How do we detect and get rid of navigation bars, headers, footers, etc.?
  • How do we identify metadata, paragraphs and other structural information?
  • How do we produce output in a standard form suitable for further processing?

At site level, recurring elements are called boilerplate. Removing them avoids hundreds of occurrences of phrases like “back to the main page” or “Copyright 2022 (site name)”.
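As a simple illustration, boilerplate can be reduced with rule-based filtering; the sketch below relies on lxml and a hand-picked list of tag names, which is far less robust than a dedicated extractor.

```python
# A rough rule-based sketch using lxml; the tag names targeted here are
# common carriers of boilerplate, not an exhaustive or definitive list.
from lxml import html

BOILERPLATE_TAGS = ("nav", "header", "footer", "aside", "form", "script", "style")

def strip_boilerplate(page_source):
    """Drop typical boilerplate elements and return the remaining text."""
    tree = html.fromstring(page_source)
    for tag in BOILERPLATE_TAGS:
        for element in list(tree.iter(tag)):
            element.drop_tree()
    return tree.text_content()
```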

Preserving some elements of the page structure can be useful to distinguish main text, quotes and comments. Authorship is definitely meaningful in a humanities context. Metadata such as the page title or the publication date are also quite relevant.

Concrete solutions

Data collection and pre-processing

Trafilatura is a Python software package and command-line tool which I designed to address the data collection issues described above.

It seamlessly downloads, parses, and scrapes web page data: it can extract text and metadata while preserving parts of the text formatting and page structure. This lightweight package does not get in your way but acts as a modular toolkit: no database is required, and the output can be converted to several commonly used formats.
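A minimal usage sketch looks as follows; the URL is a placeholder, and option names may vary slightly between versions.

```python
import trafilatura

# placeholder URL; any publicly reachable page would do
downloaded = trafilatura.fetch_url("https://example.org/article")
if downloaded is not None:
    # main text only, comments left out
    text = trafilatura.extract(downloaded, include_comments=False)
    # structured output preserving some formatting and document structure
    xml = trafilatura.extract(downloaded, output_format="xml")
    print(text)
```

The same kind of extraction is available from the command line, e.g. with `trafilatura -u "https://example.org/article"`.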

Its intended audience encompasses disciplines where collecting web pages is an important data collection step, notably linguistics, natural language processing and the social sciences. In general, it is relevant for anyone interested in gathering texts from the Web.

For more information, please refer to the documentation and notably the part on automated web crawling.

Work with the gathered data

After gathering texts from the Web, what comes next? The programming languages Python and R are well-suited for data analysis: they come with a whole series of software packages dedicated to linguistic processing and statistics, and they allow for the development of modular solutions which can better fit one’s needs. On the other hand, a series of corpus analysis tools is also available; they mostly work off-the-shelf and provide a range of commonly used functions.
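As a starting point in Python, word frequencies can be derived from the extracted files with the standard library alone; the directory name below is a placeholder and the tokenization is deliberately crude.

```python
from collections import Counter
from pathlib import Path
import re

# hypothetical directory containing the extracted .txt files
corpus_dir = Path("corpus_output")

counts = Counter()
for filepath in corpus_dir.glob("*.txt"):
    text = filepath.read_text(encoding="utf-8")
    # crude tokenization; real pipelines use proper tokenizers and taggers
    counts.update(re.findall(r"\w+", text.lower()))

print(counts.most_common(20))
```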

This documentation page lists options to work with output generated by the software mentioned above: Working with corpus data, from common formats to software tools used in corpus linguistics, natural language processing and data science.