How can web sources be found?

Getting a list of web pages to start from is essential in order to build document collections, which linguists often call web corpora. Such lists of links (also known as URL lists) can be used in two different ways: first, to build a corpus straight from the pages they link to, and second, to start web crawls and hopefully discover further relevant pages. In the first case, corpus sources are restricted to a fixed list; in the second, one looks opportunistically for more content without knowing everything in advance.

The question of web corpus sources does not stop there. One does not necessarily know where to look for interesting websites, i.e. “seeds” to start from. Two answers are frequently found in the literature: either one initiates a web crawling phase with a (small or large) list of websites, or one draws on already existing link collections. Both strategies can also complement each other and be used alternately during different phases of corpus construction.

This post describes an easy and modern way to gather web sources using search engines by adapting the BootCaT method, whose positive and negative aspects are discussed below. The tutorial shows how to find web pages by using search engines as a URL directory and taking the results that come up first. This approach indirectly benefits from the web page classification performed by search engines, e.g. language-based targeting, but it also inherits their (possibly unknown) biases.

The BootCaT method

Description

The BootCaT approach (Baroni & Bernardini 2004) rests on the assumption that randomly generated search engine queries lead to mixed, cross-domain text collections, whereas domain-specific keywords lead to focused collections. The queries consist of several randomly combined words, called word seeds. As a result, seed URLs (links) are gathered and used as a starting point.

The words initially come from a pre-existing list. Later in the collection process they can be extracted from the gathered corpus itself if necessary. This phase is also called unigram extraction, as words are directly sampled from the collected texts and (possibly randomly) combined.
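As a rough illustration of that extraction phase, the following sketch counts word frequencies in already collected texts and keeps the most frequent long words as new seeds. The tokenization, length threshold and cut-off are simplified assumptions, not part of the original method.

import re
from collections import Counter

def extract_seed_words(texts, n=50, min_length=5):
    # crude tokenization; a real project would use a proper tokenizer
    counter = Counter()
    for text in texts:
        counter.update(word.lower() for word in re.findall(r'\w+', text)
                       if len(word) >= min_length)
    # keep the n most frequent words as seeds for the next round of queries
    return [word for word, _ in counter.most_common(n)]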

The relevant results are usually the first links returned by a search engine, whatever they happen to be, and they may be as numerous as allowed by the access point or API. These restrictions can also evolve over time or be adapted to the construction strategy.

General issues

Two potential issues are common to most web corpus building projects.

The validity and relevance of the collected links rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs)” (Berners-Lee et al., 2006). Not much can be done in that regard, though, as we need URLs to start from.

In addition, the actual contents of a web corpus can only be listed with certainty once the corpus is complete. Its adequacy, focus and quality have to be assessed in a post hoc evaluation (Baroni et al., 2009). This is not specific to BootCaT corpora but a general issue.

Specific issues

When it works, BootCaT can be a convenient way to rapidly get hold of text resources. However, a few issues specific to this method are worth mentioning.

Most notably, the querying process can be cumbersome due to the increasing limitation and commodification of mechanical search engine access. In practical terms, the method may be too expensive and/or too unstable over time to support corpus building projects.

Other technical difficulties include diverse and partly unknown search biases related to search engine optimization tricks (on the side of the indexed websites) as well as undocumented adjustments of indexing algorithms (on the search engines’ side). Research reproducibility also cannot be guaranteed.

I elaborate on these aspects in two other blog posts: “Challenges in web corpus construction for low-resource languages” and “Finding viable seed URLs for web corpora”.

Corpus linguistics

Speaking from a corpus linguist’s perspective, the question of whether the BootCaT method provides a good overview of a language remains open.

Poorly performing random word seeds cannot be clearly predicted or assessed in advance. There are also a number of potential caveats which are difficult to evaluate, for instance regarding text types and text quality. It is also conceivable that companies running search engines have different priorities than humanities researchers.

Experiments tend to show that carefully selected sources are more efficient than URLs taken from search engine results, both initially and in the long run, that is, in the course of web crawls (Schäfer et al. 2014).

Practical how-to

Despite the shortcomings described above, there are valid reasons to gather texts this way. So here is how to make this method work in a straightforward and modular way. The following steps should be easy to reproduce and to modify if necessary. Some of them rely on the Python programming language, but it should be easy to find equivalents in other major languages.

1. Word list

First, you need a list of words in the target language(s). For general-purpose corpora almost any list could do, optionally restricted to certain word categories such as nouns; for German see for instance the DWDS list. Using words revolving around certain topics should hopefully lead to web pages addressing those topics.

To sum up, word lists can be filtered according to grammatical or topical criteria. You can also write a list by hand; bear in mind that it should be long enough to allow for diverse word combinations in the next step.
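As a minimal sketch, assuming the list is stored one word per line in a plain text file (file name and length threshold are arbitrary placeholders):

# read one word per line from a plain text file
with open('wordlist.txt', encoding='utf-8') as f:
    wordlist = [line.strip() for line in f if line.strip()]

# optional filter: discard very short entries, which are often function words
wordlist = [word for word in wordlist if len(word) > 4]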

2. Random word combinations

The next step consists in preparing queries by drawing random word combinations from the list gathered in the previous step. Such series of randomly combined words are called word seeds in the literature.

Here is how to draw random word series with Python:

>>> import random
>>> # start from the custom word list obtained in (1)
>>> wordlist = ['word1', 'word2', 'word3', 'word4']  # and so on
>>> # draw 3 distinct random words from the list (sampling without replacement)
>>> selection = random.sample(wordlist, k=3)
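Repeating the draw and joining each tuple into a query string yields a whole batch of queries; the numbers below (10 queries of 3 words each) are arbitrary choices:

>>> # build a batch of query strings from random word tuples
>>> queries = [' '.join(random.sample(wordlist, k=3)) for _ in range(10)]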

3. URLs from search engines

Then you have to extract URLs from the results of search engine queries made with the random word tuples defined in (2).

Here are examples of Python modules to query search engines: search-engine-parser and GoogleScraper. They seem fairly popular at the moment.

One of the main drawbacks of the BootCaT method is that it is not stable over time: both search engines and scraper modules may stop working as intended. In that case it becomes necessary to look for alternatives, for instance by searching for concepts like “SERP” and “search engine scraping”.
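Since the exact interfaces of such modules change over time, here is only a generic sketch based on a hypothetical SERP API; the endpoint, parameter names and response format are placeholders to be adapted to whichever service or module is actually used:

import requests

def get_result_urls(query, api_url='https://serp.example.org/search', max_results=10):
    # hypothetical endpoint and parameters, to be replaced by a real service
    response = requests.get(api_url, params={'q': query, 'limit': max_results}, timeout=30)
    response.raise_for_status()
    # assumption: the service returns JSON containing a list of result objects
    return [result['url'] for result in response.json().get('results', [])]

# gather seed URLs for all queries prepared in (2)
seed_urls = {url for query in queries for url in get_result_urls(query)}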

4. Download and processing

Finally, this is where my toolkit for web corpus construction can shine: all you need is a list of URLs, and Trafilatura will do the rest. This lightweight software package works with Python or on the command line. It retrieves the pages and extracts texts and metadata, which can be stored in various common formats (CSV, JSON, TXT, XML). You can then work with the documents using the tools of your choice.

To download and process the link list, see the usage documentation.
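As a minimal sketch in Python, using the fetch_url and extract functions on the URLs gathered in step 3 (default extraction settings, no parallel downloads; see the documentation for further options and the command-line equivalents):

from trafilatura import fetch_url, extract

documents = []
for url in seed_urls:
    downloaded = fetch_url(url)     # returns None if the download fails
    if downloaded is not None:
        text = extract(downloaded)  # main text extraction with default settings
        if text:
            documents.append(text)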

Mitigating potential issues

For relatively small and focused corpora, human supervision is key: it is advisable to keep an eye on all steps of corpus construction. This does not only apply to this method, but here it would notably include running some of the random queries oneself and checking the results.

Screening and refining the lists of URLs you use for your projects can also enhance corpus quality; see for example the concept of URL sampling (Henzinger et al. 2000), the implementation details in the papers mentioned below, and the filtering tool courlan.
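Here is a short sketch of such a filtering step with courlan, assuming a list of candidate_urls gathered earlier; the language setting is only an example and the exact return values may differ between versions:

from courlan import check_url

filtered_urls = []
for url in candidate_urls:
    result = check_url(url, language='de')  # returns None if the URL is rejected
    if result is not None:
        cleaned_url, domain = result
        filtered_urls.append(cleaned_url)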

Further reading

As with “traditional” corpora, web corpora can either focus on a given range of websites and topics, or be merely language-minded and opportunistically take all kinds of texts into account. In the latter case, using diverse sources for URL seeds can help mitigate potentially unknown biases.

For more information, see the corpus construction tool’s documentation page on finding URLs for web corpora.

References

  • Barbaresi, A. (2015). Ad hoc and general-purpose corpus construction from web sources (Doctoral dissertation, ENS Lyon).
  • Barbaresi, A. (2021). Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction, Proceedings of ACL/IJCNLP 2021: System Demonstrations, p. 122-131.
  • Baroni, M., & Bernardini, S. (2004). BootCaT: Bootstrapping Corpora and Terms from the Web. In Proceedings of LREC 2004 (pp. 1313-1316).
  • Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 43(3), 209-226.
  • Berners-Lee, T., Hall, W., & Hendler, J. A. (2006). A framework for web science. Foundations and Trends in Web Science, 1(1), 1-130.
  • Henzinger, M. R., Heydon, A., Mitzenmacher, M., & Najork, M. (2000). On near-uniform URL sampling. Computer Networks, 33(1-6), 295-308.
  • Schäfer, R., Barbaresi, A., & Bildhauer, F. (2014). Focused web corpus crawling. In Proceedings of the 9th Web as Corpus workshop (WAC-9), pp. 9-15.