Why filter URLs?

Following up on a previous post (Filtering links to gather texts on the web), I’d like to say a bit more about the URL filtering utility I am working on. I came to realize that although there are existing libraries performing normalization operations on URLs, there is no comparable tool driven by the needs of text research, in particular concerning internationalization and language-based filtering.

My impression is that one could use some kind of additional brain during crawling in order to refine the crawl frontier, that is the (priority) queue storing links selected for further page visits. Ideally, the URLs in the queue are constantly prioritized and filtered so as to maximize throughput.

Enter Courlan

The idea behind the courlan library is to help web crawlers and web archives alike to manage their resources better by targeting particular web pages, that is text-based HTML documents, optionally in a target language, or even by strictly excluding certain domains or spam patterns.

Whether you have an existing link collection or are actively looking for new links, this navigational help targets text-based documents (i.e. currently web pages expected to be in HTML format) and tries to guess the language of pages to allow for language-focused collection. Additional functions include straightforward domain name extraction and URL sampling. The library also provides specific functionality for crawlers: staying away from pages with little text content, or explicitly targeting synoptic pages to gather links.
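
Both of these are single calls; here is a minimal sketch, with a made-up two-URL collection for the sampling part (the positional sample_urls signature shown here is an assumption, check the package documentation):

>>> from courlan import extract_domain, sample_urls
# straightforward domain name extraction
>>> extract_domain('https://www.un.org/en/about-us')
'un.org'
# draw a sample from a larger (made-up) link collection,
# capping the number of URLs kept per domain
>>> my_urls = ['https://www.un.org/en/about-us', 'https://www.un.org/fr/about-us']
>>> sample = sample_urls(my_urls, 1)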

The software allows for focusing on promising URLs, be they pages with text or navigation pages, since a common crawling strategy is to gather links first and the pages of interest afterwards. With that in mind, the library revolves around two different operations (see the sketch after the list below):

  1. The triage of links
    • Targeting spam and unsuitable content-types
    • Language-aware filtering
    • Crawl management
  2. URL handling and normalization
    • Validation
    • Canonicalization/Normalization
    • Sampling
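
To make the two operations concrete, here is a minimal sketch that triages a small, made-up link collection with check_url, which validates, filters and cleans each candidate in one pass (the mailto link is expected to be rejected):

>>> from courlan import check_url
# hypothetical link collection gathered during a crawl
>>> candidates = [
...     'https://www.un.org/en/about-us',
...     'mailto:contact@example.org',      # not a web page, should be rejected
...     'https://www.un.org/en/about-us',  # duplicate
... ]
# triage: keep only links deemed worth visiting, already cleaned
>>> frontier = []
>>> for link in candidates:
...     result = check_url(link, language='en')
...     if result is not None:
...         url, domain = result
...         if url not in frontier:
...             frontier.append(url)
...
>>> frontier
['https://www.un.org/en/about-us']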

Software ecosystem

courlan works both on the command line and from Python. It is part of a software ecosystem designed for web scraping and web crawling. The Python web scraper trafilatura builds upon it in order to better retrieve links from web pages, for instance when starting an automated crawl from a homepage. The date extraction utility htmldate is also part of the bundle.

Tutorial and code examples

Avoid wasting bandwidth capacity and processing time on web pages which are probably not worth the effort. The following provides a tutorial with code snippets for crawling and scraping, as well as for the management of Internet archives.

The software is readily available from the Python Package Index (PyPI), and ongoing work happens on the Courlan GitHub repository; please refer to those for more information.

The following examples demonstrate functions that have recently been added to the software, focusing on web crawling and internationalization. They can be used quite easily; you just need to install the package first: pip install courlan (pip3 where applicable).

Language-aware heuristics

Language-aware heuristics, notably internationalization markers in URLs, are implemented in lang_filter(url, language) and can be used through the language argument of check_url():

>>> from courlan import check_url
# the optional language argument targets web pages in a given language (here English or German)
>>> url = 'https://www.un.org/en/about-us'
# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')
# failure: doesn't return anything
>>> check_url(url, language='de')
>>>
# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>
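
The boolean heuristic itself can also be called directly, which is convenient when the URL has already been cleaned. A minimal sketch, assuming lang_filter is importable from the package top level as the signature above suggests:

>>> from courlan import lang_filter
# True if the URL appears compatible with the target language
>>> lang_filter('https://www.un.org/en/about-us', 'en')
True
>>> lang_filter('https://www.un.org/en/about-us', 'de')
False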

Strict filtering

Define stricter restrictions on the expected content type with strict=True. This setting also blocks certain platforms and page types that crawlers should stay away from if they don’t target them explicitly, as well as other black holes where machines get lost.

# strict filtering
>>> check_url('https://www.twitch.com/', strict=True)
# blocked as it is a major platform

Web crawling and URL handling

Determine if a link leads to another host:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True

Other useful functions dedicated to URL handling:

  • get_base_url(url): strip the URL down to its base, i.e. protocol + host/domain
  • get_host_and_path(url): decompose URLs in two parts: protocol + host/domain and path
  • get_hostinfo(url): extract domain and host info (protocol + host/domain)
  • fix_relative_urls(baseurl, url): prepend necessary information to relative links

Here are examples:

>>> from courlan import get_base_url, get_host_and_path, get_hostinfo, fix_relative_urls
>>> url = 'https://www.un.org/en/about-us'
>>> get_base_url(url)
'https://www.un.org'
>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')
>>> get_hostinfo(url)
('un.org', 'https://www.un.org')
>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'

Other filters dedicated to crawl frontier management:

  • is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context
  • is_navigation_page(url): check for navigation and overview pages

Here is how they work:

>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True

References

URL-based heuristics and URL list processing prior to web crawling have also been discussed in scientific work. Here are references to articles on related questions:

Some of my work on the topic: