Filtering links to gather texts on the web

The issue with URLs and URIs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI).

Both navigation on the Web and web crawling rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs)” (Berners-Lee et al., 2006). That being said, URLs cannot be expected to be entirely reliable. Especially since the advent of Web 2.0, content on the Web changes faster than ever before: it can be tailored to a particular geographic or linguistic profile and is not stable over time.

Although URLs cannot be expected to be perfect predictors of the content which gets downloaded, they are often the only indication on which crawling strategies can be based. It is therefore useful to identify and discard redundant URIs, that is, different URIs leading to similar text, also known as DUST (Schonfeld et al., 2006). Refining and filtering steps rely on URL components such as host/domain name, path, and parameters/query strings, according to the following scheme:

scheme://host:port/path?query
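
As an illustration of this scheme, the following minimal sketch splits a URL into its components using only Python's standard library. The helper name url_components and the example URL are made up for this post and are not part of any package:

```python
from urllib.parse import urlsplit

def url_components(url):
    """Return the main components of a URL as a dictionary (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,  # lowercased host name, without the port
        "port": parts.port,      # None if no port is given
        "path": parts.path,
        "query": parts.query,
    }

components = url_components("https://www.example.org:8080/news/2020/story.html?id=42&lang=en")
for name, value in components.items():
    print(name, value)
```

Filters can then target each component separately, for example the host for domain-based rules or the query string for duplicate detection.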

Being able to select links more accurately means saving bandwidth for downloads and time for further processing. That implies knowing what hides behind a URL (is it a single news story or just a category overview?) but also trying to remove duplicates beforehand as efficiently as possible, e.g. by using the parameters in the URL.

Link manipulation with Courlan

coURLan is a command-line tool and Python library designed to clean, filter, normalize, and sample URLs. Its primary purpose is to separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text. Its features include:

URL validation and (basic) normalization
Filters targeting spam and unsuitable content-types
Sampling by domain name
Command-line interface (CLI) and Python tool

The tool has been field-tested on millions of URLs. The underlying software library is tested on Linux, macOS and Windows systems and is compatible with Python 3.4 upwards. Python is reportedly the most popular programming language in academia and one of the most popular overall.

Courlan is available on the package repository PyPI. It can notably be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to find and pre-process web sources to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

URL validation

This blog post mostly focuses on the normalization and validation functions.

URL validation in courlan is based on fixed patterns and on the ability of Python’s urllib.parse module to split URLs into components:

>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
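
To give an idea of how such a check can work, here is a rough sketch relying on the standard library only. It is not courlan's actual implementation: the function name basic_url_check and the specific rules (HTTP(S) schemes only, a dot-separated host whose last label is not numeric) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit

def basic_url_check(url):
    """Rough validity check: HTTP(S) scheme and a plausible host name."""
    try:
        parts = urlsplit(url)
    except ValueError:  # raised for some malformed inputs, e.g. broken IPv6 brackets
        return False
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    # require at least one dot in the host name
    if not host or "." not in host:
        return False
    # reject hosts ending in a numeric label (this also excludes IPv4 addresses)
    return not host.rsplit(".", 1)[-1].isdigit()

print(basic_url_check("http://1234"))
print(basic_url_check("http://www.example.org/"))
```

A check of this kind is deliberately strict: it also rejects hosts such as localhost or plain IP addresses, which is usually acceptable when the goal is to collect public web pages.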

Normalizing and cleaning URLs

URI normalization is the process by which URIs are modified and standardized in a consistent manner.

Basic URL scrubbing ensures URLs are comparable:

>>> from courlan import scrub_url
>>> scrub_url('https://en.wikipedia.org/')
'https://en.wikipedia.org'

To make sure that URLs are unique one has to make them comparable, notably by re-ordering the parameters. That way it is possible to determine if two syntactically different URIs may be equivalent. This technique is used by search engines to reduce indexing of duplicate pages as well as by web crawlers which perform URI normalization in order to avoid crawling the same resource more than once.

The model used for the analysis is based on the following components describing pages within a website: the path, optional query strings or parameters, and a potential fragment. The concept of authority refers to the sometimes intricate domain name and host information.

URI = scheme:[//authority]path[?query][#fragment]

authority = [userinfo@]host[:port]

The normalization function discards parts of the parameters and re-orders them alphabetically, while optionally removing fragments:

>>> from courlan import normalize_url
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
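
The general idea behind this kind of normalization can be sketched with the standard library. The code below is a simplified approximation and not the library's own code; the function name normalize_query and the list of tracking prefixes are hypothetical and would have to be adapted to real data.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameter prefixes, for illustration only
TRACKING_PREFIXES = ("utm_", "fbclid", "gclid")

def normalize_query(url, strip_fragment=True):
    """Drop tracking parameters, sort the remaining ones, optionally remove the fragment."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    query = urlencode(sorted(params))  # alphabetical order makes URLs comparable
    fragment = "" if strip_fragment else parts.fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))

urls = [
    "http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment",
    "http://test.net/foo.html?page=2&post=abc",
]
# both variants collapse into a single entry once normalized
unique = {normalize_query(u) for u in urls}
print(unique)
```

Storing normalized URLs in a set is a cheap way to discard duplicates before any download takes place.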

The helper function clean_url chains the scrubbing and normalization parts:

>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'
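
For readers who want to see what such a cleaning step involves, the following sketch lowercases the scheme and host, drops the default port of the scheme and removes trailing slashes. It only uses the standard library, the name canonical_host is made up, and it does not try to reproduce courlan's exact behaviour: for instance, it keeps a port that does not match the default of the scheme and it discards user information.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonical_host(url):
    """Lowercase scheme and host, drop default ports and trailing slashes."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # note: userinfo is dropped in this sketch
    # keep the port only if it differs from the default of the scheme
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))

print(canonical_host("HTTP://WWW.Example.ORG:80/"))
print(canonical_host("https://example.org:8443/a/"))
```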

Discriminating between external and internal URLs

Another useful feature is determining whether URLs lead to another host or authority. The following function determines if a link is internal or external:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
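
A much simpler comparison can be written with the standard library alone, as in the sketch below. The names bare_host and is_other_host are hypothetical, and the approach only compares full host names after removing a leading "www.". It knows nothing about domain suffixes, so it would treat google.com and google.co.uk as different hosts in all cases; comparing registered domains properly requires a list of public suffixes.

```python
from urllib.parse import urlsplit

def bare_host(url):
    """Return the lowercased host name without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def is_other_host(url, reference):
    """True if the two URLs point to different hosts (ignoring 'www.')."""
    return bare_host(url) != bare_host(reference)

print(is_other_host("https://github.com/", "https://www.microsoft.com/"))
print(is_other_host("https://www.example.org/a", "https://example.org/b"))
```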

Usage on the command-line

Finally, it is also possible to use courlan on the command-line:

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help  # displays all available functionality

References

Tim Berners-Lee, Wendy Hall, James A. Hendler, Kieron O’Hara, Nigel Shadbolt, and Daniel J. Weitzner (2006). A Framework for Web Science. Foundations and Trends in Web Science, 1(1):1–130.
Uri Schonfeld, Ziv Bar-Yossef, and Idit Keidar (2006). Do Not Crawl in the DUST: Different URLs with Similar Text. Proceedings of the 15th International Conference on World Wide Web, pp. 1015–1016.

See also a previous blog post: Rule-based URL cleaning for text collections