Outline

Optimizing downloads is crucial to gather data from a series of websites. However, one should respect “politeness” rules. Here is a simple way to keep an eye on all these constraints at once.

  1. Problem description
  2. Simple downloads
  3. Too simple parallel downloads
  4. Much better: throttling threads
  5. Politeness rules
  6. TL;DR

Problem description

Efficient web data collection

A main objective of data collection over the Internet such as web crawling is to efficiently gather as many useful web pages as possible. One way to reach this goal is to filter the links that are to be fetched in order to maximize their relevance to the data collection project, for example by selecting links corresponding to a series of target domains, a target language, a topic, etc. A previous blog post addresses practical ways to perform URL selection.
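
As a minimal illustration of such filtering (the domain allowlist and URLs below are made up for the example and independent of any framework), a plain standard-library sketch could look like this:

from urllib.parse import urlparse

# hypothetical allowlist of target domains for the collection project
target_domains = {"www.example.org", "httpbin.org"}

candidate_urls = [
    "https://www.example.org/page1.html",
    "https://www.unrelated.com/page2.html",
]

# keep only the links whose host belongs to the allowlist
selected = [url for url in candidate_urls if urlparse(url).netloc in target_domains]
print(selected)  # ['https://www.example.org/page1.html']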

Another way is to maximize throughput by working on download speed and bandwidth capacity. This part is indeed highly relevant as transmitting data over the network is very often slower than further data processing performed locally. As such, optimizing this phase is crucial for anyone wishing to gather data from a series of websites. In order to fetch multiple web pages at once, it makes sense to query as many different domains as possible in parallel.

Potential issues with parallelization

However, a number of issues arise when one gets to the details of the implementation. Massive downloads can be a burden for the network, the target servers or one’s own computers. On the other hand, parallel computing can bring performance problems of its own, for example when available cores are not used to their full capacity.

In addition, both single and concurrent downloads should respect basic “politeness” rules. Machines consume resources on the visited systems and they often visit sites unprompted. That is why issues of schedule, load, and politeness come into play. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent.

We will see below what these mechanisms are and how to take them into account. This additional constraint means we not only have to care about download speed but also manage a register of known websites and apply the rules so as to keep maximizing speed while not being too intrusive. To keep an eye on all these constraints at once, the best option is to use an open-source framework you can trust or scrutinize yourself.

Setup

In the following, we will see a fairly simple solution to the problem: how to perform downloads sequentially and in parallel, while essential politeness rules are taken care of. The web crawling and scraping framework used in the examples below can be easily installed with pip:

$ pip install trafilatura  # pip3 where applicable

Simple downloads

Running simple downloads is straightforward. For efficiency reasons, the fetch_url() function makes use of a connection pool where connections are kept open (unless too many websites are queried at once).

from trafilatura.downloads import fetch_url
mylist = ["https://www.example.org", "https://httpbin.org"]
for url in mylist:
    downloaded = fetch_url(url)
    # do something with it

This sequential method is also known as single-threaded downloading.

Too simple parallel downloads

Threads are a way to run several parts of a program at once; see for instance An Intro to Threading in Python. Multi-threaded downloads are a good option to make more efficient use of the Internet connection. The threads download pages as they go.

This only makes sense if you are fetching pages from different websites and want the downloads to run in parallel. Otherwise you could hammer a website with requests and risk getting banned.

Caution: The following code is provided for reference only:

from concurrent.futures import ThreadPoolExecutor, as_completed
from trafilatura import fetch_url

# buffer list of URLs
bufferlist = []  # [url1, url2, ...]

# download pool: 4 threads
with ThreadPoolExecutor(max_workers=4) as executor:
    future_to_url = {executor.submit(fetch_url, url): url for url in bufferlist}
    for future in as_completed(future_to_url):
        # do something here:
        url = future_to_url[future]
        print(url, future.result())

Asynchronous processing is probably even more efficient in the context of file downloads from a variety of websites. See for instance the AIOHTTP library.
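
For reference, here is a minimal sketch of asynchronous downloads with AIOHTTP (assuming the library is installed separately with pip install aiohttp; the URLs are placeholders and trafilatura is not involved here):

import asyncio
import aiohttp

urls = ["https://www.example.org", "https://httpbin.org/html"]

async def fetch(session, url):
    # one GET request per URL, returning the page body as text
    async with session.get(url) as response:
        return await response.text()

async def main():
    async with aiohttp.ClientSession() as session:
        # run all requests concurrently on a single event loop
        return await asyncio.gather(*(fetch(session, url) for url in urls))

for url, html in zip(urls, asyncio.run(main())):
    print(url, len(html))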

Much better: throttling threads

A safe but efficient option consists in throttling requests based on domains/websites from which content is downloaded. This method is highly recommended!

The framework implements the following variant: multi-threaded downloads with throttling. It also uses a compressed dictionary to store URLs and save space. Both happen seamlessly; here is how to run it:

from trafilatura.downloads import add_to_compressed_dict, buffered_downloads, load_download_buffer

# list of URLs
mylist = ['https://www.example.org', 'https://www.httpbin.org/html']
# number of threads to use
threads = 2

backoff_dict = dict() # has to be defined first
# convert the input list to an internal format
dl_dict = add_to_compressed_dict(mylist)
# processing loop
while dl_dict:
    buffer, threads, dl_dict, backoff_dict = load_download_buffer(dl_dict, backoff_dict)
    for url, result in buffered_downloads(buffer, threads):
        # do something here
        print(url, result)

Politeness rules

Beware that a tacit scraping etiquette applies and that a server may block you after the download of a certain number of pages from the same website/domain in a short period of time:

  • We want to space out requests to any given server and not request the same content multiple times in a row
  • We should also avoid parts of a server that are restricted
  • We save time for ourselves and others if we do not request unnecessary information (see content-aware URL selection)

Robots exclusion standard

The robots.txt file is usually available at the root of a website (e.g. www.example.com/robots.txt). It describes what a crawler should or should not crawl according to the Robots Exclusion Standard. Certain websites indeed restrict access for machines, for example by limiting the number of web pages or the site sections which are open to them.

The file lists a series of rules which define how bots may interact with the website. It has to be fetched and parsed in order to test whether a given URL passes the robot restrictions. These politeness policies must be respected.

Python features a module addressing the issue in its standard library. The gist of its operation is shown below; for more, see urllib.robotparser in the official Python documentation.

import urllib.robotparser
from trafilatura import get_crawl_delay

# define a website to look for rules
base_url = "https://www.example.org"

# load the necessary components, fetch and parse the file
rules = urllib.robotparser.RobotFileParser()
rules.set_url(base_url + "/robots.txt")
rules.read()

# determine if a page can be fetched by all crawlers
rules.can_fetch("*", "https://www.example.org/page1234.html")
# returns True or False

In addition, some websites may block certain user agents. By replacing the star with one’s own user agent (e.g. the bot name), we can check whether we have been explicitly banned from certain sections or from the whole website, which can happen when rules are ignored.
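
For instance, reusing the rules object defined above with a hypothetical bot name:

# hypothetical user agent name, to be replaced by your bot's actual name
rules.can_fetch("mybot", "https://www.example.org/page1234.html")
# returns False if "mybot" is explicitly disallowed for this path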

Wait between downloads

There should be an interval between successive requests to avoid burdening the web servers of interest. That way, you will not slow them down and/or risk getting banned.

To prevent the execution of too many requests within too little time, the optional argument sleep_time can be passed to the load_download_buffer() function. It is the time in seconds between two requests for the same domain/website.

from trafilatura.downloads import load_download_buffer

# 30 seconds is a safe choice
mybuffer, threads, dl_dict, backoff_dict = load_download_buffer(dl_dict, backoff_dict, sleep_time=30)
# then proceed as instructed above...

One of the rules that can be defined by a robots.txt file is the crawl delay (Crawl-Delay), i.e. the time between two download requests for a given website. This delay (in seconds) can be retrieved as follows:

# get the desired information using the rules fetched above
seconds = get_crawl_delay(rules)
# provide a backup value in case no rule exists (happens quite often)
seconds = get_crawl_delay(rules, default=30)
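
The retrieved value can then be fed back into the throttling mechanism described above, for instance as the sleep_time argument. This is a sketch reusing the objects defined earlier; note that the delay read for one website is applied here as a global throttling value, which is a simplification:

# reuse the crawl delay (or the 30-second backup) as throttling value
mybuffer, threads, dl_dict, backoff_dict = load_download_buffer(dl_dict, backoff_dict, sleep_time=seconds)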

TL;DR

Here is the simplest way to stay polite while taking all potential constraints into account:

  1. Read robots.txt files, filter your URL list accordingly (see the sketch after this list) and care for crawl delay
  2. Use the framework described above and set the throttling variable to a safe value (your main bottleneck is your connection speed anyway)
  3. Optional: for longer crawls, keep track of the throttling info and revisit robots.txt regularly
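
As a minimal sketch for step 1, relying only on the standard library (the helper function, cache and bot name below are illustrative and not part of the framework):

import urllib.robotparser
from urllib.parse import urlparse

rules_cache = {}  # one parser per website root

def allowed(url, agent="mybot"):
    # illustrative helper: fetch and cache the robots.txt rules of each website
    parsed = urlparse(url)
    base = parsed.scheme + "://" + parsed.netloc
    if base not in rules_cache:
        rules = urllib.robotparser.RobotFileParser()
        rules.set_url(base + "/robots.txt")
        rules.read()
        rules_cache[base] = rules
    return rules_cache[base].can_fetch(agent, url)

# keep only the URLs your bot is allowed to fetch
mylist = [url for url in mylist if allowed(url)]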

For further info and rules see the documentation page on downloads.