To gather web documents, it can be useful to download portions of a website programmatically, mostly to save time and resources. The retrieval and download of documents within a website is often called web crawling or web spidering. This post describes practical ways to find URLs within a website and to work with URL lists on the command line. It contains the code snippets needed to optimize link discovery and filtering.

Getting started

Why sitemaps are useful

A sitemap is a file that lists the visible or whitelisted URLs for a given site, the main goal being to reveal where machines can look for content. Web crawlers usually discover pages from links within the site and from other sites, following a series of rules and protocols. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata.

The sitemaps protocol primarily allows webmasters to inform search engines about pages on their sites that are available for crawling. Sitemaps follow the XML format, so each sitemap is or should be a valid XML file.

Sitemaps are particularly useful for large or complex websites since they are designed to let machines crawl the site more intelligently. This is particularly true if there is a chance of overlooking new or recently updated content, for example because some areas of the website are not available through the browsable interface, or because the website has a huge number of pages that are isolated or not well linked together.

About this tutorial

The following instructions use the command-line interface (CLI).

Straightforward download, listing and processing

Tooling

The tool for web text extraction I am working on can process a list of URLs and find the main text along with useful metadata. Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure.

In addition, trafilatura includes support for multilingual and multinational sitemaps. For example, a site can target English language users through links like http://www.example.com/en/… and German language users through http://www.example.com/de/….

With its focus on straightforward, easy extraction, this tool can be used straight from the command-line. In all, trafilatura should make it much easier to list links from sitemaps and also download and process them if required. It is the recommended method to perform the task described here.

If pip sounds familiar, you can use it to install the tool directly: pip3 install trafilatura. Otherwise, please refer to the tutorial on installing the trafilatura tool.

Gathering links

Main features

  • Links can be gathered straight from the homepage (using heuristics) or using a particular URL if it is already known
  • The --list option is useful to list URLs prior to processing
  • Link discovery can start from an input file (-i) containing a list of sources which will then be processed in parallel

Examples

The following examples return lists of links. If --list is absent, the pages that have been found are directly retrieved, processed, and returned in the chosen output format (by default: TXT on standard output).

# run link discovery through a sitemap for sitemaps.org and store the resulting links in a file
$ trafilatura --sitemap "https://www.sitemaps.org/" --list > mylinks.txt
# using an already known sitemap URL
$ trafilatura --sitemap "https://www.sitemaps.org/sitemap.xml" --list
# targeting webpages in German
$ trafilatura --sitemap "https://www.sitemaps.org/" --list --target-language "de"

Websites can also be examined in parallel by providing a list as input, named here mylist.txt. The list has to contain a series of URLs, one per line.

# parallel retrieval of sitemap links from list of websites
$ trafilatura -i mylist.txt --sitemap --list
# same operation, targeting webpages in German
$ trafilatura -i mylist.txt --sitemap --list --target-language "de"
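
Once a link list such as mylinks.txt has been saved, it can also be narrowed down with standard tools before any page is downloaded. The following is a minimal sketch, assuming that language versions are marked by a path segment such as /de/ (as in the example.com links mentioned earlier). The function name filter_by_segment is hypothetical and not part of trafilatura.

```shell
# filter_by_segment is a hypothetical helper, not a trafilatura feature.
# It reads URLs on standard input and keeps those containing
# the given path segment, e.g. "de" for .../de/...
filter_by_segment() {
    grep -F "/$1/"
}

# usage with inline sample data (example.com is a placeholder domain)
filter_by_segment "de" <<'EOF'
http://www.example.com/en/about
http://www.example.com/de/ueber-uns
http://www.example.com/de/kontakt
EOF
```

With a stored list, the same helper would be called as filter_by_segment "de" < mylinks.txt. This is a purely textual filter: it does not check the actual language of the pages.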

Step-by-step alternative

An alternative step-by-step method is described below, for reference and in case trafilatura cannot be used.

Download and filtering

A sitemap.xml file is usually located at the root of a website. If it is present, it can almost always be found by appending a slash and the file name to the domain name: https://www.sitemaps.org becomes https://www.sitemaps.org/sitemap.xml.
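
The rule above can be sketched as a small shell function. This is only an illustration under the assumption that the sitemap sits at the root of the site under the conventional name; the function name guess_sitemap_url is hypothetical.

```shell
# guess_sitemap_url is a hypothetical helper: it keeps the scheme and
# the host of a URL and appends the conventional file name.
guess_sitemap_url() {
    printf '%s\n' "$1" | sed -E 's#^(https?://[^/]+).*#\1/sitemap.xml#'
}

guess_sitemap_url "https://www.sitemaps.org"
guess_sitemap_url "https://www.sitemaps.org/protocol.html"
# both calls print https://www.sitemaps.org/sitemap.xml
```

The resulting URL is only a guess: whether a file actually exists at that address still has to be checked by downloading it.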

To retrieve and process a sitemap, the following tools can be used: wget to download files from the Internet; cat to print the content of files; and grep to search for information within the sitemaps.

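# download a webpage and output the result in the terminal
wget --quiet -O- "https://www.iana.org/"
# a more elaborate way to use wget (recommended)
# $useragent has to be set beforehand, the value below is only a placeholder
useragent="mybot/0.1 (+https://www.example.org/contact)"
wget "https://www.iana.org/" -O- --append-output=wget.log --user-agent "$useragent" --waitretry 60 --tries=2
# --no-check-certificate skips the validation of TLS certificates, to be used with caution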

Assuming that the XML files are well-formed and regular enough to be searched in a “quick and dirty” way without parsing (which seems reasonable here):

# Search for all patterns that start like a URL:
# -P extends the range of usable regular expressions to Perl-style ones
# -o only prints the matching expressions and not the rest of the line
# \K marks the start of the content to output, so the scheme (https://) is left out
cat filename | grep -Po "https?://\K.+?$"

In sum, to better capture the URLs and clean the result:

cat sitemap.xml | grep -Po "<loc>\K.+?(?=</loc>)"
# works just the same, without cat:
< sitemap.xml grep -Po "<loc>\K.+?(?=</loc>)"
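
To see what the pattern does without downloading anything, it can be tried on a small inline sample. The sketch below assumes GNU grep (for the -P option); the helper name extract_locs and the example.org URLs are made up for illustration.

```shell
# extract_locs is a hypothetical helper: it prints the content of
# every <loc> element found on standard input (GNU grep required for -P)
extract_locs() {
    grep -Po '<loc>\K.+?(?=</loc>)'
}

# inline sample sitemap with placeholder URLs
extract_locs <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.org/</loc><lastmod>2020-01-01</lastmod></url>
  <url><loc>https://www.example.org/about</loc></url>
</urlset>
EOF
```

This prints the two URLs, one per line. Since the match is non-greedy, several <loc> elements on the same line are also returned as separate results, which matters for sitemaps delivered without line breaks.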

Nested sitemaps

It can happen that the first sitemap to be seen acts as a first-level sitemap, listing a series of other sitemaps which then lead to HTML pages. For example, https://fussballlinguistik.de/sitemap.xml lists two different XML files: the first contains the text content (https://fussballlinguistik.de/sitemap-1.xml) while the second deals with images.

In this case, we can get to the content by using the second-level file, which lists the URLs as follows:

<url><loc>https://fussballlinguistik.de/2016/10/hello-world/</loc>...

wget -qO- "https://fussballlinguistik.de/sitemap-1.xml" | grep -Po "<loc>\K.+?(?=</loc>)" > urls.txt

It is easy to write code dealing with this situation: if you have found a valid sitemap but all the links end in .xml, it is most probably a first-level sitemap. Sitemap URLs can be expected to be wrapped within a <sitemap> element, while web pages are listed as <url>.
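
The heuristic described above could look as follows in the shell. This is a rough sketch that searches for the element names as plain text instead of parsing the XML; the function name sitemap_kind and the labels it prints are hypothetical.

```shell
# sitemap_kind is a hypothetical helper: it reads a sitemap on standard
# input and prints "index" if it lists other sitemaps (<sitemap> elements),
# "urlset" if it lists web pages (<url> elements), "unknown" otherwise.
sitemap_kind() {
    local content
    content=$(cat)
    if printf '%s\n' "$content" | grep -q '<sitemap>'; then
        echo "index"
    elif printf '%s\n' "$content" | grep -q '<url>'; then
        echo "urlset"
    else
        echo "unknown"
    fi
}

# inline sample of a first-level sitemap with a placeholder URL
sitemap_kind <<'EOF'
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.org/sitemap-1.xml</loc></sitemap>
</sitemapindex>
EOF
```

For a first-level sitemap, the extracted <loc> values are themselves sitemaps which have to be downloaded and searched in a second pass.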

Automated navigation

Another method for the extraction of URLs is described by Noah Bubenhofer in his tutorial on Corpus Linguistics, Download von Web-Daten I. The gist of it is to use another command-line tool (cURL) to download a series of pages and then to look for links in the result if necessary:

# The range [1-11] makes curl fetch the summary pages 1 to 11
# while #1 inserts the current page number into the name of the output file
# (the URL is quoted so that the shell does not interpret the brackets)
curl -L "https://www.bubenhofer.com/sprechtakel/page/[1-11]/" -o "sprechtakel_page_#1.html"

Looking at the HTML code of the pages, we see that (as of now) each blog entry is marked up as follows:

<h2 class="entry-title"><a href="https://www.bubenhofer.com/sprechtakel/2015/08/...</h2>

It is then possible to use grep to extract the list of URLs.

In the following command, the -h option suppresses the output of file names, and the regular expression uses non-greedy matching (first question mark) as well as a lookahead (second one):

grep -hPo '<h2 class="entry-title"><a href="\K.+?(?=")' *.html > urls.txt
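
The effect of this expression can be checked on a small inline sample before running it on downloaded files. The sketch below assumes GNU grep; the helper name extract_entry_links and the example.org links are placeholders, not actual blog entries.

```shell
# extract_entry_links is a hypothetical helper: it prints the link target
# of every entry title found in the given files, or on standard input
# if no file is given (GNU grep required for -P)
extract_entry_links() {
    grep -hPo '<h2 class="entry-title"><a href="\K.+?(?=")' "$@"
}

# inline sample with placeholder links
extract_entry_links <<'EOF'
<h2 class="entry-title"><a href="https://www.example.org/blog/first-post/">First post</a></h2>
<p>Some text with <a href="https://www.example.org/other/">another link</a>.</p>
<h2 class="entry-title"><a href="https://www.example.org/blog/second-post/">Second post</a></h2>
EOF
```

Only the two links inside entry titles are printed; the link in the paragraph is ignored because it is not preceded by the h2 markup.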

Going further

A few useful tools

# sort the links and make sure they are unique
sort -u myfile.txt
# shuffle the URLs
shuf myfile.txt
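
Both tools can be combined, for example to draw a random sample from a deduplicated list. The following is a minimal sketch assuming GNU coreutils (for shuf); the function name sample_urls is hypothetical.

```shell
# sample_urls is a hypothetical helper: it removes duplicates from the
# URLs read on standard input and prints a random sample of at most N lines
sample_urls() {
    sort -u | shuf -n "$1"
}

# draw 2 distinct URLs out of a list containing a duplicate
sample_urls 2 <<'EOF'
https://www.example.org/a
https://www.example.org/b
https://www.example.org/a
https://www.example.org/c
EOF
```

With a stored list, the call would be sample_urls 100 < myfile.txt > sample.txt. If the list holds fewer than N unique lines, all of them are returned.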

Content discovery

See more about content discovery in this tutorial: Gathering a custom web corpus.

Finally, please note that it is not considered fair to retrieve a large number of documents from a website in a short period of time. For more information, please check the robots exclusion standard and the robots.txt file for websites that enforce it.
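
One simple way to spread requests over time is to pause between downloads. The following is a sketch under assumptions and not a complete crawler: the function name polite_download, the output directory pages and the default delay of 5 seconds are arbitrary choices, and the function does not read robots.txt.

```shell
# polite_download is a hypothetical helper: it reads URLs on standard
# input, stores each page in the directory "pages" and waits between
# requests (delay in seconds as first argument, 5 by default, an
# arbitrary value).
polite_download() {
    local delay="${1:-5}"
    while IFS= read -r url; do
        # skip empty lines
        [ -n "$url" ] || continue
        wget --quiet --directory-prefix=pages "$url" || echo "failed: $url" >&2
        sleep "$delay"
    done
}

# usage (commented out here since it needs network access):
# polite_download 10 < urls.txt
```

wget also offers a built-in equivalent: its --wait option, combined with -i urls.txt, pauses between the retrievals of a list.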