Gathering web documents often means downloading portions of a website programmatically, mostly to save time and resources. Retrieving and downloading the documents within a website is commonly called web crawling or web spidering. This post describes practical ways to crawl websites by working with sitemaps on the command line. It contains all the code snippets needed to optimize link discovery and filtering.
Why sitemaps are useful
A sitemap is a file that lists the visible or whitelisted URLs for a given site; its main goal is to reveal where machines can look for content. Web crawlers usually discover pages from links within the site and from other sites, following a series of rules and protocols. Sitemaps supplement this data: crawlers that support the protocol can pick up all URLs listed in the sitemap and learn about them using the associated metadata.
The sitemaps protocol primarily allows webmasters to inform search engines about the pages of their sites that are available for crawling. Sitemaps follow the XML format, so each sitemap is, or should be, a valid XML file.
Sitemaps are particularly useful for large or complex websites, since they allow machines to crawl the site more intelligently. This is especially true when there is a chance of overlooking new or recently updated content, for example because some areas of the website are not reachable through the browsable interface, or because the website has a huge number of pages that are isolated or poorly linked together.
About this tutorial
The following instructions use the command-line interface (CLI):
- For a primer please refer to this excellent step-by-step introduction to the CLI.
- For general information please refer to Command Prompt (tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.
Straightforward download, listing and processing
The tool for web text extraction I am working on can process a list of URLs and find the main text along with useful metadata. Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure.
In addition, trafilatura includes support for multilingual and multinational sitemaps. For example, a site can target English-speaking users through links like
http://www.example.com/en/… and German-speaking users through links like http://www.example.com/de/…
With its focus on straightforward, easy extraction, this tool can be used straight from the command-line. In all, trafilatura should make it much easier to list links from sitemaps and also download and process them if required. It is the recommended method to perform the task described here.
If the pip command sounds familiar, you can install the package directly with it:
pip3 install trafilatura. Otherwise, please refer to the tutorial on installing the trafilatura tool.
- Links can be gathered straight from the homepage (using heuristics) or from a particular URL if it is already known
- The --list option is useful to list URLs prior to processing
- Link discovery can start from an input file (-i) containing a list of sources, which will then be processed in parallel
The following examples return lists of links. If
--list is absent, the pages that are found are directly retrieved, processed, and returned in the chosen output format (by default: TXT on standard output).
# run link discovery through a sitemap for sitemaps.org and store the resulting links in a file
$ trafilatura --sitemap "https://www.sitemaps.org/" --list > mylinks.txt
# using an already known sitemap URL
$ trafilatura --sitemap "https://www.sitemaps.org/sitemap.xml" --list
# targeting webpages in German
$ trafilatura --sitemap "https://www.sitemaps.org/" --list --target-language "de"
Websites can also be examined in parallel by providing a list as input, named here
mylist.txt. The list has to contain a series of URLs, one per line.
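As a quick sketch, such an input file can be assembled directly on the command line (the two URLs below are just placeholders for whatever sources you want to process):

```shell
# write two example websites to mylist.txt, one URL per line
printf 'https://www.sitemaps.org/\nhttps://www.iana.org/\n' > mylist.txt
cat mylist.txt
```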
# parallel retrieval of sitemap links from a list of websites
$ trafilatura -i mylist.txt --sitemap --list
# same operation, targeting webpages in German
$ trafilatura -i mylist.txt --sitemap --list --target-language "de"
An alternative step-by-step method is described below, for reference and in case trafilatura cannot be used.
Download and filtering
The sitemap.xml file is usually located at the root of a website. If present, it can almost always be found by appending the file name after the domain name and a slash:
# download a webpage and output the result in the terminal
wget --quiet -O- "https://www.iana.org/"
# a more elaborate way to use wget (recommended)
wget "https://www.iana.org/" -O- --append-output=wget.log --user-agent "$useragent" --waitretry 60 --tries=2
# --no-check-certificate is also an interesting option
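Following the convention above, the usual sitemap address can be derived from a homepage URL with plain shell substitution. A minimal sketch, reusing the example domain:

```shell
# strip a potential trailing slash, then append the conventional file name
homepage="https://www.iana.org/"
sitemap_url="${homepage%/}/sitemap.xml"
echo "$sitemap_url"
```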
Assuming that the XML files are well-formed and can be searched in a “quick and dirty” way without parsing (which seems reasonable here):
# Search for all patterns that start like a URL:
# -P extends the range of usable regular expressions to Perl-style ones
# -o only prints the matching expressions and not the rest of the line
# \K in the expression marks the start of the content to output
cat filename | grep -Po "https?://\K.+?$"
To better capture the URLs and clean the result:
cat sitemap.xml | grep -Po "<loc>\K.+?(?=</loc>)"
# works just the same:
< filename grep -Po "https?://\K.+?$"
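To illustrate, here is the same extraction run on a minimal, made-up sitemap fragment (the example.org URLs are placeholders, standing in for a downloaded file):

```shell
# a minimal sitemap sample standing in for a downloaded sitemap.xml
cat > sample-sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/page-1</loc></url>
  <url><loc>https://example.org/page-2</loc></url>
</urlset>
EOF
# extract only the URLs between the <loc> tags
grep -Po "<loc>\K.+?(?=</loc>)" sample-sitemap.xml
```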
It can happen that the first sitemap to be seen acts as a first-level sitemap, listing a series of other sitemaps which then lead to HTML pages:
https://fussballlinguistik.de/sitemap.xml lists two different XML files: the first (
https://fussballlinguistik.de/sitemap-1.xml) contains the text content while the other deals with images.
In this case, we can get to the content by using the second-level file, which lists the URLs that way:
wget -qO- "https://fussballlinguistik.de/sitemap-1.xml" | grep -Po "<loc>\K.+?(?=</loc>)" > urls.txt
It is easy to write code dealing with this situation: if you have found a valid sitemap but all the links end in
.xml, it is most probably a first-level sitemap. Sitemap URLs can be expected to be wrapped within a
<sitemap> element, while web pages are listed as <url> elements.
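A minimal sketch of this check in shell, run here on a made-up link list (the example.org URLs are placeholders; in practice the links variable would be filled from a downloaded sitemap):

```shell
# links extracted from a first downloaded sitemap (sample data)
links='https://example.org/sitemap-1.xml
https://example.org/sitemap-2.xml'
# if no link fails to end in .xml, treat the file as a first-level sitemap
if ! echo "$links" | grep -qv '\.xml$'; then
    echo "first-level sitemap: fetch each listed file and extract the <loc> elements again"
else
    echo "regular sitemap: the links point to web pages"
fi
```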
Another method for extracting URLs is described by Noah Bubenhofer in his tutorial on corpus linguistics, Download von Web-Daten I. The gist of it is to use another command-line tool (cURL) to download a series of pages and then look for links in the result if necessary:
# The selector [1-11] programmatically crawls the summary pages
# while #1 refers to the page number to store the content
curl -L https://www.bubenhofer.com/sprechtakel/page/[1-11]/ -o "sprechtakel_page_#1.html"
Looking at the HTML code of the pages, we see that (as of now) each blog entry is found that way:
<h2 class="entry-title"><a href="https://www.bubenhofer.com/sprechtakel/2015/08/...</h2>
It is then possible to use grep to extract the list of URLs:
grep -hPo '<h2 class="entry-title"><a href="\K.+?(?=")' *.html > urls.txt
A few useful tools
# sort the links and make sure they are unique
sort -u myfile.txt
# shuffle the URLs
shuf myfile.txt
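As a small, self-contained sketch (the file contents below are made up), deduplicating a link list before downloading saves needless requests:

```shell
# a link list containing one duplicate
printf 'https://a.example/\nhttps://b.example/\nhttps://a.example/\n' > myfile.txt
# keep each link exactly once, in sorted order
sort -u myfile.txt > unique-links.txt
wc -l < unique-links.txt
```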
See more about content discovery in this tutorial: Gathering a custom web corpus.
Finally, please note that it is not considered fair to retrieve a large number of documents from a website in a short period of time. For more information, please check the robots exclusion standard and the
robots.txt file of the websites that enforce it.