In order to gather web documents, it can be useful to download portions of a website programmatically. This post shows ways to find URLs within a website and to work with URL lists on the command line.

For general information on command-line operations please refer to Command Prompt (tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.

Download of sitemaps and extraction of URLs

A sitemap is a file that lists the visible URLs for a given site, the main goal being to reveal where machines can look for content. The retrieval and download of documents within a website is often called crawling. The sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. Sitemaps follow the XML format, so each sitemap is or should be a valid XML file.
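For reference, here is what a minimal sitemap following the protocol looks like (the URL and date below are mere placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>https://www.example.com/page.html</loc>
      <lastmod>2020-01-01</lastmod>
   </url>
</urlset>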

Download and filtering

A sitemap.xml file is usually located at the root of a website. If it is present, it is almost always to be found by appending the file name after the domain name and a slash: https://www.sitemaps.org becomes https://www.sitemaps.org/sitemap.xml.
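A quick way to check whether a sitemap is available at this location is to send a request without downloading the file, for instance with wget's spider mode (the domain is just an example):

# check whether a sitemap is present without downloading it
wget --spider -q "https://www.sitemaps.org/sitemap.xml" && echo "sitemap found"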

In order to retrieve and process a sitemap, the following tools can be used: wget to download files from the Internet; cat to display the contents of files; and grep to search for information within the sitemaps.

# download a webpage and output the result in the terminal
wget --quiet -O- "https://www.iana.org/"
# a more elaborate way to use wget (recommended)
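# note: the $useragent variable is assumed to be set beforehand,
# e.g. useragent="mybot/0.1 (+https://example.org/bot-info)" (placeholder value)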
wget "https://www.iana.org/" -O- --append-output=wget.log --user-agent "$useragent" --waitretry 60 --tries=2
# --no-check-certificate is also an interesting option

Assuming that the XML files are regular/conformant and can be searched in a “quick and dirty” way without actually parsing them (which seems reasonable here):

# Search for all patterns that start like a URL:
# -P extends the range of usable regular expressions to Perl-style ones
# -o only print the matching expressions and not the rest of the line
# \K in the expression marks the start of content to output
cat filename | grep -Po "https?://\K.+?$"

In order to better capture the URLs and obtain a cleaner result:

cat sitemap.xml | grep -Po "<loc>\K.+?(?=</loc>)"
# input redirection works just the same as cat:
< filename grep -Po "https?://\K.+?$"

Nested sitemaps

It can happen that the first sitemap one comes across acts as a first-level sitemap, listing a series of other sitemaps which in turn lead to HTML pages: https://fussballlinguistik.de/sitemap.xml lists two different XML files: the first one contains the text content (https://fussballlinguistik.de/sitemap-1.xml) while the other one deals with images.

In this case, we can get to the content by using the second-level file, which lists the URLs this way: <url><loc>https://fussballlinguistik.de/2016/10/hello-world/</loc>...

    wget -qO- "https://fussballlinguistik.de/sitemap-1.xml" | grep -Po "<loc>\K.+?(?=</loc>)" > fussballlinguistik-urls.txt

It is easy to write code dealing with this situation: if you have found a valid sitemap but all the links end in .xml, it is most probably a first-level sitemap. Sitemap URLs can be expected to be wrapped within a <sitemap> element, while web pages are listed as <url> elements.
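Here is a minimal shell sketch of this logic, reusing the grep command from above (the file names are placeholders):

# extract the locations listed in a first-level sitemap
wget -qO- "https://fussballlinguistik.de/sitemap.xml" | grep -Po "<loc>\K.+?(?=</loc>)" > locations.txt
# locations ending in .xml are sitemaps themselves:
# fetch each of them and extract the page URLs
grep "\.xml$" locations.txt | while read -r sitemapurl; do
    wget -qO- "$sitemapurl" | grep -Po "<loc>\K.+?(?=</loc>)"
done > page-urls.txt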

Further navigation

Another method for the extraction of URLs is described by Noah Bubenhofer in his tutorial on Corpus Linguistics, Download von Web-Daten I. The gist of it is to use another command-line tool (cURL) to download a series of pages and then to look for links in the result if necessary:

# The selector [1-11] programmatically downloads the summary pages
# while #1 inserts the current page number into the name of the output file
curl -L "https://www.bubenhofer.com/sprechtakel/page/[1-11]/" -o "sprechtakel_page_#1.html"

Looking at the HTML code of the pages, we see that (as of now) each blog entry is introduced this way: <h2 class="entry-title"><a href="https://www.bubenhofer.com/sprechtakel/2015/08/... It is then possible to use grep to extract the list of URLs.

In the following command, the -h option suppresses the output of file names; the regular expression uses non-greedy matching (first question mark) as well as a lookahead assertion (second one):

grep -hPo '<h2 class="entry-title"><a href="\K.+?(?=")' *.html > urls.txt

A few useful tools

# sort the links and make sure they are unique
sort -u myfile.txt
# shuffle the URLs
shuf myfile.txt
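
These operations can be combined, for instance to deduplicate a URL list, shuffle it and count the remaining entries (the file names are arbitrary):

sort -u myfile.txt | shuf > prepared-urls.txt
# count the URLs in the resulting list
wc -l prepared-urls.txt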

The Python tool for web text extraction I am working on (trafilatura) can process a list of URLs and find the main text along with useful metadata. It can be used from Python or on the command line:

trafilatura -i list_of_urls.txt

Finally, please note that it is not considered fair to retrieve a large number of documents from a website in a short period of time. For more information, please check the robots exclusion standard and the robots.txt file of websites that use it.
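
The robots.txt file sits at the root of a website just like the sitemap, so it can be inspected the same way (the domain is again just an example):

# display the crawling rules of a website
wget -qO- "https://www.iana.org/robots.txt"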