In order to gather web documents, it can be useful to download portions of a website programmatically. This post shows ways to find URLs within a website and to work with URL lists on the command-line.
For general information on command-line operations please refer to Command Prompt (tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.
Download of sitemaps and extraction of URLs
A sitemap is a file that lists the visible URLs for a given site, the main goal being to reveal where machines can look for content. The retrieval and download of documents within a website is often called crawling. The sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. Sitemaps follow the XML format, so each sitemap is or should be a valid XML file.
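For reference, a minimal sitemap following the protocol looks like this (example.org is a placeholder domain, the date is made up):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.org/page1</loc>
    <lastmod>2021-01-01</lastmod>
  </url>
</urlset>
```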
Download and filtering
The sitemap.xml file is usually located at the root of a website. If it is present, it can almost always be found by appending the file name after the domain name and a slash:
# download a webpage and output the result in the terminal
wget --quiet -O- "https://www.iana.org/"
# a more elaborate way to use wget (recommended)
wget "https://www.iana.org/" -O- --append-output=wget.log --user-agent "$useragent" --waitretry 60 --tries=2
# --no-check-certificate is also an interesting option
Assuming that the XML files are regular/conform and can be searched in a “quick and dirty” way without parsing (which seems reasonable here):
# Search for all patterns that start like a URL:
# -P extends the range of usable regular expressions to Perl-style ones
# -o only print the matching expressions and not the rest of the line
# \K in the expression marks the start of content to output
cat filename | grep -Po "https?://\K.+?$"
In sum, in order to better capture the URLs and clean the result:
cat sitemap.xml | grep -Po "<loc>\K.+?(?=</loc>)"
# works just the same:
< filename grep -Po "https?://\K.+?$"
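The expression can be checked on an inline fragment first (sample data with a made-up URL, not a real sitemap):

```shell
# try the extraction on a single sample <loc> element
printf '<url><loc>https://example.org/page1</loc></url>\n' | grep -Po "<loc>\K.+?(?=</loc>)"
# prints: https://example.org/page1
```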
It can happen that the first sitemap to be seen acts as a first-level sitemap, listing a series of other sitemaps which then lead to HTML pages:
https://fussballlinguistik.de/sitemap.xml lists two different XML files: the first one (https://fussballlinguistik.de/sitemap-1.xml) contains the text content, while the other deals with images. In this case, we can get to the content by using the second-level file, which lists the URLs that way:
wget -qO- "https://fussballlinguistik.de/sitemap-1.xml" | grep -Po "<loc>\K.+?(?=</loc>)" > fussballlinguistik-urls.txt
It is easy to write code dealing with this situation: if you have found a valid sitemap but all the links end in .xml, it is most probably a first-level sitemap. It can be expected that sitemap URLs are wrapped within a <sitemap> element, while web pages are listed within a <url> element.
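The detection step can be sketched as follows; is_sitemap_index is a hypothetical helper name and the link list is sample data with placeholder URLs:

```shell
# decide whether a list of extracted links points to further sitemaps:
# exit status 0 means every link ends in .xml, i.e. a first-level sitemap
is_sitemap_index() {
    ! printf '%s\n' "$1" | grep -qv '\.xml$'
}

links='https://example.org/sitemap-1.xml
https://example.org/sitemap-2.xml'
if is_sitemap_index "$links"; then
    echo "first-level sitemap, fetch the listed files next"
fi
```

On a first-level sitemap, each listed file would then be downloaded and searched for <loc> elements in turn, as shown above.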
Another method for the extraction of URLs is described by Noah Bubenhofer in his tutorial on Corpus Linguistics, Download von Web-Daten I. The gist of it is to use another command-line tool (cURL) to download series of pages and then to look for links in the result if necessary:
# The selector [1-11] programmatically crawls the summary pages
# while #1 refers to the page number to store the content
curl -L https://www.bubenhofer.com/sprechtakel/page/[1-11]/ -o "sprechtakel_page_#1.html"
Looking at the HTML code of the pages, we see that (as of now) each blog entry is referenced that way: <h2 class="entry-title"><a href="https://www.bubenhofer.com/sprechtakel/2015/08/... It is then possible to use grep to extract the list of URLs.
grep -hPo '<h2 class="entry-title"><a href="\K.+?(?=")' *.html > urls.txt
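The pattern can be verified on a sample line (made-up URL) without downloading anything:

```shell
# extract the href value from a sample entry-title heading
printf '%s\n' '<h2 class="entry-title"><a href="https://example.org/post-1">Title</a></h2>' \
    | grep -Po '<h2 class="entry-title"><a href="\K.+?(?=")'
# prints: https://example.org/post-1
```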
A few useful tools
# sort the links and make sure they are unique
sort -u myfile.txt
# shuffle the URLs
shuf myfile.txt
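Both tools combine well in practice, for instance to build a deduplicated list and draw a random sample from it (file names and URLs below are placeholders):

```shell
# build a small sample list with one duplicate entry
printf 'https://example.org/a\nhttps://example.org/b\nhttps://example.org/a\n' > urls.txt
# deduplicate: urls-unique.txt now holds 2 lines
sort -u urls.txt > urls-unique.txt
# print one URL drawn at random
shuf -n 1 urls-unique.txt
```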
The Python tool for web text extraction I am working on (trafilatura) can process a list of URLs and find the main text along with useful metadata. It can be used from Python or on the command-line:
trafilatura -i list_of_urls.txt
Finally, please note that it is not considered fair to retrieve a large number of documents from a website in a short period of time. For more information, please check the robots exclusion standard and the robots.txt file of websites that enforce it.