Using sitemaps to crawl websites

In order to gather web documents, it can be useful to download portions of a website programmatically. This post shows ways to find URLs within a website and to work with URL lists on the command line.

For general information on command-line operations, please refer to Command Prompt (a tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.

Download of sitemaps and extraction of URLs

A sitemap is a file that lists the visible URLs for a given site, the main goal being to reveal where machines can look for content. The retrieval and download of documents within a website is often called crawling. The sitemaps protocol allows a webmaster to inform search engines about URLs on a website that are available for crawling. Sitemaps follow the XML format, so each sitemap is or should be a valid XML file.
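As a rough illustration, the following Python sketch parses a miniature, hand-written sitemap and lists the URLs it contains; the namespace is the one defined by the sitemaps protocol, while the page addresses are mere placeholders:

    from xml.etree import ElementTree

    # a hand-written miniature sitemap; real files may contain thousands of entries
    sitemap_example = """<?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example.org/</loc>
        <lastmod>2020-01-01</lastmod>
      </url>
      <url>
        <loc>https://www.example.org/page.html</loc>
      </url>
    </urlset>"""

    # parse the XML and collect the address stored in each <url>/<loc> entry
    root = ElementTree.fromstring(sitemap_example.encode("utf-8"))
    namespace = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    urls = [loc.text for loc in root.findall("sm:url/sm:loc", namespace)]
    print(urls)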

Download and filtering

A sitemap.xml file is usually located at the root of a website. If it is present, it is almost always to be found by appending the file name after the domain name and a slash: https://www.sitemaps.org becomes https://www.sitemaps.org/sitemap.xml …
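A minimal Python sketch of this convention, using the example domain above (a guess only: some sites place their sitemap elsewhere and announce it in their robots.txt file instead):

    from urllib.request import urlopen

    homepage = "https://www.sitemaps.org"            # example domain from above
    sitemap_url = homepage.rstrip("/") + "/sitemap.xml"

    # download the sitemap and decode it as text
    with urlopen(sitemap_url) as response:
        sitemap_xml = response.read().decode("utf-8")

    print(sitemap_xml[:250])                         # peek at the beginning of the file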

Ad hoc and general-purpose corpus construction from web sources

While the pervasiveness of digital communication is undeniable, the numerous traces left by users and customers are collected and used for commercial purposes. The creation of digital research objects should provide the scientific community with ways to access and analyze them. Particularly in linguistics, the diversity and quantity of texts present on the internet have to be better assessed in order to make corpora of current texts available, allowing for the description of the variety of language uses and ongoing changes. In addition, transferring the field of analysis from traditional written text corpora to texts taken from the web results in the creation of new tools and new observables. We must therefore provide the necessary theoretical and practical background to establish scientific criteria for research on these texts.

This is the subject of my PhD work, which was carried out under the supervision of Benoît Habert and led to a thesis entitled Ad hoc and general-purpose corpus construction from web sources, defended on June 19th, 2015 at the École Normale Supérieure de Lyon to obtain the degree of Doctor of Philosophy in linguistics.

Methodological considerations

At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics …
