Building a basic specialized crawler
As I went back to crawling over the last few days, I thought it could be helpful to describe the way I do it.
Note that this is for educational purposes only (I am not claiming to have built the fastest and most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. That implies I can tell which links I want to follow from regular expressions alone, because I have observed how the given website is organized (for example, all the article pages of a site might share a URL pattern such as /articles/<number>).
I see two (or possibly three) steps in the process, which I will go through, giving a few hints in pseudocode.
A shell script
You might want to write a shell script to fire the two main phases automatically and/or to save your results at regular intervals (if something goes wrong after a reasonable number of explored pages, you don't want to lose all that work, even if it is mostly CPU time and electricity).
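As a rough illustration, here is a minimal sketch of such a wrapper, assuming the two phases live in two separate scripts (collect_links.sh and crawl_pages.sh are hypothetical names) and that the second phase writes its output into a results/ directory:

    #!/usr/bin/env bash
    # Sketch of a wrapper firing both phases; script names and paths are assumptions.
    set -e
    mkdir -p results

    ./collect_links.sh > links.txt          # phase 1: build the list of links

    ./crawl_pages.sh links.txt results/ &   # phase 2: fetch the pages of interest
    crawler=$!

    # Snapshot the partial results every 10 minutes while the crawler runs,
    # so a crash late in the process does not lose everything.
    while kill -0 "$crawler" 2>/dev/null; do
        sleep 600
        tar czf "backup_$(date +%Y%m%d_%H%M).tar.gz" results/
    done
    wait "$crawler"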
A list of links
If the website has an archive, a sitemap, or a general list of its contents, you can save time by picking out the interesting links once and for all.
going through a shortlist of archives DO {
    fetch page
    find …
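As a rough bash sketch of that first phase, assuming the archive pages are listed one URL per line in a file called archive_pages.txt and that a single site-specific regular expression matches the interesting links (both the file name and the pattern are hypothetical):

    #!/usr/bin/env bash
    # Sketch of phase 1: collect the interesting links from a short list of archive pages.
    # archive_pages.txt and the pattern below are assumptions for the example.
    pattern='href="(/articles/[0-9]+[^"]*)"'

    while read -r url; do
        page=$(curl -s "$url")   # fetch page
        # keep only the matching links and store them for the next phase
        echo "$page" | grep -oE "$pattern" | sed -E 's/^href="//; s/"$//' >> links.txt
    done < archive_pages.txt

    sort -u links.txt -o links.txt   # drop duplicate links

Matching raw HTML with a plain regular expression is crude, but since we only follow links whose shape we already know from looking at the site, it is good enough for this kind of specialized crawl.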