Building a basic specialized crawler
Since I have been crawling again over the last few days, I thought it could be helpful to describe how I go about it.
Note that this is for educational purposes only (I do not claim to have built the fastest or most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. This means I can tell which links I want to follow using regular expressions alone, because I first observe how a given website is organized.
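To make the idea of selecting links by regular expression more concrete, here is a minimal sketch in Python. It is only an illustration and not necessarily the way I do it in practice: the names `LinkCollector`, `links_of_interest` and `ARTICLE_PATTERN` are made up for this example, and the URL layout (`/articles/<year>/<slug>.html`) is an assumption about a hypothetical website.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_of_interest(html, base_url, pattern):
    """Return absolute URLs found in html that match pattern.

    Order of first appearance is kept and duplicates are dropped.
    """
    collector = LinkCollector()
    collector.feed(html)
    seen = set()
    selected = []
    for href in collector.links:
        url = urljoin(base_url, href)  # resolve relative links
        if pattern.search(url) and url not in seen:
            seen.add(url)
            selected.append(url)
    return selected


# Hypothetical layout: articles live under /articles/<year>/<slug>.html
ARTICLE_PATTERN = re.compile(r"/articles/\d{4}/[\w-]+\.html$")
```

Calling `links_of_interest(page_source, "http://example.org/", ARTICLE_PATTERN)` on a downloaded page would then return only the article links and leave out navigation, "about" pages and the like. The pattern has to be adapted by hand to each website, which is exactly why this approach only suits a specialized crawler.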
I see two (or possibly three) steps in the process, which I will go through giving a few hints in …