As I have been crawling again over the last few days, I thought it could be helpful to describe the way I proceed.

Note that this is for educational purposes only (I am not claiming that I built the fastest or most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. This implies that I can identify the links I want to follow with regular expressions alone, because I observe how a given website is organized.
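
To give a concrete idea, here is a minimal sketch in Python of what such a link filter can look like; the URL pattern and the example links are purely hypothetical and have to be adapted to the website at hand.

    import re

    # Hypothetical pattern: keep only article pages such as /2011/05/some-title.html
    article_pattern = re.compile(r"/20\d{2}/\d{2}/[a-z0-9-]+\.html$")

    links = [
        "http://www.example.org/2011/05/some-title.html",
        "http://www.example.org/about.html",
    ]

    # Keep only the links that match the pattern
    interesting = [link for link in links if article_pattern.search(link)]
    print(interesting)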

I see two (or possibly three) steps in the process, which I will go through, giving a few hints in pseudocode.

A shell script

You might want to write a shell script to fire the two main phases automatically and/or to save your results on a regular basis (if something goes wrong after a reasonable number of explored pages, you don’t want to lose all the work, even if it’s mainly CPU time and electricity).
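
As a rough illustration of the “save regularly” part, here is a small Python sketch that backs up the collected results every 100 pages; the function fetch_and_extract, the interval and the file name are made up for the example.

    import json

    SAVE_EVERY = 100  # pages between two backups, an arbitrary value

    def save_results(results, filename="results-backup.json"):
        # Write the current state to disk so a crash does not cost the whole run
        with open(filename, "w", encoding="utf-8") as backup:
            json.dump(results, backup, ensure_ascii=False, indent=1)

    def crawl(pages, fetch_and_extract):
        results = []
        for count, page in enumerate(pages, start=1):
            results.append(fetch_and_extract(page))
            if count % SAVE_EVERY == 0:
                save_results(results)
        save_results(results)  # final save
        return results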

A list of links

If the website has an archive, a sitemap or a general list of its contents, you can save time by picking the interesting links once and for all.

    going through a shortlist of archives DO {
         fetch page
         find links
         FOR each link DO {
              IF it matches a given regular expression {
                   store it in a list
              }
         }
         IF there are other result or archive pages DO {
              FOR all the pages to see {
                   add ?p=... or ?page=... or anything to the last seen page
                   fetch page, find links, FOR each link DO...
              }
         }
    }
    remove the duplicate items from the list
    WRITE to file (after a final check)
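
A possible translation of this pseudocode into Python, using only the standard library; the archive URL, the link pattern and the number of result pages are assumptions that have to be replaced by the values observed on the actual website.

    import re
    import urllib.request

    # Hypothetical starting point and link pattern
    archives = ["http://www.example.org/archive"]
    link_pattern = re.compile(r'href="(/20\d{2}/\d{2}/[a-z0-9-]+\.html)"')
    max_result_pages = 5  # arbitrary limit for paginated archives

    def fetch(url):
        # Download a page and return its text (no error handling for brevity)
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    found = []
    for archive in archives:
        # first archive page, then the paginated ones (?page=2, ?page=3, ...)
        pages = [archive] + [archive + "?page=" + str(n) for n in range(2, max_result_pages + 1)]
        for page_url in pages:
            html = fetch(page_url)
            for link in link_pattern.findall(html):
                found.append(link)

    # Remove duplicates while keeping the original order, then write to file
    unique_links = list(dict.fromkeys(found))
    with open("links.txt", "w", encoding="utf-8") as outfile:
        outfile.write("\n".join(unique_links))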

A page explorer

The main component is the module that indexes (or, in my case, selects) the desired content and stores it in a file.

    going through the list of pages DO {
         fetch page
         # provided there is a lot of information you do not want, it goes faster that way
         cut the top and the bottom
         # provided you want to extract information such as title, date, etc.
         IF there is something like a div class=title or h1 {
              extract it, clean it and store it
         }
         ...
         look for the paragraphs of the text, clean and store them
         write the text with the desired information to file
    }
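
In Python, this step could look roughly like the following; the markers used to cut the page and the regular expressions for the title and the paragraphs are guesses that have to be adapted to the HTML of the target website.

    import re
    import urllib.request

    def fetch(url):
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    # Hypothetical markers and patterns, to be adjusted after inspecting the source code of the pages
    title_pattern = re.compile(r"<h1[^>]*>(.+?)</h1>", re.DOTALL)
    paragraph_pattern = re.compile(r"<p[^>]*>(.+?)</p>", re.DOTALL)
    tag_pattern = re.compile(r"<[^>]+>")

    def clean(fragment):
        # Strip the remaining tags and normalize whitespace
        return " ".join(tag_pattern.sub(" ", fragment).split())

    with open("links.txt", encoding="utf-8") as infile, \
         open("corpus.txt", "w", encoding="utf-8") as outfile:
        for line in infile:
            url = line.strip()
            html = fetch(url)
            # Cut the top and the bottom of the page (hypothetical markers)
            start = html.find('<div class="content">')
            end = html.find('<div class="footer">')
            if start != -1 and end != -1:
                html = html[start:end]
            title_match = title_pattern.search(html)
            title = clean(title_match.group(1)) if title_match else ""
            paragraphs = [clean(p) for p in paragraph_pattern.findall(html)]
            outfile.write(title + "\n" + "\n".join(paragraphs) + "\n\n")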

Remarks:

  • You can also scrape the links on the fly; this lacks the systematic approach, but it works on every website. Moreover, it is fun to see how far you can go starting from a single page.
  • If a text spans several pages, you can either change the URL before you fetch the page (if there is a “text on one page” option) or follow the links from page to page as described above.
  • If you want to store the information in an XML format, you need to substitute a few characters such as " and & (see the sketch after this list).
  • I talk about “pages” where some prefer to use the word “documents”.
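
Concerning the XML remark above, the Python standard library already provides helpers for this substitution; a minimal sketch:

    from xml.sax.saxutils import escape, quoteattr

    text = 'He said "fish & chips" <quickly>'
    # escape() replaces &, < and >; additional characters can be passed explicitly
    print(escape(text, {'"': "&quot;"}))
    # quoteattr() returns a value ready to be used as an XML attribute
    print(quoteattr(text))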

Update:

For an example, see the following post: Crawling a newspaper website to build a corpus.