I would like to build a corpus from a variety of scientific papers in a given field and a given language (German).
Setting the problems of crawling aside, I wonder if there is a way to do this automatically. All the papers I have read deal with hand-collected corpora.
The Open Archives format might be a good workaround (see The Open Archives Initiative Protocol for Metadata Harvesting). As far as I know, it is widely used, and there are search engines for academic papers that rely on this metadata.
Among the most popular ones (Google Scholar, Scirus, OAIster), a few seem to deal with a lot of German texts: Scientific Commons (St. Gallen, CH) and BASE (Bielefeld).
I read an interesting article today about search engines in this particular field: Dirk Pieper and Sebastian Wolf (University Library of Bielefeld), “Wissenschaftliche Dokumente in Suchmaschinen”, in Handbuch Internet-Suchmaschinen, D. Lewandowski (ed.), Heidelberg, 2009. PDF version here.
I could crawl the result pages of a given website and see what I get. I’ll see what I can do.
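As a starting point for the Open Archives route mentioned above, here is a minimal sketch of how harvesting could look. It is an assumption about how one might proceed, not a tested pipeline: it requests `ListRecords` with the `oai_dc` metadata prefix, keeps the records whose `dc:language` looks German, and follows resumption tokens. The function names, the set of language codes and the endpoint URL in the usage note are all hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespaces used in OAI-PMH responses carrying unqualified Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Assumption: repositories encode the language in different ways,
# so this set is a guess and will probably need extending.
GERMAN_CODES = {"de", "deu", "ger", "german", "deutsch"}


def build_url(base_url, resumption_token=None):
    """Build a ListRecords request URL for one page of results."""
    if resumption_token:
        # resumptionToken is an exclusive argument: no metadataPrefix here.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (german_records, resumption_token) for one response page."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI + "record"):
        languages = {
            (e.text or "").strip().lower() for e in record.iter(DC + "language")
        }
        if languages & GERMAN_CODES:
            records.append({
                "title": record.findtext(".//" + DC + "title", default=""),
                "links": [e.text for e in record.iter(DC + "identifier") if e.text],
            })
    token = root.findtext(".//" + OAI + "resumptionToken")
    # An empty or missing token means the list is complete.
    return records, (token or "").strip() or None


def harvest(base_url):
    """Yield German records from a repository, page by page."""
    token = None
    while True:
        with urllib.request.urlopen(build_url(base_url, token)) as response:
            records, token = parse_page(response.read())
        yield from records
        if not token:
            break
```

Something like `for rec in harvest("http://repository.example.org/oai"): print(rec["title"])` would then list candidate papers, with the URL replaced by a real repository endpoint. Two caveats: `dc:language` is optional and inconsistently filled in, so some German papers would be missed, and the metadata only gives links, so fetching the full texts is still a crawling problem.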