Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify the contents allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when webpages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).
In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust scraping patterns, there are for example webpages for which neither the URL nor the server response provide a reliable way to date the document, that is find when it was written.
I regularly work on improving the extraction methods for the web collections at my home institutions. They are unique as they combine both the quantity resulting from broad web crawling and the quality obtained by carefully extracting text and metadata as well as rejecting documents that do not match certain criteria. In that sense, I already published work on methods to derive metadata from web documents in order ...more ...