Filtering links to gather texts on the web
The issue with URLs and URIs
A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI).
Both navigation on the Web and web crawling rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs).” (Berners-Lee et al., 2006) That being said, URLs cannot be expected to be entirely reliable. Especially as part of the Web 2.0 content on the Web is changing faster than ever before, it can be tailored to a particular geographic or linguistic profile and isn’t stable in time.
Although URLs cannot be expected to be perfect predictors for the content which gets downloaded, they are often the only indication according to which crawling strategies are developed. It can be really useful to identify and discard redundant URIs, that is different URIs leading to similar text, which can also be called DUST (Schonfeld et al. 2006). Refining and filtering steps relie on URL components such as host/domain name, path and parameters/query …
more ...