Rule-based URL cleaning for text collections
I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script named clean_urls.py, which I made available under an open-source license as part of the FLUX-toolchain.
The following Python-based regular expressions show how malformed URLs, URLs leading to irrelevant content as well as URLs which obviously lead to adult content and spam can be filtered using a rule-based approach.
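To give a rough idea of what such a rule-based filter can look like, here is a minimal sketch. It is not taken from clean_urls.py: the pattern names, the keyword list, the file extensions and the functions keep_url and clean_urls are all assumptions made up for illustration, and a real filter would need much longer and more carefully tuned lists.

```python
import re

# Hypothetical patterns for illustration only; not the rules used in clean_urls.py.

# Very rough well-formedness check: http(s) scheme, a dotted host name,
# an optional port and an optional path, query or fragment without whitespace.
WELL_FORMED = re.compile(
    r"^https?://[a-z0-9.-]+\.[a-z]{2,}(?::\d+)?(?:[/?#]\S*)?$",
    re.IGNORECASE,
)

# File types that are unlikely to contain usable text.
IRRELEVANT = re.compile(
    r"\.(?:jpe?g|png|gif|css|js|zip|exe|mp3|mp4)(?:\?|$)",
    re.IGNORECASE,
)

# A tiny sample of keywords that hint at adult content or spam.
UNWANTED = re.compile(r"\b(?:porn|xxx|viagra|casino)\b", re.IGNORECASE)


def keep_url(url):
    """Return True if the URL passes all three rule sets."""
    url = url.strip()
    if not WELL_FORMED.match(url):
        return False
    if IRRELEVANT.search(url):
        return False
    if UNWANTED.search(url):
        return False
    return True


def clean_urls(urls):
    """Filter a list of URLs, dropping exact duplicates and keeping input order."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if url in seen or not keep_url(url):
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

Calling clean_urls on a raw list returns only the URLs that look well-formed, do not point to obvious media or archive files, and do not contain any of the listed keywords. Keyword rules of this kind are blunt and will produce false positives, so the lists should be checked against a sample of the data.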
Avoid recurrent sites and patterns to save bandwidth
First, it can be useful to make sure that the URL was properly …