I would like to introduce the way I clean lists of unknown URLs before going further (e.g. by retrieving the documents). I often use a Python script named clean_urls.py, which I made available under an open-source license as part of the FLUX-toolchain.
The following Python regular expressions show how malformed URLs, URLs leading to irrelevant content, and URLs which obviously point to adult content or spam can be filtered using a rule-based approach.
Avoid recurrent sites and patterns to save bandwidth
First, it can be useful to make sure that a URL was properly parsed before it makes it into the list. The very first step is to check whether it starts with the right protocol (FTP is deemed irrelevant in my case).
protocol = re.compile(r'^http', re.IGNORECASE)
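Assuming the URLs are stored in a file with one URL per line, the check could be used as follows (a minimal sketch, urls.txt being a mere placeholder):

import re

protocol = re.compile(r'^http', re.IGNORECASE)

with open('urls.txt', encoding='utf-8') as inputfh:
    for line in inputfh:
        line = line.strip()
        # discard lines which do not start with http(s)
        if not protocol.match(line):
            continue
        # ... further tests follow below ...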
Then, it is necessary to find and extract URLs nested inside of a URL: referrer URLs, links which were not properly parsed, etc.
match = re.search(r'^http.+?(https?://.+?$)', line)
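If such a nested URL is found, the inner part can be kept instead of the whole line, as in this made-up example:

line = 'http://example.org/redirect?url=https://example.com/page'
match = re.search(r'^http.+?(https?://.+?$)', line)
if match:
    # keep the nested URL rather than the wrapper
    line = match.group(1)
# line is now 'https://example.com/page'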
After that, I look at the end of the URL to get rid of URLs pointing to files which are frequent but obviously not text-based; the file extensions can appear both at the end of and inside the URL:
# obvious extensions
extensions = re.compile(r'\.atom$|\.json$|\.css$|\.xml$|\.js$|\.jpg$|\.jpeg$|\.png$|\.gif$|\.tiff$|\.pdf$|\.ogg$|\.mp3$|\.m4a$|\.aac$|\.avi$|\.mp4$|\.mov$|\.webm$|\.flv$|\.ico$|\.pls$|\.zip$|\.tar$|\.gz$|\.iso$|\.swf$', re.IGNORECASE)
# frequent media query schemes, just in case
mediaquery = re.compile(r'\.jpg[&?]|\.jpeg[&?]|\.png[&?]|\.gif[&?]|\.pdf[&?]|\.ogg[&?]|\.mp3[&?]|\.avi[&?]|\.mp4[&?]', re.IGNORECASE)
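Both expressions can then be tested against each URL inside the loop sketched above:

# skip URLs pointing to non-text files or media queries
if extensions.search(line) or mediaquery.search(line):
    continue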
In my case, it saves a lot of time and bandwidth to get rid of frequent advertising and multimedia URL patterns:
notsuited = re.compile(r'^http://add?s?\.|^http://banner\.|doubleclick|tradedoubler\.com|livestream|live\.|videos?\.|feed$|rss$', re.IGNORECASE)
Finally, it is advisable to avoid popular, central websites which usually do not fall within the scope of study and which could waste resources or get the software blocked, since they require a particular approach:
hostnames_filter = re.compile(r'last\.fm|soundcloud\.com|youtube\.com|youtu\.be|vimeo\.com|instagr\.am|instagram\.com|imgur\.com|flickr\.com|google\.|twitter\.com|twitpic\.com|gravatar\.com|akamai\.net|amazon\.com|cloudfront\.com', re.IGNORECASE)
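To sum this part up, the compiled expressions above could be chained in a single helper function (a sketch, the function name is mine):

def is_relevant(url):
    '''Apply the rule-based filters defined above to a single URL.'''
    if not protocol.match(url):
        return False
    if extensions.search(url) or mediaquery.search(url):
        return False
    if notsuited.search(url) or hostnames_filter.search(url):
        return False
    return True

# example: keep only the relevant URLs
# urls = [url for url in urls if is_relevant(url)]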
Blacklisting and rule-based filtering of spam and adult content
It is also possible to use a blacklist of URLs or domain names as input. Such a list can be retrieved from shallalist.de using a script I wrote (available on GitHub), which focuses on a particular subset of spam categories.
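The simplest way to use such a list is to load the domain names into a set and compare it against the hostname of each URL (a sketch, blacklist.txt being a placeholder):

from urllib.parse import urlsplit

# hypothetical file with one blacklisted domain name per line
with open('blacklist.txt', encoding='utf-8') as blfh:
    blacklist = {l.strip().lower() for l in blfh if l.strip()}

def is_blacklisted(url):
    '''Check whether the hostname of the URL is on the blacklist.'''
    hostname = urlsplit(url).hostname
    return hostname is not None and hostname in blacklist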
Last, since adult content is everywhere and usually well-interlinked (at least in my experience), it saves time to look for obvious patterns. I compiled the following list using large URL lists, paying attention to false positives. While the precision could still be improved, I am now fairly confident that the recall of this method is good, since the patterns are distinctive enough not to be used by people who do not host this kind of content.
# re.IGNORECASE flag or line.lower()
if re.search(r'[\./_-](porno?|xxx)', line, re.IGNORECASE) or \
   re.search(r'(cams|cash|porno?|sex|xxx)[\./_-]', line, re.IGNORECASE) or \
   re.search(r'gangbang|incest', line, re.IGNORECASE) or \
   re.search(r'[\./_-](adult|ass|sex)[\./_-]', line, re.IGNORECASE):
    passing_test = False
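The same test can be packaged into a small function, here using line.lower() instead of the flag, as the comment above suggests:

def is_adult(url):
    '''Look for obvious adult-content patterns in the URL.'''
    lower = url.lower()
    return bool(
        re.search(r'[\./_-](porno?|xxx)', lower)
        or re.search(r'(cams|cash|porno?|sex|xxx)[\./_-]', lower)
        or re.search(r'gangbang|incest', lower)
        or re.search(r'[\./_-](adult|ass|sex)[\./_-]', lower)
    )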
I hope it helps!
Update: The courlan library now integrates all the ideas described above; check it out!
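A quick taste (see the documentation of courlan for the exact API and options):

from courlan import check_url

# returns a (cleaned url, domain) tuple, or None if the URL is rejected
check_url('https://github.com/adbar/courlan')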