The analysis of URLs using natural language processing methods has recently become a research topic in its own right, all the more so since large URL lists are considered part of the big data paradigm. Given the number of available web pages and the cost of processing large amounts of data, classifying web pages solely by their URLs, without fetching the documents they link to, is now an Information Retrieval task.
Why is that so, and what can be learned from these methods?
Interest and objectives
Obviously, the URLs contain clues …