The analysis of URLs using natural language processing methods has recently become a research topic in itself, all the more since large URL lists are considered part of the big data paradigm. Given the quantity of available web pages and the cost of processing large amounts of data, it is now an Information Retrieval task to try to classify web pages merely by taking their URLs into account, without fetching the documents they link to.
Why is that so, and what can be taken away from these methods?
Interest and objectives
Obviously, URLs contain clues regarding the resource they point to. URL analysis is about extracting as much information as possible from them in order to predict several characteristics of a web page. The results may influence the way the URL is processed: prioritization, delay, building of focused URL groups, etc.
The main goal seems to be to save crawling time, bandwidth and disk space, issues anyone confronted with web-scale crawling has to deal with.
However, one could also argue that it is sometimes hard to figure out what hides behind a URL. Kan & Thi (2005) tackle this issue under the assumption that there is noise in other methods as well, such as textual and summarization data, so that URL classification could be a way to efficiently bypass or complement them.
I recently attended a talk at ACL 2013 where the authors described an easy way to build parallel corpora from the web for machine translation training. I may analyze it in a future post.
URL analysis seems to be considered a classification task, where the goal is to predict the language or the genre of web pages. Following the trend towards artificial intelligence, machine learning techniques are applied in order to discriminate between several known classes. In most cases this means that URLs are mapped to numerical feature vectors, so that feature discovery (or selection) as well as feature weighting are at the center of interest.
There are two main approaches to URL analysis: token-based features on the one hand and n-gram-based features on the other. In both cases, there are no encoding problems, since URL encoding can be expected to be constant:
“Unlike n-grams for feature extraction from webpages, using n-grams in feature extraction from URLs is less susceptible to evolutionary encoding changes.” (Abramson & Aha 2012)
Besides, researchers trying to detect spam or phishing URLs also use host-based features, such as WHOIS or PageRank information (Ma et al. 2009), which are highly relevant in that case, so that they do not rely primarily on fine-grained URL analysis.
The token-based approach is mainly about extracting dictionary words from parts of the URL. This is not easy, mainly because words may be concatenated and because parts of the URL may be in other languages or follow different conventions and practices. The supporters of n-gram-based analysis tend to think that token detection is too unreliable, and that n-grams are also bound to capture the semantic dimension which tokens may reveal.
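As a rough sketch, token extraction usually amounts to splitting the host, path and query on non-alphanumeric characters and filtering the fragments. The heuristics below (minimum length, dropping numbers) are illustrative assumptions of mine, not a reference implementation from any of the cited papers:

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL into candidate dictionary tokens (hypothetical helper).

    Splits host, path and query on non-alphanumeric characters, then
    drops purely numeric fragments and very short tokens, which rarely
    correspond to dictionary words.
    """
    parsed = urlparse(url)
    raw = parsed.netloc + " " + parsed.path + " " + parsed.query
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", raw) if t]
    return [t for t in tokens if not t.isdigit() and len(t) > 2]

print(url_tokens("http://www.example.org/world-news/2013/paris-summit.html"))
# ['www', 'example', 'org', 'world', 'news', 'paris', 'summit', 'html']
```

Note that a concatenated token like `worldnews` would survive this split unscathed, which is exactly the weakness mentioned above.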
That said, n-gram-based approaches are not strictly based on n-grams of a single length; their components are rather called “all-grams” (for instance, everything from 3 to 6 characters). Moreover, there are papers like Abramson & Aha (2012) where n-grams may be decomposed: “the n-grams are extracted on a sliding window of size n from the URL string and then decomposed when needed”. In fact, even for an n-gram-based approach, capturing n-grams that span token boundaries rather than staying within a token is not productive.
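An “all-grams” extractor in that spirit could look as follows. This is a sketch assuming the 3-to-6-character range mentioned above as an example, and it deliberately keeps n-grams within token boundaries:

```python
import re

def all_grams(token_string, n_min=3, n_max=6):
    """Extract every character n-gram of length n_min to n_max from
    each token, without crossing token boundaries (sketch; the 3-6
    range is only the example quoted above, not a fixed standard)."""
    grams = set()
    for token in re.split(r"[^a-z0-9]+", token_string.lower()):
        for n in range(n_min, n_max + 1):
            for i in range(len(token) - n + 1):
                grams.add(token[i:i + n])
    return grams

grams = all_grams("forum.example.org/news")
# "example" contributes 'exa', 'xam', ..., 'exampl', 'xample';
# nothing straddles the '.' or '/' separators.
```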
All in all, the first intuition everyone could have regarding the top-level domain (TLD) is misleading when it comes to predicting the language of the resource. Apart from the fact that the majority of web pages are in English, which may boost the results for a few domain names, and occasional successes in particular cases, the value of the TLD should not be overestimated.
For Baykan, Henzinger & Weber (2008), it even proves to be a poor indicator because of the heterogeneous nature of the TLDs .com and .org. The authors state that for applications where recall is important, ccTLDs (country code top-level domains) and what they call ccTLD+ (which additionally treats .com and .org as English) should not be used as language classifiers.
Still according to Baykan, Henzinger & Weber (2008), trigrams are the most widely used feature for language identification, since they outperform the common-word approach when the text is short without losing ground when the texts are longer. As the “texts” are definitely short in the case of URLs, this seems to be the way to go.
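As a toy illustration of the trigram idea, one can compare the character trigram profile of a URL fragment against per-language profiles. The tiny word lists and the raw overlap scoring below are my own simplifications; real systems train profiles on large corpora:

```python
from collections import Counter

def trigrams(text):
    """Character trigram counts over the letters of text."""
    text = "".join(c for c in text.lower() if c.isalpha())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Toy profiles built from a handful of words per language --
# placeholders, not actual training data from the cited papers.
profiles = {
    "en": trigrams("the news world international weather sports"),
    "de": trigrams("die nachrichten welt international wetter sport"),
}

def guess_language(url_part):
    """Pick the language whose profile shares the most trigram mass."""
    target = trigrams(url_part)
    return max(profiles, key=lambda lang: sum((profiles[lang] & target).values()))

print(guess_language("nachrichten-wetter"))  # -> 'de'
```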
Word segmentation techniques are divided by Kan & Thi (2005) into four categories: statistical, dictionary-based, syntax-based and conceptual methods. Their main statistical indicator is segmentation by information content (or entropy) reduction: a token can be split “if the partitioning’s entropy is lower than the token’s”.
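A toy version of this criterion might look as follows, using plain character-level Shannon entropy as the information measure; Kan & Thi's actual information-content computation is more elaborate, so this only illustrates the "split if entropy drops" idea:

```python
import math
from collections import Counter

def entropy(s):
    """Shannon entropy of the character distribution of s, in bits."""
    counts = Counter(s)
    total = len(s)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def split_by_entropy(token):
    """Toy entropy-reduction segmentation: try every split point and
    keep the one with the lowest summed entropy, but only if it is
    lower than the unsplit token's entropy (my simplification)."""
    best = (token, entropy(token))
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        cost = entropy(left) + entropy(right)
        if cost < best[1]:
            best = (left + " " + right, cost)
    return best[0]

print(split_by_entropy("aaabbb"))  # -> 'aaa bbb'
```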
Finally, there are also dictionary-based methods which use lists of words indicative of a certain topic or language (Baykan et al. 2009), obtained from existing directories such as the Open Directory Project.
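A minimal dictionary-based classifier just scores the overlap between URL tokens and per-topic word lists. The lists below are invented placeholders, not actual ODP data:

```python
# Hypothetical topic word lists, e.g. derived from directory category labels.
topic_words = {
    "sports": {"football", "league", "match", "team"},
    "news": {"politics", "world", "breaking", "headlines"},
}

def dictionary_topic(tokens):
    """Pick the topic whose word list overlaps most with the URL tokens,
    or None if no indicative word is found."""
    scores = {t: len(words & set(tokens)) for t, words in topic_words.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(dictionary_topic(["www", "example", "football", "league"]))  # -> 'sports'
```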
First of all, results show that classification from URLs can give equal or even surprisingly better results than classification from web pages for genre classification (Abramson & Aha 2012, Kan & Thi 2005). This also explains why URL analysis tends to become a research topic per se.
Kan & Thi (2005) provide a detailed analysis, but the results cannot easily be compared to other papers: it would be interesting to know how effective the token segmentation really is on the other test sets. In fact, according to the authors, their classifier performs well on long URLs but less so on typical web site entry points.
According to Baykan, Henzinger & Weber (2008), word-based features performed best regarding language identification, with trigrams being remarkably interesting if the set of training URLs is small. Concerning topic classification, all-grams give the best results and tokens the worst (Baykan et al. 2009).
Baykan, Henzinger & Weber (2008) raise convincing questions about evaluation methodology: if a binary classification is performed (i.e. the URL either belongs to a given category or not), then the classifier should be evaluated against a weighted number of positive and negative examples. Otherwise the results are biased, since the baseline is high and an intentionally loose or tight classifier can achieve good results. The authors state that positive and negative recall are better metrics in that case than precision or F-measure.
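Both metrics are trivial to compute from the confusion counts; the point is that reporting them together exposes degenerate classifiers:

```python
def pos_neg_recall(tp, fn, tn, fp):
    """Positive recall (sensitivity) and negative recall (specificity).

    Reporting both exposes classifiers that are intentionally loose
    (label everything positive) or tight (label everything negative).
    """
    return tp / (tp + fn), tn / (tn + fp)

# A classifier that labels every URL positive looks perfect on
# positive recall alone, but negative recall reveals the bias:
print(pos_neg_recall(tp=90, fn=0, tn=0, fp=10))  # (1.0, 0.0)
```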
Overall, the n-gram approach seems to work with little training data and to achieve decent results in all cases. The way the URL is segmented into tokens is crucial, and it may explain, at least partly, why a classifier fails. Most of the tasks reviewed here were binary classification tasks; other questions arise when there are more than two classes, most notably the huge number of languages and topics to be found on the web and the difficulty of building exhaustive lists and/or relevant groups.
- Abramson M. and Aha D.W. (2012), “What’s in a URL? Genre Classification from URLs”, in Intelligent Techniques for Web Personalization and Recommender Systems. AAAI Technical Report. Association for the Advancement of Artificial Intelligence.
- Baykan E., Henzinger M., Marian L. and Weber I. (2009), “Purely URL-based topic classification”, in Proceedings of the 18th International Conference on World Wide Web, pp. 1109-1110.
- Baykan E., Henzinger M. and Weber I. (2008), “Web Page Language Identification Based on URLs”, in Proceedings of the VLDB Endowment, Vol. 1(1), pp. 176-187. VLDB Endowment.
- Kan M.-Y. and Thi H.O.N. (2005), “Fast webpage classification using URL features”, in Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 325-334. Association for Computing Machinery.
- Ma J., Saul L.K., Savage S. and Voelker G.M. (2009), “Identifying suspicious URLs: an application of large-scale online learning”, in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 681-688.