I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since I found relevant but sparse hints on how to do it.
First of all, let’s say that guessing whether a website uses WordPress by analysing its HTML code is straightforward if nothing has been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterwards is an option that should not be too costly if the number of web pages to analyze is small. However, downloading even a reasonable number of web pages may take a lot of time, which is why other techniques have to be found to address this issue.
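For the record, the brute-force check itself is simple. Here is a minimal sketch using the requests library; the markers and the function name are my own choices and by no means an exhaustive list:

```python
import requests

# Strings which typically appear in the HTML of an unmodified WordPress site
WP_MARKERS = ("/wp-content/", "/wp-includes/", 'content="WordPress')

def looks_like_wordpress_html(url):
    "Download a page and search its source for common WordPress markers."
    response = requests.get(url, timeout=10)
    return any(marker in response.text for marker in WP_MARKERS)
```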
The way I chose to do it is twofold: the first filter is URL-based, whereas the final selection uses HTTP HEAD requests.
URL Filter
There are webmasters who create a subfolder named “wordpress” which can be seen clearly in the URL, providing a kind of K.O. victory. If the URL points to a non-text document, the default settings place it in a “wp-content” subfolder, which the URL is bound to feature, leading to another clear case.
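These clear cases boil down to a simple substring or pattern test; the following sketch is just one way to write it, the regular expression and the function name being merely illustrative:

```python
import re

# "/wordpress/" or "/wp-content/" in the path is a strong signal on its own
CLEAR_CASE = re.compile(r"/wordpress/|/wp-content/", re.IGNORECASE)

def is_clear_case(url):
    "Detect the obvious cases directly from the URL."
    return CLEAR_CASE.search(url) is not None
```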
An overview of the URL patterns used by WordPress is available on the Using Permalinks page of their website.
In fact, what WordPress calls “permalink settings” defines five common URL structures as well as a vocabulary to write a customized one. Here are the so-called “common settings” (which almost every website concerned uses, one way or another):
- default: ?p= or ?page_id= or ?paged=
- date: /year/ and/or /month/ and/or /day/ and so on
- post number: /keyword/number (where keyword is for example “archives”)
- tag or category: /tag/ or /category/
- post name: very long URLs containing a lot of hyphens
The first three patterns yield good results in practice; the only problem with dates comes from news websites, which tend to use dates very frequently in their URLs. In that case the accuracy of the prediction is poor.
The last pattern is used broadly; it does not say much about a website, apart from hinting that it relies on search engine optimization techniques. Whether one wants to take advantage of it mostly depends on the recall objectives and on the characteristics of the URL set, that is to say, on one hand, whether all possible URLs are to be covered and, on the other hand, whether this pattern seems to be significant. It also depends on how much time one can afford to waste running the second step.
Examples of regular expressions:
20[0-9]{2}/[0-9]{2}/
/tag/|/category/|\?cat=
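Put together, the URL-based filter can look like the following sketch, which reuses the patterns above; treating all of them as equally strong hints is a simplification on my part:

```python
import re

# Patterns derived from the permalink structures discussed above
DEFAULT_PATTERN = re.compile(r"\?p=|\?page_id=|\?paged=")
DATE_PATTERN = re.compile(r"20[0-9]{2}/[0-9]{2}/")
TAG_CATEGORY_PATTERN = re.compile(r"/tag/|/category/|\?cat=")

def url_filter(url):
    "Return True if the URL structure hints at a WordPress site."
    return any(
        pattern.search(url)
        for pattern in (DEFAULT_PATTERN, DATE_PATTERN, TAG_CATEGORY_PATTERN)
    )
```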
HEAD requests
The URL analysis relies on the standard configuration, and this step does so even more. Customized websites are not easy to detect, and most of the criteria listed here will fail on them.
These questions on wordpress.stackexchange.com gave me the right clues to start performing these requests:
- Detecting a WordPress URL without doing a full HTTP GET?
- Steps to Take to Hide the Fact a Site is Using WordPress?
HEAD requests are part of the HTTP protocol. Like the most frequent request, GET, which fetches the content, they are supposed to be implemented by every web server. A HEAD request asks for the meta-information written in the response headers without downloading the actual content. That is why no webpage is actually “seen” during the process, which makes it a lot faster.
One or several requests per domain name are sufficient, depending on the desired precision:
- A request sent to the homepage is bound to yield pingback information to be used via the XMLRPC protocol: this is the “X-Pingback” header. Note that if there is a redirect, this header usually points to the “real” domain name and/or path, followed by “xmlrpc.php”.
- A common extension speeds up page downloads by creating a cached version of every page on the website. This extension adds a “WP-Super-Cache” header to the response. If the first hint is not enough to be sure the website is using WordPress, this one does the trick.
NB: there are webmasters who deliberately give false information, but they are rare.
- A request sent to “/login” or “/wp-login.php” should yield an HTTP status like 2XX or 3XX; a 401 can also happen.
- A request sent to “/feed” or “/wp-feed.php” should yield the header “Location”.
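Here is a minimal sketch of how these requests could be chained with the requests library; the header names and paths are the ones listed above, while the function itself, the order of the checks and the return values are just one possible arrangement:

```python
import requests

def head_probe(url, timeout=10):
    "Send a few HEAD requests and look for typical WordPress signals."
    homepage = requests.head(url, allow_redirects=True, timeout=timeout)
    # Strong hints: pingback or cache headers set by default installations
    if "X-Pingback" in homepage.headers or "WP-Super-Cache" in homepage.headers:
        return True
    # Weaker hint: the login page answers with a 2XX, 3XX or 401 status
    login = requests.head(url.rstrip("/") + "/wp-login.php", timeout=timeout)
    if login.status_code < 400 or login.status_code == 401:
        return True
    # Weaker hint: the feed URL redirects, i.e. a "Location" header is present
    feed = requests.head(url.rstrip("/") + "/feed", timeout=timeout)
    return "Location" in feed.headers
```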
The criteria listed above can be used separately or in combination. I chose to use a kind of simple decision tree.
Sending more than one request makes the guess more precise; it also makes it possible to detect redirects and check for the “real” domain name. As this operation sometimes really helps to deduplicate a URL list, it is rarely a waste of time.
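Assuming the url_filter and head_probe functions sketched above, the overall decision could be as simple as this; it is only an illustration of the two-step logic, not the exact tree I use:

```python
def looks_like_wordpress(url):
    "Cheap URL-based filter first, HEAD requests for the final decision."
    if not url_filter(url):
        return False
    return head_probe(url)
```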
Last, let’s mention that it is useful to exclude a few common false positives, which can be ruled out using this kind of regular expression:
\.blogspot\.|\.google\.|\.tumblr\.|\.typepad\.com|\.wp\.com|\.archive\.|akamai|fbcdn|baidu\.com
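A sketch of this last exclusion step, with the regular expression above wrapped in a hypothetical helper:

```python
import re

# Common false positives: hosted platforms, CDNs and archives matched by the URL patterns
FALSE_POSITIVES = re.compile(
    r"\.blogspot\.|\.google\.|\.tumblr\.|\.typepad\.com|\.wp\.com|\.archive\.|akamai|fbcdn|baidu\.com"
)

def worth_checking(url):
    "Discard known false positives before running the other checks."
    return FALSE_POSITIVES.search(url) is None
```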
Update: The courlan library now integrates all the ideas described above, check it out!