Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I gave a talk on a paper of mine, Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources, and presented another paper written with Felix Bildhauer and Roland Schäfer, Focused Web Corpus Crawling.

Summary

The comparison started from web crawling experiments I performed at the FU Berlin. The conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easy to gather that way until search engine companies restricted this kind of access, meaning that one now has to pay and/or wait longer to send queries.

I tried to evaluate the leading approach and to find decent substitutes using social networks as well as the Open Directory Project and Wikipedia. I took four different languages (Dutch, French, Indonesian and Swedish) as examples in order to compare several web spaces with different, if not opposed, characteristics.

My results show no clear winner: complementary approaches are called for, and it seems possible to replace, or at least complement, the existing BootCaT approach. I think that crawling problems such as link/host diversity have not ...
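As a side note, here is what a very simple host diversity measure could look like in Python; the function and the sample URLs are illustrative assumptions on my part, not material from the paper.

```python
from collections import Counter
from urllib.parse import urlparse

def host_diversity(urls):
    """Ratio of distinct hosts to total URLs: 1.0 means every URL sits on its own host."""
    hosts = Counter(urlparse(url).netloc.lower() for url in urls)
    return len(hosts) / len(urls) if urls else 0.0

# Purely illustrative seed list
seeds = [
    "http://example.org/page1",
    "http://example.org/page2",
    "http://blog.example.net/post",
]
print(round(host_diversity(seeds), 2))  # 0.67: only two distinct hosts among three URLs
```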

more ...

Challenges in web corpus construction for low-resource languages

I recently presented a paper at the third LRL Workshop (a joint LTC-ELRA-FLaReNet-META-NET workshop on “Less Resourced Languages, new technologies, new challenges and opportunities”).

Motivation

The state-of-the-art tools of the “web as corpus” framework rely heavily on URLs obtained from search engines. Recently, this querying process has become very slow or impossible to perform on a low budget.

Moreover, there are diverse and partly unknown search biases related to search engine optimization tricks and undocumented PageRank adjustments, so that diverse sources of seed URLs could at least ensure that there is not a single bias, but several. Lastly, the evolving structure of web documents and a shift from “web AS corpus” to “web FOR corpus” (an increasing number of web pages and the necessity to use sampling methods) complete what I call the post-BootCaT world in web corpus construction.

Study: What are viable alternative data sources for lesser-known languages?

Trying to find reliable data sources for Indonesian, the official language of Indonesia, a country with a population of 237,424,363 of which 25.90 % are internet users (2011, official Indonesian statistics institute), I performed a case study of different kinds of URL sources and crawling strategies.
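To give a rough idea of what a first pass over URLs gathered from heterogeneous sources might look like, here is a minimal sketch that merely groups them by top-level domain; this is an illustrative simplification, not the actual classification used in the study.

```python
from collections import defaultdict
from urllib.parse import urlparse

def group_by_tld(urls):
    """Group URLs by the last label of their host name (a crude proxy for origin)."""
    groups = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc.lower()
        tld = host.rsplit(".", 1)[-1] if "." in host else host
        groups[tld].append(url)
    return groups

# Illustrative input only
urls = ["http://example.co.id/a", "http://example.org/b", "http://contoh.id/c"]
for tld, members in sorted(group_by_tld(urls).items()):
    print(tld, len(members))  # id 2, org 1
```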

First, I classified URLs extracted ...

more ...

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic, since the hints I found on how to do it were relevant but sparse.

First of all, let’s say that guessing whether a website uses WordPress by analysing its HTML code is straightforward if nothing has been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterwards is an option that should not be too costly if the number of web pages to analyze is small. However, downloading even a reasonable number of web pages may take a lot of time, which is why other techniques have to be found to address this issue.
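As a rough illustration of such an HTML-based check, the sketch below downloads a page and searches for markers commonly associated with WordPress (a generator meta tag, wp-content or wp-includes paths); the exact markers are my assumptions, not a definitive test.

```python
import re
import urllib.request

# Common WordPress traces in HTML source (assumed markers, not an exhaustive list)
WP_MARKERS = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+wordpress|wp-content/|wp-includes/',
    re.IGNORECASE,
)

def looks_like_wordpress_html(url, timeout=10):
    """Download (part of) a page and check it for typical WordPress markers."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        html = response.read(200000).decode("utf-8", errors="replace")
    return bool(WP_MARKERS.search(html))
```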

The way I chose to do it is twofold: the first filter is URL-based, whereas the final selection uses HTTP HEAD requests.
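Before going into the details, here is a rough sketch of what such a two-stage check could look like; the URL patterns and the header heuristic (an X-Pingback header pointing to xmlrpc.php) are typical WordPress signals I assume for illustration, not necessarily the exact rules used in this project.

```python
import re
import urllib.request

URL_HINT = re.compile(r"/wordpress/|/wp-content/|\?p=\d+", re.IGNORECASE)

def url_filter(url):
    """Stage 1: keep URLs whose surface form already hints at WordPress."""
    return bool(URL_HINT.search(url))

def head_check(url, timeout=10):
    """Stage 2: issue a HEAD request and inspect the headers for a pingback endpoint."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        pingback = response.headers.get("X-Pingback", "")
    return pingback.endswith("xmlrpc.php")

def is_probably_wordpress(url):
    # Accept cheap URL evidence first, only fall back to a network request otherwise.
    return url_filter(url) or head_check(url)
```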

URL Filter

There are webmasters who create a subfolder named “wordpress” which can be seen clearly in the URL, providing a kind of K.O. victory. If the URL points to a non-text ...

more ...

Overview of URL analysis and classification methods

The analysis of URLs using natural language processing methods has recently become a research topic in itself, all the more so since large URL lists are considered to be part of the big data paradigm. Due to the quantity of available web pages and the cost of processing large amounts of data, it is now an Information Retrieval task to try to classify web pages merely by taking their URLs into account, without fetching the documents they point to.

Why is that so, and what can be taken away from these methods?

Interest and objectives

Obviously, URLs contain clues regarding the resource they point to. URL analysis is about getting as much information as possible out of them in order to predict several characteristics of a web page. The results may influence the way the URL is processed: prioritization, delay, building of focused URL groups, etc.

The main goal seems to be to save crawling time, bandwidth and disk space, which are issues everyone confronted with web-scale crawling has to deal with.
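As a toy example of what URL-only classification can look like, here is a sketch that tokenizes a URL and matches the tokens against a small keyword lexicon standing in for a trained model; the lexicon, labels and example URL are invented for illustration.

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split host and path into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    return re.findall(r"[a-z0-9]+", (parsed.netloc + " " + parsed.path).lower())

# Hand-made lexicon standing in for a trained classifier (illustrative only)
TOPIC_KEYWORDS = {
    "sports": {"football", "match", "league"},
    "news": {"news", "article", "politics"},
}

def guess_topic(url):
    """Pick the label whose keywords overlap most with the URL tokens."""
    tokens = set(url_tokens(url))
    scores = {topic: len(tokens & keywords) for topic, keywords in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(guess_topic("http://www.example.com/news/politics/article-123.html"))  # news
```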

However, one could also argue that it is sometimes hard to figure out what hides behind a URL. Kan & Thi (2005) tackle this issue under the assumption that there ...

more ...