Using sitemaps to crawl websites on the command-line

This post describes practical ways to crawl websites and by working with sitemaps on the command-line. Sitemaps are particularly useful since they are made so that machines can more intelligently crawl the site. The post entails all necessary code snippets to optimize link discovery and filtering.

more ...

Ad hoc and general-purpose corpus construction from web sources

While the pervasiveness of digital communication is undeniable, the numerous traces left by users-customers are collected and used for commercial purposes. The creation of digital research objects should provide the scientific community with ways to access and analyze them. Particularly in linguistics, the diversity and quantity of texts present on the internet have to be better assessed in order to make current text corpora available, allowing for the description of the variety of languages uses and ongoing changes. In addition, transferring the field of analysis from traditional written text corpora to texts taken from the web results in the creation …

more ...

Bibliography

Work in progress towards a page listing (web) corpus linguistics references and resources.

Summary

Corpus Linguistics and Corpus Building

  • The Routledge Handbook of Corpus Linguistics, 1 ed., O’Keeffe, A. and McCarthy, M., Eds., London, New York: Routledge, 2010.
  • N. Bubenhofer, Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge, Zürich:2009.
  • S. Loiseau, “Corpus, quantification et typologie textuelle”, Syntaxe et sémantique, vol. 9, pp. 73-85, 2008.
  • C. Draxler, Korpusbasierte Sprachverarbeitung, Günter Narr, 2008. M. Cori, “Des méthodes de traitement automatique aux linguistiques fondées sur les corpus”, Langages, vol. 171, iss. 3, pp. 95-110 …
more ...

Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I had a talk for a paper of mine, Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources, and another with Felix Bildhauer and Roland Schäfer, Focused Web Corpus Crawling.

Summary

The comparison I did started from web crawling experiments I performed at the FU Berlin. The fact is that the conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easily gathered that way until search engine companies restricted this allowance, meaning that …

more ...

Challenges in web corpus construction for low-resource languages

I recently presented a paper at the third LRL Workshop (a joint LTC-ELRA-FLaReNet-META_NET workshop on “Less Resourced Languages, new technologies, new challenges and opportunities”).

Motivation

The state of the art tools of the “web as corpus” framework rely heavily on URLs obtained from search engines. Recently, this querying process became very slow or impossible to perform on a low budget.

Moreover, there are diverse and partly unknown search biases related to search engine optimization tricks and undocumented PageRank adjustments, so that diverse sources of URL seeds could at least ensure that there is not a single bias, but …

more ...

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since I found relevant but sparse hints on how to do it.

First of all, let’s say that guessing if a website uses WordPress by analysing HTML code is straightforward if nothing was been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterward is an option …

more ...

Overview of URL analysis and classification methods

The analysis of URLs using natural language processing methods has recently become a research topic by itself, all the more since large URL lists are considered as being part of the big data paradigm. Due to the quantity of available web pages and the costs of processing large amounts of data, it is now an Information Retrieval task to try to classify web pages merely by taking their URLs into account and without fetching the documents they link to.

Why is that so and what can be taken away from these methods?

Interest and objectives

Obviously, the URLs contain clues …

more ...

Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) to use them as web crawling seeds. At least by the last two of them a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case concerning Twitter for example.

Hypotheses:

  1. These platforms account for a relative diversity of user profiles.
  2. Documents that are most likely to be important are being shared.
  3. It becomes possible to cover languages which are more rarely seen on the Internet, below the English-speaking spammer’s radar.
  4. Microblogging services are …
more ...

What is good enough to become part of a web corpus?

I recently worked at the FU Berlin with Roland Schäfer and Felix Bildhauer on issues related to web corpora. One of them deals with corpus construction: as a matter of fact, web documents can be very different, and even after a proper cleaning it is not rare to see things that could hardly be qualified as texts. While there are indubitably clear cases such as lists of addresses or tag clouds, it is not always obvious to define how extensive the notions of text and corpus are. What’s more, a certain amount of documents just end up too close …

more ...

Feeding the COW at the FU Berlin

I am now part of the COW project (COrpora on the Web). The project has been carried by (amongst others) Roland Schäfer and Felix Bildhauer at the FU Berlin for about two years. Work has already been done, especially concerning long-haul crawls in several languages.

Resources

A few resources have already been made available, software, n-gram models as well as web-crawled corpora, which for copyright reasons are not downloadable as a whole. They may be accessed through a special interface (COLiBrI – COW’s Light Browsing Interface) or downloaded upon request in a scrambled form (all sentences randomly reordered).

This is …

more ...