Overview of URL analysis and classification methods

The analysis of URLs using natural language processing methods has recently become a research topic in itself, all the more so since large URL lists are considered part of the big data paradigm. Given the sheer quantity of available web pages and the cost of processing large amounts of data, it is now an Information Retrieval task in its own right to classify web pages merely by their URLs, without fetching the documents they link to.
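To give a rough idea of what URL-only classification can look like, here is a minimal Perl sketch: the URL is lowercased, split into tokens, and matched against small hand-made keyword lists. The category names and keyword lists are purely illustrative, not taken from any of the studies discussed here.

#!/usr/bin/perl
# Minimal sketch: guess a topic from URL tokens alone.
# Categories and keyword lists are illustrative placeholders.
use strict;
use warnings;

my %keywords = (
    sports   => [qw(football sport match results)],
    politics => [qw(election parliament minister politics)],
);

sub classify_url {
    my $url = shift;
    # tokenize: lowercase, then split on anything that is not a letter or a digit
    my @tokens = grep { length } split /[^a-z0-9]+/, lc $url;
    foreach my $category (keys %keywords) {
        foreach my $kw (@{ $keywords{$category} }) {
            return $category if grep { $_ eq $kw } @tokens;
        }
    }
    return 'unknown';
}

print classify_url('http://www.example.com/sport/football/match-report-123.html'), "\n";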

Why is that so and what can be taken away from these methods?

Interest and objectives

Obviously, the URLs contain clues …

more ...

Introducing the Microblog Explorer

The Microblog Explorer project is about gathering URLs from social networks (FriendFeed, identi.ca, and Reddit) in order to use them as web crawling seeds. For at least the last two of these, a crawl appears to be manageable in terms of both API accessibility and corpus size, which is not the case for Twitter, for example.
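To make the seed-gathering idea more concrete, here is a rough Perl sketch (not the actual Microblog Explorer code) that collects outgoing links from a public Reddit JSON listing. The subreddit is a placeholder, and a descriptive User-Agent string is set because Reddit's API guidelines ask for one.

#!/usr/bin/perl
# Sketch: collect external links from a Reddit listing to use as crawl seeds.
# The subreddit name is a placeholder.
use strict;
use warnings;
use LWP::UserAgent;
use JSON qw(decode_json);

my $ua = LWP::UserAgent->new(agent => 'microblog-explorer-sketch/0.1');
my $response = $ua->get('http://www.reddit.com/r/linguistics/new.json?limit=100');
die 'Could not fetch listing: ', $response->status_line unless $response->is_success;

my $listing = decode_json($response->decoded_content);
foreach my $item (@{ $listing->{data}{children} }) {
    my $url = $item->{data}{url};
    # keep only external links, not links pointing back to reddit itself
    print "$url\n" if defined $url && $url !~ m/reddit\.com/;
}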

Hypotheses:

  1. These platforms attract a relatively diverse range of user profiles.
  2. The documents that are most likely to be important are the ones being shared.
  3. It becomes possible to cover languages that are more rarely seen on the Internet, below the radar of English-speaking spammers.
  4. Microblogging services are …
more ...

What is good enough to become part of a web corpus?

I recently worked at the FU Berlin with Roland Schäfer and Felix Bildhauer on issues related to web corpora. One of them concerns corpus construction: as a matter of fact, web documents can differ widely, and even after thorough cleaning it is not rare to find content that can hardly be qualified as text. While there are indisputably clear cases, such as lists of addresses or tag clouds, it is not always obvious how far the notions of text and corpus should extend. What's more, a certain number of documents simply end up too close …
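As a purely illustrative sketch (not the criteria actually used in the project), a crude filter could reject documents whose share of function words or average sentence length falls below a threshold:

#!/usr/bin/perl
# Naive sketch of a text-quality heuristic: documents with very few function
# words or very short "sentences" (tag clouds, address lists) are rejected.
# The word list and the thresholds are illustrative guesses.
use strict;
use warnings;

my %is_function = map { $_ => 1 } qw(the a of and to in is der die das und le la et);

sub looks_like_text {
    my $doc = shift;
    my @tokens = grep { length } split /\s+/, lc $doc;
    return 0 unless @tokens;
    my $function_count = grep { $is_function{$_} } @tokens;
    my $sentence_count = () = $doc =~ /[.!?]+/g;
    $sentence_count ||= 1;
    my $avg_sentence_length = @tokens / $sentence_count;
    return ($function_count / @tokens) > 0.05 && $avg_sentence_length > 5;
}

print looks_like_text('home news contact imprint sitemap') ? "text\n" : "not text\n";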

more ...

Feeding the COW at the FU Berlin

I am now part of the COW project (COrpora on the Web). The project has been carried out by (amongst others) Roland Schäfer and Felix Bildhauer at the FU Berlin for about two years. A lot of work has already been done, especially on long-haul crawls in several languages.

Resources

A few resources have already been made available: software, n-gram models, as well as web-crawled corpora, which for copyright reasons are not downloadable as a whole. They may be accessed through a dedicated interface (COLiBrI – COW's Light Browsing Interface) or downloaded upon request in scrambled form (all sentences randomly reordered).
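For illustration only (the actual COW tooling is a separate matter), scrambling in this sense can be as simple as shuffling a corpus stored with one sentence per line:

#!/usr/bin/perl
# Sketch: print the sentences of a one-sentence-per-line corpus in random
# order, which destroys document context while keeping the sentences usable.
# Usage: perl scramble.pl < corpus.txt > scrambled.txt
use strict;
use warnings;
use List::Util qw(shuffle);

my @sentences = <STDIN>;
print shuffle(@sentences);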

This is …

more ...

Two open-source corpus-builders for German and French

Introduction

I have already described how to build a basic specialized crawler on this blog, and I have also written about crawling a newspaper website to build a corpus. As I continued working on this topic, I decided to release a few useful scripts under an open-source license.

The crawlers are not mere link harvesters; they are designed to be used as corpus builders. As one cannot republish anything but quotations from the texts, the purpose is to enable others to build their own version of the corpora. Since the newspapers are updated quite often, it is not conceivable to create exact duplicates …
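The rebuilding step itself can be summed up in a short sketch (file names and the politeness delay are placeholders): given a list of links produced by the crawler, one fetches each page and stores it locally for subsequent cleaning.

#!/usr/bin/perl
# Sketch of the rebuilding idea: fetch every URL from a link list and store
# the pages locally for later cleaning. File names are placeholders and the
# politeness delay is a bare minimum.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(agent => 'corpus-rebuilder-sketch/0.1', timeout => 10);
open(my $list, '<', 'links.txt') or die "Cannot open link list: $!";
my $counter = 0;
while (my $url = <$list>) {
    chomp $url;
    next unless $url;
    my $response = $ua->get($url);
    next unless $response->is_success;
    open(my $out, '>:encoding(UTF-8)', sprintf('page-%05d.html', ++$counter)) or die $!;
    print $out $response->decoded_content;
    close $out;
    sleep 2;  # be polite to the server
}
close $list;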

more ...

Crawling a newspaper website to build a corpus

Building on my previous post about specialized crawlers, I will show how I recently crawled a French sports newspaper named L'Equipe using scripts written in Perl. This is for educational purposes: it works for now, but it is bound to stop working as soon as the design of the website changes.

Gathering links

First of all, you have to make a list of links so that you have something to start from. Here is the beginning of the script:

#!/usr/bin/perl
#assuming you're using a UNIX-based system...
use strict; #because it gets messy without …
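The excerpt above is only the very beginning; as a rough idea of where it is heading, the link-gathering step boils down to fetching a section page and keeping only the links that look like article URLs. The start page and the pattern below are illustrative guesses, not necessarily what fits the site today.

#!/usr/bin/perl
# Rough sketch of the link-gathering step: fetch a section page and keep only
# links matching an article-like pattern (both are illustrative guesses).
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(agent => 'corpus-crawler-sketch/0.1');
my $response = $ua->get('http://www.lequipe.fr/Football/');
die $response->status_line unless $response->is_success;

my %seen;
foreach my $link ($response->decoded_content =~ m/href="([^"]+)"/g) {
    next unless $link =~ m/breves/;                     # keep article-like pages only
    $link = 'http://www.lequipe.fr' . $link if $link =~ m{^/};
    print "$link\n" unless $seen{$link}++;
}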
more ...

Building a basic specialized crawler

As I went on crawling again over the last few days, I thought it could be helpful to describe the way I proceed.

Note that this is for educational purposes only (I do not claim to have built the fastest and most reliable crawling engine ever) and that the aim is to crawl specific pages of interest. This implies that I can tell which links I want to follow just by regular expressions, because I have observed how a given website is organized.
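Before going into the steps, here is a minimal sketch of the overall shape of such a crawler, with a placeholder website and placeholder patterns: a frontier of URLs to visit, a record of what has already been seen, and two regular expressions, one deciding which pages to keep and one deciding which links to follow.

#!/usr/bin/perl
# Sketch of a tiny specialized crawl loop restricted by hand-written regular
# expressions. Start URL and patterns are placeholders for illustration.
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(agent => 'specialized-crawler-sketch/0.1');
my @frontier = ('http://www.example.org/category/news/');
my (%visited, @pages_of_interest);

while (my $url = shift @frontier) {
    next if $visited{$url}++;
    my $response = $ua->get($url);
    next unless $response->is_success;
    my $html = $response->decoded_content;
    push @pages_of_interest, $url if $url =~ m{/article-\d+};       # pages to keep
    foreach my $link ($html =~ m{href="(http://www\.example\.org[^"]+)"}g) {
        push @frontier, $link if $link =~ m{/(category|article)};   # links to follow
    }
    sleep 1;                       # politeness
    last if keys %visited > 50;    # keep the sketch bounded
}
print "$_\n" for @pages_of_interest;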

I see two (or possibly three) steps in the process, which I will go through, giving a few hints in …

more ...

Using and parsing the hCard microformat, an introduction

Recently, as I decided to get involved in the design of my personal page, I learned how to represent semantic markup on a web page. I would like to share a few things about writing and parsing semantic information of this kind. I have the intuition that this is only the beginning, and that there will be more and more formats to describe who you are, what you do, whom you are related to, and where you link to, as well as engines that gather this information.

First of all, the hCard microformat refers to this standard: hCard 1.0.1. For …
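As a first, deliberately naive illustration of what parsing can mean here, the following sketch pulls a few hCard properties out of an embedded example by their class names. Real-world parsing should rely on a proper HTML or microformats parser rather than regular expressions; this only shows the idea.

#!/usr/bin/perl
# Naive sketch: extract a few hCard properties (fn, org) by their class names.
# A real parser should use an HTML/microformats module instead of regexes.
use strict;
use warnings;

my $html = <<'END';
<div class="vcard">
  <a class="fn url" href="http://example.org">John Doe</a>
  <span class="org">Example University</span>
</div>
END

while ($html =~ m/class="([^"]*)"[^>]*>([^<]+)</g) {
    my ($classes, $content) = ($1, $2);
    foreach my $property (qw(fn org email)) {
        print "$property: $content\n" if $classes =~ /\b$property\b/;
    }
}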

more ...

Collecting academic papers

I would like to build a corpus from a variety of scientific papers of a given field in a given language (German).

The problems of crawling aside, I wonder whether there is a way to do this automatically. All the papers I have read deal with hand-collected corpora.

The Open Archives format might be a good workaround (see the Open Archives Initiative Protocol for Metadata Harvesting, OAI-PMH). As far as I know, it is widespread, and there are search engines that look for academic papers and use this metadata.
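As a sketch of what such harvesting could look like (the repository address is a placeholder, the XML handling is deliberately crude, and a real harvester would follow resumptionTokens with a proper XML parser), an OAI-PMH ListRecords request can be issued and the Dublin Core identifiers listed as follows:

#!/usr/bin/perl
# Sketch: issue an OAI-PMH ListRecords request and list the dc:identifier
# fields. The base URL is a placeholder repository endpoint.
use strict;
use warnings;
use LWP::UserAgent;

my $base = 'http://example.org/oai';
my $ua = LWP::UserAgent->new(agent => 'oai-harvest-sketch/0.1');
my $response = $ua->get("$base?verb=ListRecords&metadataPrefix=oai_dc");
die $response->status_line unless $response->is_success;

foreach my $id ($response->decoded_content =~ m{<dc:identifier>([^<]+)</dc:identifier>}g) {
    print "$id\n";
}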

Among the most popular ones (Google Scholar, Scirus, OAIster), a few seem …

more ...