Bits of Language: corpus linguistics, NLP and text analytics

“Googleology is bad science”: Anatomy of a web corpus infrastructure

This post discusses a seminal article on corpus linguistics by Adam Kilgarriff. It shows which challenges arise when dealing with web corpora and how a corresponding infrastructure can be developed.

more ...

How to download web pages in parallel and follow politeness rules in Python

Optimizing downloads is crucial when gathering data from a series of websites. However, one should respect “politeness” rules. Here is a simple way to keep an eye on all these constraints at once.
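As a rough illustration of the idea (not the code from the post), the sketch below groups URLs by host and only releases one URL per host once a waiting time has elapsed. The class name `PoliteScheduler`, the `delay` parameter and its default of five seconds are assumptions made for this example.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical helper: hand out at most one URL per host every `delay` seconds."""

    def __init__(self, urls, delay=5.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock
        self.queues = defaultdict(deque)  # host -> URLs still to be fetched
        self.next_allowed = {}            # host -> earliest time of the next request
        for url in urls:
            self.queues[urlsplit(url).netloc].append(url)

    def ready_urls(self):
        """Return one URL for each host whose waiting time is over."""
        now = self.clock()
        batch = []
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, float("-inf")) <= now:
                batch.append(queue.popleft())
                self.next_allowed[host] = now + self.delay
        return batch

    def pending(self):
        """True as long as at least one URL has not been handed out."""
        return any(self.queues.values())
```

Each batch contains at most one URL per host, so it could be passed to a thread pool (for instance `concurrent.futures.ThreadPoolExecutor`) to download different sites in parallel without hitting any single server too often.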

more ...

An easy way to save time and resources: content-aware URL filtering

Avoid wasting bandwidth and processing time on webpages which are probably not worth the effort. Stay away from pages with little text in the target language, or focus on other pages to gather links.
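A minimal sketch of what such a filter could look like, assuming that the URL string alone gives enough hints. The patterns and the function name `classify_url` are made up for this example and are deliberately crude: a two-letter first path segment is treated as a language code, which will misfire on some sites.

```python
import re
from urllib.parse import urlsplit

# Hypothetical patterns, for illustration only
MEDIA_EXT = re.compile(r"\.(?:jpe?g|png|gif|pdf|mp3|mp4|zip)$", re.IGNORECASE)
NAVIGATION = re.compile(r"/(?:tag|category|author|page)/", re.IGNORECASE)
LANG_SEGMENT = re.compile(r"^/([a-z]{2})(?:/|$)")


def classify_url(url, target_lang="en"):
    """Return "skip", "links" or "text" based on the URL alone."""
    path = urlsplit(url).path
    # Binary or media files rarely contain usable text
    if MEDIA_EXT.search(path):
        return "skip"
    # A leading two-letter segment is read as a language code
    match = LANG_SEGMENT.match(path)
    if match and match.group(1) != target_lang:
        return "skip"
    # Navigation pages carry little text but are useful to gather links
    if NAVIGATION.search(path):
        return "links"
    return "text"
```

With these assumptions, `classify_url("https://example.org/tag/corpus/")` is routed to link gathering, while a URL starting with `/de/` is skipped when the target language is English.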

more ...

Web scraping with R: Text and metadata extraction

Why choose between R and Python when you can have both? This tutorial shows how to install a Python scraper and use it for content discovery and text extraction, all straight from R.

more ...

Using sitemaps to crawl websites on the command-line

Sitemaps are particularly useful for web crawling, as they allow machines to crawl a site more intelligently. The post contains all the code snippets needed to optimize link discovery and filtering and to work with sitemaps on the command-line.
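The post itself works on the command-line. As a complementary sketch in Python, the function below shows what link discovery from a sitemap amounts to, assuming a file that follows the sitemaps.org XML format. The function name `parse_sitemap` and the sample document are invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample, for illustration only
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/blog/post-1.html</loc></url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return (nested sitemap URLs, page URLs) found in a sitemap document."""
    root = ET.fromstring(xml_text)
    # A sitemap index lists further sitemaps, a URL set lists pages
    sitemaps = [e.text.strip() for e in root.findall("sm:sitemap/sm:loc", NS) if e.text]
    pages = [e.text.strip() for e in root.findall("sm:url/sm:loc", NS) if e.text]
    return sitemaps, pages


sitemaps, pages = parse_sitemap(SAMPLE)
```

Nested sitemaps would have to be fetched and parsed in turn, and the resulting page URLs can then be filtered before any download takes place.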

more ...

Ad hoc and general-purpose corpus construction from web sources

The diversity and quantity of texts present on the Internet have to be better assessed to allow for the description of language with its diversity and change. Focusing on actual construction processes leads to better corpus design, beyond simple collections or heterogeneous resources.

more ...

Bibliography

Work in progress towards a page listing (web) corpus linguistics references and resources.

Corpus Linguistics and Corpus Building

The Routledge Handbook of Corpus Linguistics, 1 ed., O’Keeffe, A. and McCarthy, M., Eds., London, New York: Routledge, 2010.
N. Bubenhofer, Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge, Zürich, 2009.
S. Loiseau, “Corpus, quantification et typologie textuelle”, Syntaxe et sémantique, vol. 9, pp. 73-85, 2008.
C. Draxler, Korpusbasierte Sprachverarbeitung, Günter Narr, 2008.
M. Cori, “Des méthodes de traitement automatique aux linguistiques fondées sur les corpus”, Langages, vol. 171, iss. 3, pp. 95-110 …

more ...

Finding viable seed URLs for web corpora

I recently attended the Web as Corpus Workshop in Gothenburg, where I gave a talk on a paper of mine, “Finding viable seed URLs for web corpora: a scouting approach and comparative study of available sources”, and another on a paper written with Felix Bildhauer and Roland Schäfer, “Focused Web Corpus Crawling”.

Summary

The comparison started from web crawling experiments I performed at the FU Berlin. The conventional tools of the “Web as Corpus” framework rely heavily on URLs obtained from search engines. URLs were easily gathered that way until search engine companies restricted this access, meaning that …

more ...

Challenges in web corpus construction for low-resource languages

I recently presented a paper at the third LRL Workshop (a joint LTC-ELRA-FLaReNet-META_NET workshop on “Less Resourced Languages, new technologies, new challenges and opportunities”).

Motivation

The state-of-the-art tools of the “web as corpus” framework rely heavily on URLs obtained from search engines. Recently, this querying process has become very slow or impossible to perform on a low budget.

Moreover, there are diverse and partly unknown search biases related to search engine optimization tricks and undocumented PageRank adjustments, so that diverse sources of URL seeds could at least ensure that there is not a single bias, but …

more ...

Guessing if a URL points to a WordPress blog

I am currently working on a project for which I need to identify WordPress blogs as fast as possible, given a list of URLs. I decided to write a review on this topic since the hints I found on how to do it were relevant but scattered.

First of all, let’s say that guessing if a website uses WordPress by analysing HTML code is straightforward if nothing has been done to hide it, which is almost always the case. As WordPress is one of the most popular content management systems, downloading every page and performing a check afterward is an option …

more ...
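To make the HTML check mentioned in the WordPress teaser above more concrete, here is a minimal sketch. It assumes the page has already been downloaded and that the site does not hide its usual markers (the generator meta tag and the `wp-content` or `wp-includes` paths). The function name `looks_like_wordpress` is hypothetical, and the heuristic can produce both false positives and false negatives.

```python
import re

# Markers commonly left in the HTML by a default WordPress installation
GENERATOR = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']WordPress',
    re.IGNORECASE,
)
WP_PATHS = re.compile(r"/wp-(?:content|includes)/", re.IGNORECASE)


def looks_like_wordpress(html):
    """Guess from the HTML source whether a page is served by WordPress."""
    return bool(GENERATOR.search(html) or WP_PATHS.search(html))
```

Since this requires fetching every page first, it is the slow option, and a check on the URL alone would be needed to go faster.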