Using sitemaps to crawl websites on the command line

This post describes practical ways to crawl websites by working with sitemaps on the command line. Sitemaps are particularly useful since they are designed to let machines crawl a site more intelligently. The post includes the code snippets needed to optimize link discovery and filtering.
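As a minimal illustration, the following Python sketch downloads a sitemap and lists the URLs it contains; the sitemap location and XML namespace follow the sitemaps.org protocol and are assumptions, not taken from the post itself.

    # Sketch: fetch a sitemap and print the URLs found in its <loc> elements.
    # The sitemap address below is a placeholder.
    from urllib.request import urlopen
    from xml.etree import ElementTree

    LOC = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"

    def sitemap_urls(sitemap_url):
        "Return the URLs listed in a sitemap file."
        with urlopen(sitemap_url) as response:
            tree = ElementTree.parse(response)
        return [element.text for element in tree.iter(LOC)]

    for url in sitemap_urls("https://www.example.org/sitemap.xml"):
        print(url)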

more ...

Filtering links to gather texts on the web

The issue with URLs and URIs

A Uniform Resource Locator (URL), colloquially termed a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifier (URI).

Both navigation on the Web and web crawling rely on the assumption that “the Web is a space in which resources are identified by Uniform Resource Identifiers (URIs).” (Berners-Lee et al., 2006) That said, URLs cannot be expected to be entirely reliable, especially as part of Web 2.0 content …
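Because of this unreliability, crawlers usually normalize and filter candidate links before fetching them. The following sketch shows the general idea with Python's standard library; the concrete rules (accepted schemes, excluded extensions) are illustrative assumptions, not the post's actual criteria.

    # Keep plausible http(s) URLs; drop fragments, duplicates and
    # targets unlikely to contain running text.
    from urllib.parse import urldefrag, urlparse

    def filter_links(links):
        seen, kept = set(), []
        for link in links:
            url, _ = urldefrag(link)          # strip the #fragment part
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                continue                       # not a fetchable web address
            if parsed.path.lower().endswith((".jpg", ".png", ".pdf", ".css", ".js")):
                continue                       # media or style files, not text
            if url not in seen:
                seen.add(url)
                kept.append(url)
        return kept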

more ...

Evaluation of date extraction tools for Python

Introduction

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult, and finding the most efficient way to gather language data remains an important problem. Metadata extraction is part of data mining and knowledge extraction techniques. Dates are critical components since they are relevant both from a philological standpoint and in the context of information technology.

In most cases, the immediately accessible data on retrieved web pages do not carry substantial or accurate information: neither the URL nor the server response provides a reliable way to date a web document, i.e. to find …
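To see why, here is a sketch of the two immediately accessible but weak signals mentioned above; both frequently yield nothing, or a date unrelated to the document's content. The URL pattern and the returned field names are assumptions made for illustration.

    # Collect the date hints available without parsing the page itself.
    import re
    from urllib.request import urlopen

    def naive_date_hints(url):
        hints = {}
        match = re.search(r"/(20\d{2})/(\d{1,2})/", url)   # e.g. /2016/12/ in the path
        if match:
            hints["url"] = f"{match.group(1)}-{int(match.group(2)):02d}"
        with urlopen(url) as response:
            last_modified = response.headers.get("Last-Modified")
        if last_modified:
            hints["header"] = last_modified    # often the crawl or server date, not the text's
        return hints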

more ...

Evaluating scraping and text extraction tools for Python

Although text is ubiquitous on the Web, extracting information from web pages can prove to be difficult. Pages come in different shapes and sizes, mostly because of the wide variety of platforms and content management systems, and not least because of the diverging purposes and goals pursued in web publishing.

This wide variety of contexts and text genres leads to important design decisions during the collection of texts: should the tooling be adapted to the particular news outlets or blogs that are targeted (which often amounts to the development of web scraping tools), or should the extraction be as generic as …
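As an example of the generic route, here is a minimal extraction call using trafilatura, one tool of this kind; the excerpt above is truncated before naming specific candidates, so this choice is an assumption.

    # Fetch a page and extract its main text generically.
    import trafilatura

    downloaded = trafilatura.fetch_url("https://www.example.org/article.html")
    if downloaded is not None:
        text = trafilatura.extract(downloaded)  # returns the main text, or None
        print(text)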

more ...


A module to extract date information from web pages

Description

Metadata extraction

Diverse content extraction and scraping techniques are routinely used on web document collections by companies and research institutions alike. Being able to better qualify content allows for insights based on metadata (e.g. content type, authors or categories), better bandwidth control (e.g. by knowing when web pages have been updated), or optimization of indexing (e.g. language-based heuristics, LRU cache, etc.).

In short, metadata extraction is useful for different kinds of purposes ranging from knowledge extraction and business intelligence to classification and refined visualizations. It is often necessary to fully parse the document or apply robust …
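As a usage sketch, a module of this kind can be called on a URL or an HTML document. The example below uses htmldate, which matches this description, though the truncated excerpt does not name the module.

    # Return the best date guess for a web page as an ISO string, or None.
    from htmldate import find_date

    print(find_date("https://www.example.org/post.html"))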

more ...

On the interest of social media corpora

Introduction

The need to study language use in computer-mediated communication (CMC) appears to be of common interest, as online communication is ubiquitous and raises a series of ethical, sociological, technological and technoscientific issues among the general public. The importance of linguistic studies on CMC is acknowledged beyond the research community, for example in forensic analysis, since evidence can be found online and traced back to its author.

In a South Park episode (“Fort Collins”, season 20, episode 6), a schoolgirl performs “emoji analysis” to get information on the author of troll messages. Using the distribution of emojis, she concludes …

more ...

Bibliography

Work in progress towards a page listing (web) corpus linguistics references and resources.

Summary

Corpus Linguistics and Corpus Building

  • The Routledge Handbook of Corpus Linguistics, 1st ed., O’Keeffe, A. and McCarthy, M., Eds., London, New York: Routledge, 2010.
  • N. Bubenhofer, Einführung in die Korpuslinguistik: Praktische Grundlagen und Werkzeuge, Zürich, 2009.
  • S. Loiseau, “Corpus, quantification et typologie textuelle”, Syntaxe et sémantique, vol. 9, pp. 73-85, 2008.
  • C. Draxler, Korpusbasierte Sprachverarbeitung, Günter Narr, 2008.
  • M. Cori, “Des méthodes de traitement automatique aux linguistiques fondées sur les corpus”, Langages, vol. 171, iss. 3, pp. 95-110 …
more ...

Collection and indexing of tweets with a geographical focus

This paper introduces a geographically focused Twitter corpus in order to (1) test selection and collection processes for a given region and (2) find a suitable database to query, filter, and visualize the tweets.
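For illustration, the filtering step could look like the sketch below: keep tweets whose coordinates fall inside a bounding box. The field layout follows Twitter's JSON format ("coordinates" as [longitude, latitude]); the bounding box itself and the JSON-lines file layout are assumptions.

    # Yield tweets from a JSON-lines file whose geotag lies inside BBOX.
    import json

    BBOX = (5.9, 45.8, 10.5, 47.8)  # lon_min, lat_min, lon_max, lat_max (placeholder)

    def tweets_in_region(path):
        with open(path, encoding="utf-8") as infile:
            for line in infile:
                tweet = json.loads(line)
                point = (tweet.get("coordinates") or {}).get("coordinates")
                if point is None:
                    continue                   # not geotagged
                lon, lat = point
                if BBOX[0] <= lon <= BBOX[2] and BBOX[1] <= lat <= BBOX[3]:
                    yield tweet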

Why Twitter?

“To do linguistics on texts is to do botany on a herbarium, and zoology on the remains of more or less well-preserved animals.”

Faire de la linguistique sur des textes, c’est faire de la …

more ...

Analysis of the German Reddit corpus

I would like to present work on the major social news aggregation and discussion platform Reddit, which I recently introduced at the NLP4CMC 2015 workshop. The article published in the proceedings is available online: Collection, Description, and Visualization of the German Reddit Corpus.

Basic idea

The work described in the article directly follows from the recent release of the “Reddit comment corpus”: Reddit user Stuck In The Matrix (Jason Baumgartner) made the dataset publicly available on archive.org at the beginning of July 2015 and claimed it contained every publicly available comment.

Corpus construction

In order to focus on …
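One plausible construction step, sketched below, is to stream the JSON-lines dump and keep the comments identified as German; the use of the langid library here is an assumption for illustration, not necessarily the article's actual setup.

    # Yield comments from the Reddit dump whose body is identified as German.
    import json
    import langid

    def german_comments(path):
        with open(path, encoding="utf-8") as infile:
            for line in infile:
                comment = json.loads(line)
                body = comment.get("body", "")
                if body and langid.classify(body)[0] == "de":
                    yield comment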

more ...