Archives for Bits of Language

A module to extract date information from web pages

On the interest of social media corpora

Ad hoc and general-purpose corpus construction from web sources

Indexing text with ElasticSearch

Collection and indexing of tweets with a geographical focus

Distant reading and text visualization

Foucault and the spatial turn

Parsing and converting HTML documents to XML format using Python’s lxml

Analysis of the German Reddit corpus

Rule-based URL cleaning for text collections

Finding viable seed URLs for web corpora

Challenges in web corpus construction for low-resource languages

A one-pass valency-oriented chunker for German

Guessing if a URL points to a WordPress blog

Review of the Czech internet corpus

Overview of URL analysis and classification methods

Batch file conversion to the same encoding on Linux

Introducing the Microblog Explorer

What is good enough to become part of a web corpus?

Recipes for several model fitting techniques in R

Data analysis and modeling in R: a crash course

Blind reason, Leibniz and the age of cybernetics

A note on Computational Models of Psycholinguistics

Feeding the COW at the FU Berlin

Ludovic Tanguy on Visual Analysis of Linguistic Data

Review of the readability checker DeLite

Two open-source corpus-builders for German and French

On global vs. local visualization of readability

“Gerolinguistics” and text comprehension

Microsoft to analyze social networks to determine comprehension level

Amazon’s readability statistics by example

2nd release of the German Political Speeches Corpus

XML standards for language corpora (review)

Completing web pages on the fly with JavaScript

Canadian research on readability in the ’90s

Word lists, word frequency and contextual diversity

Interview with children’s books author Sabine Ludwig

My contribution to the Anglicism of the Year award

Tendencies in research on readability

Bibliography and links updates

A note on Amazon’s text readability stats

Using a rule-based tokenizer for German

Parallel work with two taggers

Find and delete LaTeX temporary files

Selected recent discoveries

Display long texts with CSS, tutorial and example

Introducing the German Political Speeches Corpus and Visualization Tool

About Google Reading Level

A few links on producing posters using LaTeX

Lord Kelvin, Bachelard and Dilbert on Measurement

Crawling a newspaper website to build a corpus

Building a basic specialized crawler

Workshop on Complexity in Language – Day 2 (report)

Workshop on Complexity in Language – Day 1 (report)

Halliday on complexity (1992)

Approaches to philosophy of technology

Simon, Gell-Mann and Lloyd on complex systems

Melanie Mitchell: defining and measuring complexity

Renate Bartsch on linguistic complexity

Philosophy of technology, how things started: a typology

Philosophy of technology: a few resources

Binary search to find words in a list: Perl tutorial

Resource links update

Quick review of the Falko Project

Having fun and making money doing research

Three series of recorded lectures

On Text Linguistics

E. Castello, Text Complexity and Reading Comprehension Tests - Reading Notes

Using and parsing the hCard microformat, an introduction

Commented bibliography on readability assessment

Comparison of Features for Automatic Readability Assessment: review

A short bibliography on Latent Semantic Analysis and Indexing

Building a topic-specific corpus out of two different corpora

Collecting academic papers

Why I don’t blog on and why I might do so (someday…)

A fast bash pipe for TreeTagger

Resources and links of interest