Here are a few links to interesting things that I read recently.
- An article co-authored by Daniel Lemire
about cleaning Project Gutenberg e-Books (i.e. removing
preambles and epilogues) using a statistical (and not a rule-based) approach:
Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books by Owen Kaser and Daniel Lemire
- An article about XML compression techniques which details
several alternatives, including “XML conscious compressors”, i.e.
compressors which enable queries.
Investigate state-of-the-art XML compression techniques by Sherif Sakr
- A few SSH tricks, most notably the section “Operating on Remote Files Locally”, which explains how to use SSHFS to work in a “usual” local directory while all the files are actually stored on the remote machine.
- Last but not least, regular compilations of links by the computational linguistics department of the University Paris 3, mostly about linguistics, tools, programming and web culture. Many of the links are in English, the others in French. See the HTML version of the documents.
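
As a side note on the SSHFS trick mentioned above, here is a minimal sketch of what the mount and unmount steps could look like, written as a small Python helper that only builds the command lines. The user name, host and paths are hypothetical placeholders, and the unmount command assumes a Linux system with FUSE (`fusermount -u`); the linked article may well proceed differently.

```python
import shlex


def sshfs_mount_command(user, host, remote_dir, mount_point):
    """Build the argument list that mounts remote_dir (on host) at mount_point."""
    return ["sshfs", f"{user}@{host}:{remote_dir}", mount_point]


def sshfs_unmount_command(mount_point):
    """Build the argument list that unmounts mount_point (Linux/FUSE assumed)."""
    return ["fusermount", "-u", mount_point]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    mount = sshfs_mount_command("alice", "example.org", "/home/alice/corpus", "/tmp/corpus")
    unmount = sshfs_unmount_command("/tmp/corpus")
    # Print the commands instead of running them, so nothing is mounted by accident.
    print(shlex.join(mount))
    print(shlex.join(unmount))
```

The printed lines can be pasted into a terminal once the mount point exists as an empty local directory; passing the lists to `subprocess.run` would execute them directly.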