Bits of Language: corpus linguistics, NLP and text analytics

Filtering links to gather texts on the web

Courlan is a command-line tool and Python library designed to clean, filter, normalize, and sample URLs. Its primary purpose is to optimize web crawling by focusing on web pages containing primarily spam-free text in a target language.
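To make the idea of cleaning, normalizing and filtering links more concrete, here is a minimal sketch that uses only the Python standard library. It is not Courlan's API or implementation: the function names, the list of tracking parameters and the list of unwanted file extensions are all hypothetical choices made for illustration, and language-based filtering and sampling are left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical lists, chosen for illustration only.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
UNWANTED_EXTENSIONS = (".jpg", ".png", ".pdf", ".zip", ".mp4", ".css", ".js")


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and tracking parameters."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def is_candidate(url):
    """Keep HTTP(S) links that do not obviously point to non-text resources."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    return not parts.path.lower().endswith(UNWANTED_EXTENSIONS)


def filter_links(urls):
    """Normalize, filter and deduplicate links while preserving their order."""
    seen, kept = set(), []
    for url in urls:
        normalized = normalize_url(url)
        if is_candidate(normalized) and normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept
```

Calling `filter_links` on a list of raw links returns the normalized, deduplicated candidates in their original order. A real crawler would add further heuristics on top of this, which is the kind of work a dedicated tool takes care of.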


Batch file conversion to the same encoding on Linux

I recently had to deal with a corpus containing files in different encodings, and I would like to share the solution I found for automatically converting all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script to detect and correct the encoding, but it was anything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.

I was not disappointed, as file for example gives relevant information when used …

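As a rough illustration of the detection step, the sketch below calls `file -i` on each file in a directory, reads the reported charset and rewrites the file as UTF-8. It is not necessarily the solution described in the post: the function names and the list of skipped charsets are hypothetical, and the conversion relies on Python's own codecs, which is an assumption made to keep the example self-contained.

```python
import subprocess
from pathlib import Path

# Charset labels that cannot or need not be converted (illustrative choice).
SKIPPED_CHARSETS = {"binary", "unknown-8bit", "utf-8", "us-ascii"}


def parse_charset(file_output):
    """Extract the charset from a line such as 'a.txt: text/plain; charset=utf-8'."""
    _, separator, charset = file_output.strip().rpartition("charset=")
    return charset if separator else None


def detect_charset(path):
    """Ask the `file` utility for the encoding of a single file."""
    result = subprocess.run(
        ["file", "-i", str(path)], capture_output=True, text=True, check=True
    )
    return parse_charset(result.stdout)


def convert_to_utf8(path, source_encoding):
    """Rewrite a file in place as UTF-8, decoding it with the given encoding."""
    path = Path(path)
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


def convert_directory(directory):
    """Convert every regular file in a directory whose charset was detected."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        charset = detect_charset(path)
        if charset is None or charset in SKIPPED_CHARSETS:
            continue
        # Raises LookupError if Python has no codec under the reported name.
        convert_to_utf8(path, charset)
```

Since the files are overwritten in place, it is safer to try this on a copy of the corpus first. Detection of this kind is heuristic, so the reported charset is worth checking on a few files.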