Batch file conversion to the same encoding on Linux

I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found in order to try to convert automatically all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script in order to detect and correct the encoding, but it was everything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.

I was not disappointed, as file for example gives relevant information when used with this syntax: file -i filename. In fact, there are other tools such as enca, but I was luckier with this one.

input: file -i filename
output: filename: text/plain; charset=utf-8

grep -Po ‘…\K…’

First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. In order to do this grep is an option. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match will be printed and not the whole line, and finally the \K tells the …

more ...