I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found for automatically converting all the files in a directory to the same encoding (here UTF-8).
I first tried to write a script to detect and correct the encoding, but it was anything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.
I was not disappointed, as file, for example, gives relevant information when used with this syntax: file -i filename. In fact, there are other tools such as enca, but I was luckier with this one.
file -i filename
filename: text/plain; charset=utf-8
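To get an overview of a whole directory rather than a single file, the same command can be wrapped in a small loop. This is only a minimal sketch: the function name list_encodings and the directory argument are my own illustrative choices, and it assumes the file implementation found on most Linux systems, where -i prints the MIME type and charset.

```shell
# Hypothetical helper (name and argument are illustrative assumptions):
# run `file -i` on every regular file in a directory.
# The directory defaults to the current one.
list_encodings() {
    dir=${1:-.}
    for f in "$dir"/*; do
        # Skip subdirectories and anything that is not a regular file.
        [ -f "$f" ] || continue
        file -i "$f"
    done
}
```

Calling list_encodings with a directory path prints one line per file, in the same form as the output shown above.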
grep -Po '…\K…'
First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. In order to do this, grep is an option. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match will be printed and not the whole line, and finally the \K tells the …
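As a rough illustration of this parsing step, here is one way it could look. The pattern below is my own assumption and not necessarily the one elided above, the function name extract_charset is hypothetical, and the sketch requires a grep built with Perl regex support (as GNU grep usually is).

```shell
# Hypothetical helper: the pattern is an illustrative assumption,
# not the elided pattern from the text.
# Reads lines such as "filename: text/plain; charset=utf-8" on stdin
# and prints only the charset name.
extract_charset() {
    # -P: Perl-compatible regular expressions
    # -o: print only the matched part, not the whole line
    # \K: drop what was matched so far ("charset=") from the reported match
    grep -Po 'charset=\K\S+'
}

printf '%s\n' 'filename: text/plain; charset=utf-8' | extract_charset
# prints: utf-8
```

The output of file -i filename can be piped into this function in the same way as the sample line above.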