Batch file conversion to the same encoding on Linux
I recently had to deal with a corpus whose files used different encodings, and I would like to share the solution I found for automatically converting all the files in a directory to the same encoding (here UTF-8).
file -i
I first tried to write a script to detect and correct the encoding, but it was anything but easy, so I decided to use UNIX software instead, assuming these tools would be adequate and robust enough.
I was not disappointed, as file, for example, gives relevant information when used with this syntax: file -i filename. There are other tools, such as enca, but I had more luck with file.
input: file -i filename
output: filename: text/plain; charset=utf-8
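To see how this behaves on a whole directory rather than a single file, here is a minimal sketch. It assumes a POSIX shell and that file is installed. The helper name report_encodings and the two sample files are made up for the illustration and are not part of file itself.

```shell
#!/bin/sh
# Sketch: print what `file -i` reports for every regular file in a directory.
# report_encodings is a hypothetical helper name chosen for this example.
report_encodings() {
    dir=${1:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue   # skip subdirectories and unmatched globs
        file -i -- "$f"           # prints "name: mime/type; charset=..."
    done
}

# Demo on two throwaway files containing the same word in two encodings.
demo=$(mktemp -d)
printf 'caf\303\251\n' > "$demo/utf8.txt"    # "café" encoded as UTF-8
printf 'caf\351\n'     > "$demo/latin1.txt"  # "café" encoded as ISO-8859-1
report_encodings "$demo"
rm -r "$demo"
```

The UTF-8 file should be reported with charset=utf-8, as in the output shown above. The label given to the second file depends on the heuristics of the installed version of file, so it is worth checking on a few known samples before trusting it on a whole corpus.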
grep -Po ‘…\K…’
First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. In order to do this, grep is an option. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match will be printed and not the whole line, and finally the \K tells the …