I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found to automatically convert all the files in a directory to the same encoding (here UTF-8).

file -i

I first tried to write a script to detect and correct the encoding myself, but it was anything but easy, so I decided to rely on standard UNIX tools instead, assuming they would be adequate and robust enough.

I was not disappointed: file, for example, gives relevant information when used with this syntax: file -i filename. There are other tools such as enca, but I had more luck with this one.

input: file -i filename
output: filename: text/plain; charset=utf-8
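
Before converting anything, it can be useful to survey the whole directory at once, since file accepts several arguments and prints one line per file (the filenames and results below are hypothetical, just to show the shape of the output):

input: file -i dir/*
output: dir/a.txt: text/plain; charset=utf-8
        dir/b.txt: text/plain; charset=iso-8859-1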

grep -Po '…\K…'

If everything goes well, one gets an answer of the kind filename: text/plain; charset=utf-8, which has to be parsed. grep is one option for this. The -P switch unlocks the power of Perl regular expressions, the -o switch ensures that only the match is printed and not the whole line, and the \K tells the engine to keep only what comes after it (note that -o and \K work together here: without -o the whole line would be printed, and without \K the match would still include the charset= prefix).

So, to select the detected charset name and nothing else: grep -Po 'charset=\K.+?$'

input: file -i "$filename" | grep -Po 'charset=\K.+?$'
output: utf-8
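
For comparison, the same command without \K shows what each part contributes: -o still restricts the output to the match, but the match now includes the charset= prefix:

input: file -i "$filename" | grep -Po 'charset=.+?$'
output: charset=utf-8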

The encoding is stored in a variable as the result of a command-line using a pipe:

encoding=$(file -i "$filename" | grep -Po 'charset=\K.+?$')
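
As an aside, file can also print the charset directly via its --mime-encoding switch, combined with -b (brief mode, no filename prefix), which makes the grep step unnecessary, provided the installed version of file supports it; the pipe is kept above because the parsing pattern is useful in general:

encoding=$(file -b --mime-encoding "$filename")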

if [[ ! $encoding =~ "unknown" ]]

Before the re-encoding can take place, it is necessary to filter out the cases where file could not identify the encoding (this happened to me, but less often than with enca).

The exclamation mark used in an if context negates the test, effectively turning it into an unless statement, and the =~ operator attempts to find a match in a string (here in the variable).
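
A quick way to see this test at work is to feed it a literal value; unknown-8bit is one of the charsets file reports when it cannot identify the encoding (a hypothetical snippet, purely for illustration):

encoding="unknown-8bit"
if [[ ! $encoding =~ "unknown" ]]; then echo "convert"; else echo "skip"; fi
output: skip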

iconv

Once everything has been checked, one may proceed to the conversion proper, using iconv. The encoding of the source file is specified with the -f switch, and -t gives the target encoding (here UTF-8).

iconv -f "$encoding" -t UTF-8 < "$filename" > "$destfile"

Note that UTF-8//TRANSLIT may also be used if there are too many errors, which should not be the case with a UTF-8 target, but transliteration becomes necessary when converting to ASCII, for example.
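
To illustrate, converting to ASCII with transliteration replaces accented characters with close equivalents instead of aborting; the exact substitutions depend on the locale and the iconv implementation, so the output below is indicative:

input: echo 'déjà vu' | iconv -f UTF-8 -t ASCII//TRANSLIT
output: deja vu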

Last, there might be cases where the encoding could not be identified at all; that is what the else clause is for: per-byte re-encoding…
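
As a minimal sketch of such a fallback, one could assume that the unidentified files all share a single legacy encoding, say ISO-8859-1, in which every byte maps to exactly one character, so the conversion always succeeds; this is only a guess about the corpus and the results have to be checked by hand:

iconv -f ISO-8859-1 -t UTF-8 < "$filename" > "$destfile"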

The whole trick

Here is the whole script:

#!/usr/bin/bash

for filename in dir/*        # 'dir' should be changed...
do
  encoding=$(file -i "$filename" | grep -Po 'charset=\K.+?$')
  destfile="dir2/$(basename "$filename")"        # 'dir2' should also be changed...
  if [[ ! $encoding =~ "unknown" ]]
  then
    iconv -f "$encoding" -t UTF-8 < "$filename" > "$destfile"
  else
    # do something like using a conversion table to address targeted problems
    :        # no-op placeholder (an else branch may not be empty)
  fi
done

This bash script proved efficient and enabled me to homogenize my corpus. It runs quite fast and saves time, because one only has to focus on the problematic cases (which would have to be addressed by hand anyway); the rest is taken care of automatically.
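
A final check can confirm that the homogenization worked, by listing every file in the target directory that does not report UTF-8 (pure-ASCII files will appear as us-ascii, which is a harmless subset of UTF-8):

file -i dir2/* | grep -v 'charset=utf-8'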