I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found to automatically convert all the files in a directory to the same encoding (here UTF-8).
I first tried to write a script to detect and correct the encoding, but it was anything but easy, so I decided to rely on UNIX tools instead, assuming they would be adequate and robust enough.
I was not disappointed: file, for example, gives relevant information when used with this syntax: file -i filename. There are other tools such as enca, but I was luckier with this one.
file -i filename
filename: text/plain; charset=utf-8
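To see this kind of output in practice, here is a quick test on a throwaway file (sample.txt is just an illustrative name):

```shell
# Create a small UTF-8 file and ask 'file' for its MIME type and charset
printf 'héllo wörld\n' > sample.txt
file -i sample.txt
# typically prints: sample.txt: text/plain; charset=utf-8
rm sample.txt
```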
grep -Po '…\K…'
First of all, one gets an answer of the kind filename: text/plain; charset=utf-8 (if everything goes well), which has to be parsed. grep is an option for this. The -P option unlocks the power of Perl regular expressions, the -o option ensures that only the match will be printed and not the whole line, and finally the \K tells the interpreter to keep only what comes after it (note that -o and \K are redundant in this example).
So, in order to select the detected charset name and only it:
grep -Po 'charset=\K.+?$'
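The pattern can be checked without file at all by piping a sample output line through grep (the filename here is made up):

```shell
# Only the part after 'charset=' survives, thanks to \K
echo "filename: text/plain; charset=utf-8" | grep -Po 'charset=\K.+?$'
# prints: utf-8
```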
file -i $filename | grep -Po 'charset=\K.+?$'
The encoding is stored in a variable as the result of a command-line using a pipe:
encoding=$(file -i "$filename" | grep -Po 'charset=\K.+?$')
if [[ ! $encoding =~ "unknown" ]]
Before the re-encoding can take place, it is necessary to filter out the cases where file could not identify the encoding (it happened to me, but to a lesser extent than with enca).
The exclamation mark used in an if context negates the test, turning it into the equivalent of an unless statement, and the =~ operator attempts to find a match in a string (here in the variable).
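A minimal sketch of this test, with made-up sample values, shows how the negation behaves ('unknown-8bit' is what file reports when it cannot identify the charset):

```shell
for encoding in "iso-8859-1" "unknown-8bit"
do
    if [[ ! $encoding =~ "unknown" ]]
    then
        echo "$encoding: convert"   # safe to pass to iconv
    else
        echo "$encoding: skip"      # detection failed, handle separately
    fi
done
# prints:
# iso-8859-1: convert
# unknown-8bit: skip
```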
Once everything has been cleared, one may proceed to the conversion proper with iconv. The -f option specifies the encoding of the input file and -t the target encoding (here UTF-8).
iconv -f "$encoding" -t UTF-8 < "$filename" > "$destfile"
Note that UTF-8//TRANSLIT may also be used if there are too many errors, which should not be the case with a UTF-8 target but is necessary when converting to ASCII for example.
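As an illustration of transliteration, here is a conversion of accented UTF-8 text down to ASCII (the sample string is arbitrary, and the exact replacement characters depend on the iconv implementation and locale):

```shell
# Without //TRANSLIT this conversion would fail on 'é';
# with it, the accented character is approximated or replaced
echo "café" | iconv -f UTF-8 -t ASCII//TRANSLIT
```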
Last, there might be cases where the encoding could not be retrieved; that is what the else clause is for, e.g. per-byte re-encoding…
The whole trick
Here is the whole script:
#!/usr/bin/bash
for filename in dir/*  # 'dir' should be changed...
do
    encoding=$(file -i "$filename" | grep -Po 'charset=\K.+?$')
    destfile="dir2/"$(basename "$filename")  # 'dir2' should also be changed
    if [[ ! $encoding =~ "unknown" ]]
    then
        iconv -f "$encoding" -t UTF-8 < "$filename" > "$destfile"
    else
        :  # do something like using a conversion table to address targeted problems
    fi
done
This bash script proved efficient and enabled me to homogenize my corpus. It runs quite fast and also saves time because one only has to focus on the problematic cases (which ought to be addressed anyway); the rest is taken care of.
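To check the whole approach end to end, one can fabricate a Latin-1 file and run the same detect-then-convert pipeline on it (the directory and file names below are throwaway examples):

```shell
mkdir -p dir dir2
printf 'caf\351 au lait\n' > dir/latin1.txt   # octal 351 = 0xE9 = 'é' in ISO-8859-1
encoding=$(file -i dir/latin1.txt | grep -Po 'charset=\K.+?$')
iconv -f "$encoding" -t UTF-8 < dir/latin1.txt > dir2/latin1.txt
file -i dir2/latin1.txt   # should now report charset=utf-8
rm -r dir dir2
```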