I recently had to deal with a series of files with different encodings in the same corpus, and I would like to share the solution I found to automatically convert all the files in a directory to the same encoding (here UTF-8).
file -i
I first tried to write a script to detect and correct the encoding, but it was anything but easy, so I decided to use UNIX tools instead, assuming they would be adequate and robust enough.
I was not disappointed, as file, for example, gives relevant information when used with this syntax: file -i filename. In fact, there are other tools such as enca, but I had better luck with this one.
input: file -i filename
output: filename: text/plain; charset=utf-8
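A file stored in a legacy encoding is reported accordingly; for instance, a (hypothetical) Latin-1 file would typically give:
input: file -i oldfile.txt
output: oldfile.txt: text/plain; charset=iso-8859-1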
grep -Po '…\K…'
First of all, one gets an answer of the kind
filename: text/plain; charset=utf-8
(if everything goes well), which
has to be parsed. In order to do this, grep is an option. The -P option
unlocks the power of Perl regular expressions, the -o option ensures
that only the match is printed and not the whole line, and finally
the \K tells the engine to discard everything matched so far, so that
only what comes afterwards is kept (without -o the whole line would be
printed, and without \K the output would still start with "charset=").
So, in order to select the detected charset name and only it:
grep -Po 'charset=\K.+?$'
input: file -i $filename | grep -Po 'charset=\K.+?$'
output: utf-8
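To see what each option contributes, the extraction can be tested on a sample line with echo (the behaviour shown here is that of GNU grep):
input: echo "filename: text/plain; charset=utf-8" | grep -Po 'charset=.+?$'
output: charset=utf-8
input: echo "filename: text/plain; charset=utf-8" | grep -Po 'charset=\K.+?$'
output: utf-8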
The encoding is stored in a variable as the result of a command-line using a pipe:
encoding=$(file -i $filename | grep -Po 'charset=\K.+?$')
if [[ ! $encoding =~ "unknown" ]]
Before the re-encoding can take place, it is necessary to filter out the cases where file could not identify the encoding (it happened to me, but to a lesser extent than with enca).
The exclamation mark used in an if context transforms it into an unless statement, and the =~ operator attempts to find a match in a string (here in the variable).
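As a minimal sketch, assuming that an undetected file is reported with a charset containing "unknown" (e.g. unknown-8bit), the test can be tried on its own:
encoding="unknown-8bit"
if [[ ! $encoding =~ "unknown" ]]
then
    echo "detected encoding: $encoding"
else
    echo "encoding could not be determined"
fi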
iconv
Once everything has been cleared, one may proceed to the proper conversion, using iconv. The encoding of the input file is specified using the -f switch, and -t gives the target encoding (here Unicode).
iconv -f $encoding -t UTF-8 < $filename > $destfile
Note that //TRANSLIT may also be appended to the target encoding if there are too many errors, which should not be the case with a UTF-8 target but is necessary when converting to the ASCII format for example.
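For instance, a conversion to ASCII with transliteration of the characters that have no exact equivalent could look like this:
iconv -f $encoding -t ASCII//TRANSLIT < $filename > $destfile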
Last, there might be cases where the encoding could not be retrieved; that's what the else clause is for: per-byte re-encoding…
The whole trick
Here is the whole script:
#!/usr/bin/bash
for filename in dir/*    # 'dir' should be changed...
do
    encoding=$(file -i $filename | grep -Po 'charset=\K.+?$')
    destfile="dir2/"$(basename $filename)    # 'dir2' should also be changed...
    if [[ ! $encoding =~ "unknown" ]]
    then
        iconv -f $encoding -t UTF-8 < $filename > $destfile
    else
        # do something like using a conversion table to address targeted problems
        :    # no-op placeholder, as an else branch cannot be empty
    fi
done
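The else branch is left open on purpose. A possible fallback, assuming for instance that most unidentified files come from a Windows environment, would be to force a likely legacy encoding and discard the few bytes that still cannot be converted (the WINDOWS-1252 guess is of course an assumption to adapt to one's corpus):
# hypothetical fallback: force a likely legacy encoding and drop (-c) invalid bytes
iconv -c -f WINDOWS-1252 -t UTF-8 < $filename > $destfile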
This bash script proved efficient and enabled me to homogenize my corpus. In my experience, it runs quite fast and also saves time, because one only has to focus on the problematic cases (which ought to be addressed anyway); the rest is taken care of.