01-07-2008, 07:06 AM   #1
alexxxm
Cleaning bad characters

In the same spirit as my previous post ( https://www.mobileread.com/forums/sho...479#post134479 ), i.e. writing small utilities that each do one well-defined job, here is the code I use to make sure my books do not contain bad characters (bad = anything outside printable ASCII, apart from line endings).

Code:
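#!/usr/bin/perl
use strict;
use warnings;

# Usage: correct_nonascii.pl [-l] filenamein filenameout
my $list = (@ARGV && $ARGV[0] eq "-l") ? 1 : 0;
shift @ARGV if $list;
my ($fin, $fout) = @ARGV;
die "usage: $0 [-l] filenamein filenameout\n"
        unless defined $fin && ($list || defined $fout);

# Read the whole input file as raw bytes
open(my $in, "<", $fin) or die "cannot open $fin: $!\n";
binmode($in);
my @a = <$in>;
close($in);

if ($list)
{
        # Count every byte outside printable ASCII (0x20-0x7e), CR and LF
        my %ext;
        foreach my $l (@a)
        {
                while ($l=~/([^\x20-\x7e\n\r])/g)
                {
                        my $hcode = sprintf "%02x", ord($1);
                        $ext{$hcode}++;
                }
        }
        print "\n\nNon-printable characters, and their number of occurrences\n","-"x70,"\n";
        foreach my $k (sort keys %ext)
        {print "0x$k\t$ext{$k}\n"}
}
else
{
        open(my $out, ">", $fout) or die "cannot write $fout: $!\n";
        binmode($out);
        foreach my $l (@a)
        {
                # Substitution table (byte values as in Windows-1252); extend at will
                $l=~s/\x97/-/g;         # long dash
                $l=~s/\x91/'/g;         # left single quote
                $l=~s/\x92/'/g;         # right single quote
                $l=~s/\x93/"/g;         # left double quote
                $l=~s/\x94/"/g;         # right double quote
                print {$out} $l;
        }
        close($out) or die "error writing $fout: $!\n";
}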
Save it under some name (e.g. correct_nonascii.pl) and run it as:
correct_nonascii.pl [-l] filenamein filenameout

When run with the -l switch, it lists the number of occurrences of each non-printable character in the input file.
When run without it, it applies the substitution table, which you can extend at will.
For example, the line:
$l=~s/\x97/-/g;
replaces the character with hex code 0x97 (a long "-" sign, which turns up often) with the usual "-" character.
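
If the table grows, one way to extend it is to keep the substitutions in a hash and build a single regex from its keys. This is only a sketch of that idea, not part of the script above: the names %subst and clean_line are made up for illustration, and the 0x96 and 0x85 entries are my own additions, assuming the text uses Windows-1252 byte values.

```perl
use strict;
use warnings;

# Hypothetical substitution table: byte found in the file => ASCII replacement.
# 0x96 and 0x85 are illustrative additions (assuming Windows-1252 byte values).
my %subst = (
    "\x91" => "'",      # left single quote
    "\x92" => "'",      # right single quote
    "\x93" => '"',      # left double quote
    "\x94" => '"',      # right double quote
    "\x96" => "-",      # short dash
    "\x97" => "-",      # long dash
    "\x85" => "...",    # ellipsis
);

# Build one character class such as [\x91\x92...] from the table keys.
my $class = join "", map { sprintf "\\x%02x", ord } keys %subst;
my $re    = qr/([$class])/;

# Return a copy of the line with every listed byte replaced.
sub clean_line {
    my ($l) = @_;
    $l =~ s/$re/$subst{$1}/g;
    return $l;
}

print clean_line("\x93Hi\x94 \x96 it\x92s done\x85\n");
```

Adding a new character then means adding one line to the hash, and the loop over the input lines would call clean_line($l) instead of the individual s/// lines.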

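Run it with the -l switch first to scan for problems, then look up the reported codes in a good ASCII table (for codes above 0x7f you will need an extended table such as Windows-1252).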
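
As a possible shortcut for the table lookup, the hex codes reported by -l can be translated into character names with the core Encode and charnames modules. This is a sketch under the assumption that the book is encoded as Windows-1252 (which is what 0x97 being a long dash suggests); describe_byte is a hypothetical helper name, not something the script above provides.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Assumption: the file is Windows-1252 encoded.
# Takes a hex code as printed by -l (e.g. "97" or "0x97") and
# returns a line describing which character that byte stands for.
sub describe_byte {
    my ($hex) = @_;
    my $code  = hex($hex);                  # hex() accepts an optional 0x prefix
    my $byte  = chr($code);
    my $char  = decode("cp1252", $byte);    # byte -> Unicode character
    my $cp    = ord($char);
    my $name  = charnames::viacode($cp);
    $name = "(no name)" unless defined $name;
    return sprintf("0x%02x -> U+%04X %s", $code, $cp, $name);
}

print describe_byte($_), "\n" for qw(91 92 93 94 97);
```

If the file is in some other encoding, the names printed will be misleading, so treat the output as a hint and not as proof.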

Alessandro