|  01-07-2008, 07:06 AM | #1 | 
| Addict     Posts: 223 Karma: 356 Join Date: Aug 2007 Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ... | 
				
				Cleaning bad characters
			 
			
			In the same spirit as my previous post https://www.mobileread.com/forums/sho...479#post134479 , i.e. writing small utilities that each do one well-defined job, here is the code I use to make sure my books do not contain bad characters (bad = non-printable).
Code:
#!/usr/bin/perl -w
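use strict;

# Usage: correct_nonascii.pl [-l] filenamein filenameout
my ($list, $fin, $fout);
if (@ARGV && $ARGV[0] eq "-l") { $list = 1; $fin = $ARGV[1]; $fout = $ARGV[2] }
else                           { $list = 0; $fin = $ARGV[0]; $fout = $ARGV[1] }
die "usage: $0 [-l] filenamein filenameout\n" unless defined $fin;

open(A, "<", $fin) or die "Cannot read $fin: $!\n";
my @a = <A>;
close(A);

if ($list == 1)
{
        # Count every byte outside \x20-\x7e, except CR and LF
        my %ext;
        foreach my $l (@a)
        {
                while ($l =~ /([^\x20-\x7e\n\r])/g)
                {
                        my $hcode = sprintf "%x", ord($1);
                        $ext{$hcode}++;
                }
        }
        print "\n\nNon-printable characters, and their number of occurrences\n", "-" x 70, "\n";
        # Sort by numeric value of the hex code, not as strings
        foreach my $k (sort { hex($a) <=> hex($b) } keys %ext)
        { print "0x$k\t$ext{$k}\n" }
}
else
{
        die "usage: $0 [-l] filenamein filenameout\n" unless defined $fout;
        open(B, ">", $fout) or die "Cannot write $fout: $!\n";
        foreach my $l (@a)
        {
                # Substitution table: extend at will
                $l =~ s/\x97/-/g;    # long dash
                $l =~ s/\x91/'/g;    # left single quote
                $l =~ s/\x92/'/g;    # right single quote
                $l =~ s/\x93/"/g;    # left double quote
                $l =~ s/\x94/"/g;    # right double quote
                print B $l;
        }
        close(B);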
}
Usage: correct_nonascii.pl [-l] filenamein filenameout
When run with the -l switch, it lists the number of occurrences of each non-printable char in the file. When run without it, it applies the substitution table, which you can extend at will. Taking the example line $l=~s/\x97/-/g; you substitute the char with hex code 0x97 (a long "-" sign, which turns up often) with the usual "-" char. Use the -l switch first to scan for problems, then look the codes up in a good character table (codes above 0x7F are outside plain ASCII). Alessandro | 
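The substitution table described in this post could also be kept in a single hash, so that extending it means adding one entry rather than one more s/// line. This is only a minimal sketch, assuming the input uses the same byte codes as the posted script; the names %subst and clean_line are illustrative and not part of the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical substitution table: byte => ASCII replacement.
# The entries mirror the example table in the posted script.
my %subst = (
    "\x97" => "-",    # long dash
    "\x91" => "'",    # left single quote
    "\x92" => "'",    # right single quote
    "\x93" => '"',    # left double quote
    "\x94" => '"',    # right double quote
);

# Build one character class from the table keys, e.g. [\x97\x91...]
my $class = join "", map { sprintf "\\x%02x", ord($_) } keys %subst;
my $re    = qr/([$class])/;

# Replace every byte listed in %subst; all other bytes pass through untouched.
sub clean_line {
    my ($line) = @_;
    $line =~ s/$re/$subst{$1}/g;
    return $line;
}

print clean_line("\x93Don\x92t\x94 \x97 ok\n");    # prints: "Don't" - ok
```

To handle another character, add one more pair to %subst; the regex is rebuilt from the keys on the next run.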
|   |   | 
|  01-07-2008, 07:56 AM | #2 | 
| eBook Enthusiast            Posts: 85,560 Karma: 93980341 Join Date: Nov 2006 Location: UK Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6 | 
			
			PLEASE don't replace the "em-dash" with a hyphen. They are completely different characters, and are used for different purposes. I hate it when people incorrectly use a hyphen when they really mean to use an em-dash.
		 | 
|   |   | 
|  01-07-2008, 08:45 AM | #3 | |
| Addict     Posts: 223 Karma: 356 Join Date: Aug 2007 Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ... | Quote: 
 I'm very curious, since that damned 0X97 always displays wrong on my terminal. Alessandro | |
|   |   | 
|  01-07-2008, 08:49 AM | #4 | 
| Resident Curmudgeon            Posts: 80,665 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | 
			
			Not very useful at this point, as long as the em dash is being replaced with a plain dash. It took me a while to sort out that problem with Book Designer, and here you go causing the same problem.
		 | 
|   |   | 
|  01-07-2008, 08:54 AM | #5 | |
| eBook Enthusiast            Posts: 85,560 Karma: 93980341 Join Date: Nov 2006 Location: UK Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6 | 
			
			The em-dash is used to indicate a "pause" in a sentence. E.g., from the book I'm working on at the moment: Quote: 
 Hyphens and dashes are grammatically completely different things. | |
|   |   | 
|  01-07-2008, 09:20 AM | #6 | 
| Connoisseur       Posts: 66 Karma: 614 Join Date: Jul 2007 Location: New York Device: Sony PRS-505, iLiad Book Edition | 
			
			The script also appears to be meant to replace right and left quotation marks (x91 to x94) with straight quotes. Apparently, what alexxxm means by nonprintable characters is characters outside the basic ASCII set. So the script would do the opposite of what most people want to do in converting a plain text file for typesetting or formatting as an ebook.
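To make the -l listing easier to read, the byte codes mentioned here can be labelled with the characters they stand for. A small sketch, assuming the file is Windows-1252 text (the thread does not confirm the encoding); %cp1252_name and describe_byte are hypothetical names, and the table covers only a handful of codes.

```perl
use strict;
use warnings;

# Assumed meaning of a few byte codes, valid only for Windows-1252 input.
my %cp1252_name = (
    0x91 => "left single quotation mark",
    0x92 => "right single quotation mark",
    0x93 => "left double quotation mark",
    0x94 => "right double quotation mark",
    0x96 => "en dash",
    0x97 => "em dash",
);

# Return a label such as "0x97 em dash" for a single byte.
sub describe_byte {
    my ($byte) = @_;
    my $code = ord($byte);
    my $name = exists $cp1252_name{$code} ? $cp1252_name{$code} : "not in table";
    return sprintf "0x%02x %s", $code, $name;
}

print describe_byte("\x97"), "\n";    # prints: 0x97 em dash
```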
		 | 
|   |   | 
|  01-07-2008, 09:27 AM | #7 | 
| Wizard            Posts: 3,442 Karma: 300001 Join Date: Sep 2006 Location: Belgium Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear | 
			
			Everything you wanted and didn't want to know about dashes and hyphens. http://en.wikipedia.org/wiki/Dash#Hyphen | 
|   |   | 
|  01-07-2008, 09:33 AM | #8 | |
| eBook Enthusiast            Posts: 85,560 Karma: 93980341 Join Date: Nov 2006 Location: UK Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6 | Quote: 
 As long as the eBook is in some format which correctly indicates its code page (as, for example, a MobiPocket book does) it should display correctly on any computer. | |
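When the source bytes are known to be Windows-1252 (an assumption, since a plain text file does not say so), an alternative to flattening the characters is to re-encode the text as UTF-8, so that the em dash and curly quotes survive on devices that handle UTF-8. A minimal sketch using Perl's core Encode module; the sub name cp1252_to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a string of Windows-1252 bytes into UTF-8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $text = decode("cp1252", $bytes);    # bytes -> Perl characters
    return encode("UTF-8", $text);          # characters -> UTF-8 bytes
}

# Byte 0x97 (em dash in Windows-1252) becomes the three UTF-8 bytes e2 80 94.
my $out = cp1252_to_utf8("\x97");
print join(" ", map { sprintf "%02x", ord($_) } split //, $out), "\n";    # prints: e2 80 94
```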
|   |   | 
|  01-07-2008, 09:41 AM | #9 | 
| Resident Curmudgeon            Posts: 80,665 Karma: 150249619 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 | 
			
			Basically, what this script does is turn the reading experience into something not as nice as it would have been. I personally prefer the curly quotes and apostrophes. Also, when I read something where I can easily tell there is supposed to be an em dash and it's a regular dash, that annoys me. What I want to know is ... what is the purpose of this script? Is it to strip the reading experience from the text? | 
|   |   | 
|  01-07-2008, 02:53 PM | #10 | 
| Addict     Posts: 223 Karma: 356 Join Date: Aug 2007 Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ... | 
			
			Wow, I'm really sorry I submitted this... Rich as the experience of nicely curled quotes and the like may be, I often found in the past (it happened a lot on the Nokia N700 and on the Rocket too) that many high-ASCII chars were not recognized and were displayed as gray boxes. That's the reason I wrote it. I routinely run it with the -l switch to check for problems, and if I see the occasional c with a cedilla instead of "c", and it's not a Spanish text, I can always add an entry to the table to switch it to "c". The proposed table was just that: an example (the last one I used). I didn't intend to offend anyone by suggesting you downgrade your reading experience. Simply put, many texts are full of wrongly scanned chars, and my utility takes care of that. ... now if only there was a way to make this thread disappear from the forum... Alessandro | 
|   |   | 
|  01-07-2008, 07:37 PM | #11 | 
| Reader            Posts: 11,504 Karma: 8720163 Join Date: May 2007 Location: South Wales, UK Device: Sony PRS-500, PRS-505, Asus EEEpc 4G | 
			
			Don't be discouraged, Alessandro. This is a popular forum and different members have different preferences. I'm sure that some people will be glad to have your utility.
		 | 
|   |   | 
|  01-07-2008, 08:41 PM | #12 | 
| Literate!            Posts: 256 Karma: 2247 Join Date: Mar 2007 Device: PRS-500 | 
			
			I agree with Patricia, please don't let this stop you from continuing to post.  The script you have up there is very easily configured to suit anyone's needs and I plan on adjusting the table to work for me right away.   To Harry: I'm with you in terms of making sure that grammar rules are followed, but I recently read a book where somehow the em-dashes were replaced by quotation marks and I would have gladly taken hyphens in that case.  I stumbled every time I hit one of those. | 
|   |   | 
|  01-07-2008, 10:28 PM | #13 | 
| Connoisseur       Posts: 66 Karma: 614 Join Date: Jul 2007 Location: New York Device: Sony PRS-505, iLiad Book Edition | 
			
			Let me try to explain, Alessandro. My post (like, I'm sure, Harry's and Jon's) had nothing to do with wanting to discourage you from posting, or with feeling offended by or wanting to disparage your preferences, or with the quality of your Perl scripting. It was based on a real misunderstanding. As the merest amateur dabbler in Perl (or any sort of coding), I could see what your script is meant to do and how it does it. So, as far as I'm competent to judge such things at all, it's a really good, clear, efficient script. But I didn't understand why you would write a script to do that. Actually, I still don't. Why do you need to format your ebooks for a device or an application that can't properly render extended-ASCII or Unicode character encodings? Why torture yourself? I really would like to discourage you from doing that!
		 Last edited by akiburis; 01-07-2008 at 11:15 PM. Reason: Misspelling corrected. | 
|   |   | 
|  01-08-2008, 02:20 AM | #14 | |
| Addict     Posts: 223 Karma: 356 Join Date: Aug 2007 Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ... | Quote: 
 Really, it wasn't torture, I enjoy coding... although I felt heavily disappointed by some reactions. Of course my 505 can easily render extended-ASCII and Unicode character encodings. The reason I wrote it is that I often work with books that are not in formats like those mentioned by HarryT (i.e. "... some format which correctly indicates its code page as, for example, a MobiPocket book does..."). This often happens with plain .txt files, which do not indicate any code page at all. I wouldn't dream of using it on an already beautifully formatted Mobi book; the problems arise with freshly scanned texts. A simple run with the [-l] switch tells me at once whether there are any problems, and only then do I know whether, and how, to correct them. Alessandro | |
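Since a plain .txt file carries no code page, one rough check before choosing any substitutions is whether its bytes already form valid UTF-8. A minimal sketch with the core Encode module; looks_like_utf8 is a hypothetical helper, and a "no" answer only suggests some legacy 8-bit encoding without saying which one. Pure ASCII text also passes, since ASCII is a subset of UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return 1 if the byte string decodes cleanly as strict UTF-8, else 0.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode("UTF-8", $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

print looks_like_utf8("dash \xe2\x80\x94 here"), "\n";    # prints: 1
print looks_like_utf8("dash \x97 here"), "\n";            # prints: 0
```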
|   |   | 
|  01-08-2008, 04:21 AM | #15 | 
| The Introvert            Posts: 8,307 Karma: 1000077497 Join Date: Jan 2007 Location: United Kingdom Device: Sony Reader PRS-650 & 505 & 500 | 
			
			Regarding hyphens and em-dashes: I have encountered the same problem. However, lately I have noticed that in some books there are no em-dashes; instead there are hyphens with a space before and after the hyphen. If I use Harry_T's example: Their eyes met, and something—something unspoken but cogent—passed between them. it is perfectly fine if you edit it like this: Their eyes met, and something - something unspoken but cogent - passed between them. Just don't replace em-dashes with hyphens without spaces on either side of the hyphen. I am reading a hardback edition of a book right now and that's what I have in the book. No em-dashes at all. | 
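The spaced-hyphen convention described in this post could be applied with a small change to the 0x97 substitution. A minimal sketch, assuming the long dash is stored as the single byte 0x97 as in the original script; dash_to_spaced_hyphen is a hypothetical name.

```perl
use strict;
use warnings;

# Replace each 0x97 byte with " - ", absorbing any spaces or tabs already
# around it so the result always has exactly one space on each side.
sub dash_to_spaced_hyphen {
    my ($line) = @_;
    $line =~ s/[ \t]*\x97[ \t]*/ - /g;
    return $line;
}

print dash_to_spaced_hyphen(
    "Their eyes met, and something\x97something unspoken but cogent\x97passed between them.\n"
);
# prints: Their eyes met, and something - something unspoken but cogent - passed between them.
```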
|   |   | 