Old 01-07-2008, 07:06 AM   #1
alexxxm
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
Cleaning bad characters

In the same spirit as my previous post ( https://www.mobileread.com/forums/sho...479#post134479 ), i.e. writing small utilities that each do one well-defined job, here is the code I use to make sure my books do not contain bad characters (bad = non-printable).

Code:
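#!/usr/bin/perl
use strict;
use warnings;

# Usage: correct_nonascii.pl [-l] filenamein filenameout
my $list = (@ARGV && $ARGV[0] eq "-l") ? 1 : 0;
shift @ARGV if $list;
my ($fin, $fout) = @ARGV;
die "usage: $0 [-l] filenamein filenameout\n"
        unless defined $fin && ($list || defined $fout);

# Read the whole file as raw bytes, so that \x97 etc. match single bytes.
open(my $in, '<', $fin) or die "Cannot read $fin: $!\n";
binmode($in);
my @a = <$in>;
close($in);

if ($list)
{
        # Count every byte that is not printable ASCII, CR or LF.
        my %ext;
        foreach my $l (@a)
        {
                while ($l=~/([^\x20-\x7e\n\r])/g)
                {
                        $ext{ sprintf "%x", ord($1) }++;
                }
        }
        print "\n\nNon-printable characters, and their number of occurrences\n","-"x70,"\n";
        foreach my $k (sort { hex($a) <=> hex($b) } keys %ext)
        {print "0x$k\t$ext{$k}\n"}
}
else
{
        open(my $out, '>', $fout) or die "Cannot write $fout: $!\n";
        binmode($out);
        foreach my $l (@a)
        {
                # Substitution table: extend at will.
                $l=~s/\x97/-/g;         # em dash
                $l=~s/\x91/'/g;         # left single quote
                $l=~s/\x92/'/g;         # right single quote / apostrophe
                $l=~s/\x93/"/g;         # left double quote
                $l=~s/\x94/"/g;         # right double quote
                print {$out} $l;
        }
        close($out) or die "Cannot close $fout: $!\n";
}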
Save it under some name (e.g. correct_nonascii.pl) and run it as:
correct_nonascii.pl [-l] filenamein filenameout

When run with the -l switch, it lists every non-printable char found in the file and how many times each one occurs.
When run without it, it applies the substitution table, which you can extend at will.
According to the example line:
$l=~s/\x97/-/g;
you substitute the char having hex code 0x97 (a long "-" sign, happens often) with the usual "-" char.

Use the -l switch first to scan for problems, then look up the reported codes in a good character table (codes such as 0x91 to 0x97 come from Windows-1252 rather than plain ASCII).
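
For anyone who wants to extend the table, here is a minimal sketch of one way to keep all the substitutions in a single hash instead of one s/// line per character. It assumes the text is handled as raw Windows-1252 bytes, as in the script; the names %table and clean_line are made up for this illustration and are not part of the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical substitution table: byte (Windows-1252 code) => ASCII replacement.
my %table = (
    "\x91" => "'",   # left single quote
    "\x92" => "'",   # right single quote / apostrophe
    "\x93" => '"',   # left double quote
    "\x94" => '"',   # right double quote
    "\x97" => "-",   # em dash
);

# Build one character class from the keys, so each line is scanned only once.
my $class = join '', map { sprintf '\\x%02x', ord($_) } keys %table;
my $re    = qr/([$class])/;

# clean_line is an illustrative helper: it returns a copy of the line
# with every byte found in %table replaced by its ASCII equivalent.
sub clean_line {
    my ($line) = @_;
    $line =~ s/$re/$table{$1}/g;
    return $line;
}

print clean_line("\x93Wait\x97what?\x94\n");   # prints "Wait-what?"
```

Adding a new mapping is then a one-line change to the hash, and bytes that are not in the table pass through untouched.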

Alessandro
Old 01-07-2008, 07:56 AM   #2
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by alexxxm View Post
According to the example line:
$l=~s/\x97/-/g;
you substitute the char having hex code 0x97 (a long "-" sign, happens often) with the usual "-" char.
PLEASE don't replace the "em-dash" with a hyphen. They are completely different characters, and are used for different purposes. I hate it when people incorrectly use a hyphen when they really mean to use an em-dash.
Old 01-07-2008, 08:45 AM   #3
alexxxm
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
Quote:
Originally Posted by HarryT View Post
PLEASE don't replace the "em-dash" with a hyphen. They are completely different characters, and are used for different purposes. I hate it when people incorrectly use a hyphen when they really mean to use an em-dash.
... the purpose being???
I'm very curious, since that damned 0x97 always displays wrongly on my terminal.

Alessandro
Old 01-07-2008, 08:49 AM   #4
JSWolf
Resident Curmudgeon
Posts: 73,931
Karma: 128903250
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Not very useful at this point, as long as the em dash is being replaced with a plain dash. It took me a while to sort out that problem with Book Designer, and here you go causing the same problem.
Old 01-07-2008, 08:54 AM   #5
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
The em-dash is used to indicate a "pause" in a sentence. E.g., from the book I'm working on at the moment:

Quote:
Their eyes met, and something—something unspoken but cogent—passed between them.
The hyphen is simply used to join loosely-connected words (as I've just used it there), or to break a word at the end of a line.

Hyphens and dashes are grammatically completely different things.
Old 01-07-2008, 09:20 AM   #6
akiburis
Connoisseur
Posts: 66
Karma: 614
Join Date: Jul 2007
Location: New York
Device: Sony PRS-505, iLiad Book Edition
The script also appears to be meant to replace right and left quotation marks (x91 to x94) with straight quotes. Apparently, what alexxxm means by nonprintable characters is characters outside the basic ASCII set. So the script would do the opposite of what most people want to do in converting a plain text file for typesetting or formatting as an ebook.
Old 01-07-2008, 09:27 AM   #7
igorsk
Wizard
Posts: 3,442
Karma: 300001
Join Date: Sep 2006
Location: Belgium
Device: PRS-500/505/700, Kindle, Cybook Gen3, Words Gear
Everything you wanted and didn't want to know about dashes and hyphens.
http://en.wikipedia.org/wiki/Dash#Hyphen
Old 01-07-2008, 09:33 AM   #8
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by akiburis View Post
The script also appears to be meant to replace right and left quotation marks (x91 to x94) with straight quotes. Apparently, what alexxxm means by nonprintable characters is characters outside the basic ASCII set. So the script would do the opposite of what most people want to do in converting a plain text file for typesetting or formatting as an ebook.
Yes, as you rightly say, this is the very opposite of what most of us are trying to achieve, which is to add "richer" content back into a plain ASCII text file!

As long as the eBook is in some format which correctly indicates its code page (as, for example, a MobiPocket book does), it should display correctly on any computer.
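
For a plain text file that carries no code page information, one alternative to replacing the characters is to convert the bytes to UTF-8, which keeps the em-dashes and curly quotes intact. The following is only a sketch under the assumption that the input really is Windows-1252; to_utf8 is a made-up helper name, and the Encode module it relies on ships with Perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the input bytes are Windows-1252 (cp1252).
# to_utf8 is an illustrative helper name.
sub to_utf8 {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);   # raw bytes -> Perl characters
    return encode('UTF-8', $text);         # characters -> UTF-8 bytes
}

binmode(STDOUT);                           # write the bytes exactly as produced
print to_utf8("something\x97something unspoken but cogent\x97passed\n");
```

If the file is actually in some other 8-bit encoding, the conversion will still run but will produce the wrong characters, so the assumption is worth checking first.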
Old 01-07-2008, 09:41 AM   #9
JSWolf
Resident Curmudgeon
Posts: 73,931
Karma: 128903250
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Basically, what this script does is make the reading experience less nice than it would have been. I personally prefer the curly quotes and apostrophes. Also, when I can easily tell that there is supposed to be an em dash and it's a regular dash, that annoys me.

What I want to know is ... what is the purpose of this script? Is it to strip the reading experience from the text?
Old 01-07-2008, 02:53 PM   #10
alexxxm
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
Wow, I'm really sorry I submitted this...
However rich the experience of nicely curled quotes and the like may be, I often found in the past (on the Nokia N700 and on the Rocket too) that many high-end ASCII chars were not recognized and were displayed as gray boxes.
That's the reason I wrote it.
I routinely run it with the -l switch to check for problems, and if I see the occasional c with a cedilla instead of "c" (and it's not a Spanish text), I can always add an entry to the table to turn it into "c".
The proposed table was just that: an example (the last one I used).
I didn't intend to offend anyone by suggesting that you downgrade your reading experience.
Simply put, many texts are full of wrongly scanned chars, and my utility takes care of that.

... now if only there was a way to let this thread disappear from the forum...

Alessandro
Old 01-07-2008, 07:37 PM   #11
Patricia
Reader
Posts: 11,504
Karma: 8720163
Join Date: May 2007
Location: South Wales, UK
Device: Sony PRS-500, PRS-505, Asus EEEpc 4G
Don't be discouraged, Alessandro. This is a popular forum and different members have different preferences. I'm sure that some people will be glad to have your utility.
Old 01-07-2008, 08:41 PM   #12
Nogg
Literate!
Posts: 256
Karma: 2247
Join Date: Mar 2007
Device: PRS-500
I agree with Patricia, please don't let this stop you from continuing to post. The script you have up there is very easily configured to suit anyone's needs and I plan on adjusting the table to work for me right away.

To Harry: I'm with you in terms of making sure that grammar rules are followed, but I recently read a book where somehow the em-dashes were replaced by quotation marks and I would have gladly taken hyphens in that case. I stumbled every time I hit one of those.
Old 01-07-2008, 10:28 PM   #13
akiburis
Connoisseur
Posts: 66
Karma: 614
Join Date: Jul 2007
Location: New York
Device: Sony PRS-505, iLiad Book Edition
Let me try to explain, Alessandro. My post (like, I'm sure, Harry's and Jon's) had nothing to do with wanting to discourage you from posting, or with feeling offended by or wanting to disparage your preferences, or with the quality of your Perl scripting. It was based on a real misunderstanding. As the merest amateur dabbler in Perl (or any sort of coding), I could see what your script is meant to do and how it does it. So, as far as I'm competent to judge such things at all, it's a really good, clear, efficient script. But I didn't understand why you would write a script to do that. Actually, I still don't. Why do you need to format your ebooks for a device or an application that can't properly render extended-ASCII or Unicode character encodings? Why torture yourself? I really would like to discourage you from doing that!

Last edited by akiburis; 01-07-2008 at 11:15 PM. Reason: Misspelling corrected.
Old 01-08-2008, 02:20 AM   #14
alexxxm
Addict
Posts: 223
Karma: 356
Join Date: Aug 2007
Device: Rocket; Hiebook; N700; Sony 505; Kindle DX ...
Quote:
Originally Posted by akiburis View Post
Why do you need to format your ebooks for a device or an application that can't properly render extended-ASCII or Unicode character encodings? Why torture yourself? I really would like to discourage you from doing that!
Thank you for the understanding, akiburis.
Really, it wasn't torture, I enjoy coding... although I felt heavily disappointed by some reactions.

Of course my 505 can easily render extended-ASCII and Unicode character encodings.

The reason I wrote it is that I often work with books that are not in formats like those mentioned by HarryT (i.e. "some format which correctly indicates its code page", as, for example, a MobiPocket book does). This often happens with plain .txt files, which do not indicate any code page at all.

I wouldn't dream of using it on an already beautifully formatted Mobi book; the problems arise with freshly scanned texts. A quick run with the -l switch tells me at once whether there are any problems, and only then do I decide what, if anything, to do to correct them.
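
Since a plain .txt file says nothing about its code page, a rough first check can complement the -l listing. The sketch below only distinguishes three cases (pure ASCII, valid UTF-8, anything else) and cannot tell one 8-bit code page from another; guess_encoding and its three labels are inventions of this example, not an existing tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode ();

# guess_encoding is an illustrative helper; it expects a string of raw bytes.
sub guess_encoding {
    my ($bytes) = @_;

    # No byte above 0x7f: nothing that needs a code page at all.
    return 'ascii' unless $bytes =~ /[^\x00-\x7f]/;

    # Strict decoding dies on malformed input; decode() with a CHECK
    # argument may modify its input, so work on a copy.
    my $copy = $bytes;
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };

    return $ok ? 'utf-8' : 'legacy 8-bit';
}

print guess_encoding("a\x97b"), "\n";
```

A lone 0x97 byte is not valid UTF-8, so a file containing one lands in the "legacy 8-bit" case, which is where the substitution table (or a conversion) becomes relevant.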

Alessandro
Old 01-08-2008, 04:21 AM   #15
astra
The Introvert
Posts: 8,307
Karma: 1000077497
Join Date: Jan 2007
Location: United Kingdom
Device: Sony Reader PRS-650 & 505 & 500
Regarding hyphens and em-dashes:
I have encountered the same problem.

However, I have lately noticed that some books contain no em-dashes at all; instead they use hyphens with a space before and after.
To use HarryT's example:

Their eyes met, and something—something unspoken but cogent—passed between them.

it is perfectly fine to edit it like this:

Their eyes met, and something - something unspoken but cogent - passed between them.

Just don't replace em-dashes with hyphens that have no space on either side.
The hardback I am reading right now does exactly this: spaced hyphens, and no em-dashes at all.
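
In terms of the script from the first post, the spaced variant could be sketched roughly as follows. It assumes the em dash is the single Windows-1252 byte 0x97, as in that script's table; spaced_dash is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# spaced_dash is an illustrative helper; \x97 is the Windows-1252 em dash byte.
sub spaced_dash {
    my ($line) = @_;
    # Swallow any spaces or tabs already around the dash so none are doubled.
    $line =~ s/[ \t]*\x97[ \t]*/ - /g;
    return $line;
}

print spaced_dash("and something\x97something unspoken but cogent\x97passed between them.\n");
# prints: and something - something unspoken but cogent - passed between them.
```

In the original script this would mean replacing the single line that handles \x97 with the substitution shown inside the helper.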