Old 06-03-2007, 11:56 PM   #1
Raventhon
Member
Posts: 21
Karma: 10
Join Date: Jun 2007
Device: Amazon Kindle 3
Quick Reformatting of Terrible E-Books

I don't know Perl myself, so I can't write this, but a script that did the following would be amazingly useful for formatting eBooks for viewing on mobile devices.

Scan all files in a directory (and subdirectories, hey, why not) and replace all instances of <newline> not immediately followed by either <newline> or <tab> with a single space.

Reasoning behind this: I've seen entirely too many eBooks formatted such that they use manual line breaks instead of using the word wrap feature, and when transferred to a mobile device, you end up with fragmented lines:

"This is a bunch of text serving as example of
incorrect word
wrap due to stupid formatting of eBooks. I really
wish there was
some way to fix it, because it's almost impossible
to read this
terribly formatted text."

Can anyone think of any files they have that this command would damage? I wouldn't want to run it on poetry, but other than that, it seems that this script can be safely run on normally-formatted eBooks without changing anything.
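
For illustration, here is a minimal Python sketch of the rule described above (Python rather than Perl is simply a choice made for this example). The names `unwrap` and `unwrap_tree` and the `*.txt` filter are hypothetical. One deliberate deviation is labeled in the comments: applied literally, the rule would also replace the second newline of a blank-line paragraph break, so the pattern additionally skips a newline that directly follows another newline. It works on raw bytes so that no text encoding has to be guessed, which assumes an ASCII-compatible encoding.

```python
import re
from pathlib import Path

# A newline followed by a character that is neither a newline nor a tab.
# Assumption added for this sketch: the (?<!\n) lookbehind keeps the second
# newline of a blank-line paragraph break, which the literal rule would eat.
SOFT_BREAK = re.compile(rb"(?<!\n)\n(?=[^\n\t])")


def unwrap(data: bytes) -> bytes:
    """Join hard-wrapped lines; keep blank-line and tab-indented paragraph starts."""
    data = data.replace(b"\r\n", b"\n")  # normalise Windows line endings first
    return SOFT_BREAK.sub(b" ", data)


def unwrap_tree(root: str, pattern: str = "*.txt") -> None:
    """Rewrite every matching file under root, including subdirectories, in place."""
    for path in Path(root).rglob(pattern):
        if path.is_file():
            path.write_bytes(unwrap(path.read_bytes()))
```

Because `unwrap_tree` overwrites files, it would be prudent to try it on a copy of the directory first. Poetry, tables, and anything else that relies on single line breaks would be flattened, as the post anticipates.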
Old 06-04-2007, 12:00 AM   #2
Raventhon
Note to self: Check previous posts before making new post. Your question may already have been discussed.
Old 06-04-2007, 12:15 AM   #3
mogui
eNigma
Posts: 503
Karma: 1335
Join Date: Dec 2006
Location: The Philippines
Device: HTC G1 Android FBReader
Previous threads on scripting

Sometimes it is hard to know what to search for. I am familiar with this thread and this one too that discuss scripting and handling the text-formatting problem that concerns you.

I hope this helps.
Old 06-05-2007, 11:46 AM   #4
JSWolf
Resident Curmudgeon
Posts: 35,983
Karma: 17083916
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad, nook STR
WYSIWYG Editor is broken

Last edited by JSWolf; 06-05-2007 at 11:49 AM. Reason: WYSIWYG Editor is broken
Old 06-05-2007, 11:48 AM   #5
JSWolf
Quote:
Originally Posted by Raventhon
I've been thinking, and as I don't actually know perl myself I'm unable to write the script for it, but a script that did the following would be amazingly useful in formatting eBooks for viewing on mobile devices.

Scan all files in a directory (and subdirectories, hey, why not) and replace all instances of <newline> not immediately followed by either <newline> or <tab> with a single space.

Reasoning behind this: I've seen entirely too many eBooks formatted such that they use manual line breaks instead of using the word wrap feature, and when transferred to a mobile device, you end up with fragmented lines:

"This is a bunch of text serving as example of
incorrect word
wrap due to stupid formatting of eBooks. I really
wish there was
some way to fix it, because it's almost impossible
to read this
terribly formatted text."

Can anyone think of any files they have that this command would damage? I wouldn't want to run it on poetry, but other than that, it seems that this script can be safely run on normally-formatted eBooks without changing anything.
Are you talking about purchased books, "downloaded" books, or books from sites like Project Gutenberg?
Old 06-21-2007, 11:13 AM   #6
Patricia
Reader
Posts: 11,520
Karma: 1728458
Join Date: May 2007
Location: South Wales, UK
Device: Sony PRS-500, PRS-505, Asus EEEpc 4G
With Project Gutenberg books in text file format, I just paste them into a Word document, then run Stingo's Macro. This only takes a couple of minutes and fixes the hard carriage returns.
Old 06-21-2007, 02:59 PM   #7
JSWolf
Quote:
Originally Posted by Patricia
With Project Gutenberg books in text file format, I just paste them into a word document, then run Stingo's Macro. This only takes a couple of minutes and solves the hard carriage breaks.
If there is an HTML version available, I go for that one. You'll get images, if there are any, and italics. It's not hard to work with the HTML in Book Designer. If you use the text file instead, you lose whatever attributes and images there might be. So please use the HTML when one exists.
Old 06-21-2007, 04:46 PM   #8
nekokami
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
Stingo's macro just looks for double paragraph marks, doesn't it? It won't help if you have a file that doesn't have an extra line between paragraphs (as often happens with files that have been through a PDF stage somewhere in their history). I've been thinking of writing a Perl script to make a "best guess" based on line length. I'll be doing some Perl work this summer, and may have a chance to slip it in then. I'll post it somewhere on MobileRead (in the wiki, maybe) if I get a reasonable version working.
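
To make the "best guess based on line length" idea concrete, here is one possible sketch in Python. It is not the script mentioned in the post. The function name `join_by_line_length` and the 0.8 ratio are assumptions for illustration. The heuristic treats any line noticeably shorter than the longest line in the file as the end of a paragraph.

```python
def join_by_line_length(text: str, ratio: float = 0.8) -> str:
    """Guess paragraph ends: a line much shorter than the wrap width closes one."""
    lines = text.splitlines()
    # Assumption: the longest line approximates the original wrap width.
    longest = max((len(line.rstrip()) for line in lines), default=0)
    threshold = longest * ratio

    paragraphs = []
    current = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always ends the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(stripped)
        if len(line.rstrip()) < threshold:
            # Short line: probably the last line of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

A guess like this will misfire on headings, on dialogue made of short lines, and on a paragraph whose last line happens to fill the full width, so the output still needs a read-through.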
Old 06-21-2007, 08:47 PM   #9
JSWolf
With the HTML from PG, there is no need to reformat it to remove the extra line spaces. It works just fine in BD as is. And if there are line spaces, they are meant to be there.
Old 06-21-2007, 11:22 PM   #10
mogui
When designing scripts to deal with hard carriage returns, it is good to be able to actually see which character codes are causing the problem. A programmer's editor is the tool to start with for your basic research. You can read more here.
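
If no programmer's editor is at hand, a few lines of Python can give a similar view of the character codes. This is only a sketch: the function name `byte_census` and the `$0d`-style hex labels are choices made for this example. It counts every control byte and every byte with the high bit set, which is usually enough to spot stray carriage returns or odd spacing characters.

```python
from collections import Counter


def byte_census(data: bytes) -> dict[str, int]:
    """Count control bytes and high-bit bytes, keyed by hex labels such as '$0d'."""
    counts = Counter(b for b in data if b < 0x20 or b >= 0x7F)
    return {f"${b:02x}": n for b, n in sorted(counts.items())}
```

Calling it on the result of `open(filename, "rb").read()` shows at a glance whether a file uses `$0d$0a` pairs, bare `$0a`, or something less expected.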
Old 06-22-2007, 08:32 AM   #11
TadW
Uebermensch
Posts: 2,580
Karma: 1094606
Join Date: Jul 2003
Location: Italy
Device: Kindle
If you deal with pre-formatting PG books, also check out their FAQ, which provides some useful tips.

Quote:
There are some applications that specifically assist with auto-converting text into HTML:

* GutenMark http://www.sandroid.org/GutenMark was specifically written for the purpose, and knows enough about PG conventions to do a very good job.

* InterParse http://www.interparse.com is a Windows-based generic text parser that is very easy and intuitive to use.

* The World Wide Web Consortium lists some other options at http://www.w3.org/Tools/Misc_filters.html
Old 06-22-2007, 11:17 AM   #12
mogui
Dealing with ugly line spacing

Let me give an example:
My favorite file format for the Reader is plain old ASCII text. The title on the Reader turns out to be the same as the filename. I like that. I can make text files from many other formats. The middle font size on the Reader is right for normal reading and then I can go one bigger if the lighting is bad. I don't have to experiment a lot when I am in a hurry to read something.

So I got a book in lit format and converted it to lrf. The resultant line formatting was just plain ugly. There were sentence fragments everywhere and way too many spaces between lines. I decided to tighten it up.

I used Amber lit converter (abclit) to convert the original lit file to text. Then I opened the file in PSPad (see earlier post for source). I used the hex display mode to examine the character structure of the ugliness. I noticed that there were $0d$0a pairs everywhere. That is a carriage return line feed combination.

But at the beginning of every real paragraph there was an $a0 character. That is a space character with the high-order bit set (a non-breaking space in Latin-1). I don't know why anybody put that character in there; it is not common. But I liked it because it gave me a way to reformat everything easily.

First I used search and replace to find all the $0d$0a pairs and replace them with $20 (space). Then I replaced all the $a0 characters with $0d$0a pairs. The result was pure beauty! The paragraphs all flowed well and there were no unwanted line spaces.

It took five minutes!
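
The same two search-and-replace passes can be sketched in Python for anyone who wants to repeat them on more than one file. The function name `reflow_with_marker` is hypothetical, and the sketch assumes, as in this particular book, that $a0 appears only at the start of real paragraphs.

```python
def reflow_with_marker(data: bytes, marker: bytes = b"\xa0") -> bytes:
    """Replace every CR LF with a space, then turn each marker byte into a CR LF."""
    flowed = data.replace(b"\r\n", b" ")   # pass 1: $0d$0a -> $20
    return flowed.replace(marker, b"\r\n")  # pass 2: $a0 -> $0d$0a
```

The order of the two passes matters: doing the second one first would insert new $0d$0a pairs that the first pass would then wipe out. A single space is left at the end of each paragraph, which most readers will not show.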

Last edited by mogui; 06-22-2007 at 11:20 AM.
Old 06-22-2007, 11:39 AM   #13
RWood
Technogeezer
Posts: 7,233
Karma: 1596436
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
While MS Word (and OpenOffice) are nice tools for documents, nothing beats Ultra Edit for major work on raw text files. I use it frequently when preparing the Harvard Classics series of books. It is a commercial product, but for me it has been worth it.
Old 08-03-2007, 10:20 PM   #14
monkpalmer
Member
Posts: 10
Karma: 3650
Join Date: Dec 2004
Device: Tungsten TC
Quote:
eBooks formatted such that they use manual line breaks instead of using the word wrap feature, and when transferred to a mobile device, you end up with fragmented lines
I use a freeware program called 'E-book Tidy' that fixes this. I use it for all my Gutenberg texts. It does more besides:

Quote:
Join Lines
Join Quotes
Split Lines at Page Width
Remove Blank Lines
Remove Extra Blank Lines
Add Carriage Return after Paragraph.
Trim Right Spaces
Indent Paragraphs
Unindent Paragraphs
Remove Numeric only lines
Delete initial numeric
Delete trailing numeric
Convert extended Ascii
Remove Extra Spaces
To Uppercase
To Lowercase
To Sentence Case
Invert Case
Convert Single to Double Quotes
Spell Check the document
It's available here:
http://www.simtel.net/product.php%5B...t_page%5D76296
Old 08-03-2007, 11:15 PM   #15
JSWolf
Quote:
Originally Posted by RWood
While MS Word (and OpenOffice) are nice tools for documents, nothing beats Ultra Edit for major work on raw text files. I use it frequently when preparing the Harvard Classics series of books. It is a commercial product; but, for me it has been worth it.
What can Ultra Edit do that Word cannot, or what can Ultra Edit do better?

I've been thinking of trying to find a better text editor than Notepad.