Quick Reformatting of Terrible E-Books


Raventhon
06-04-2007, 12:56 AM
I've been thinking, and as I don't actually know perl myself I'm unable to write the script for it, but a script that did the following would be amazingly useful in formatting eBooks for viewing on mobile devices.

Scan all files in a directory (and subdirectories, hey, why not) and replace all instances of <newline> not immediately followed by either <newline> or <tab> with a single space.

Reasoning behind this: I've seen entirely too many eBooks formatted such that they use manual line breaks instead of using the word wrap feature, and when transferred to a mobile device, you end up with fragmented lines:

"This is a bunch of text serving as example of
incorrect word
wrap due to stupid formatting of eBooks. I really
wish there was
some way to fix it, because it's almost impossible
to read this
terribly formatted text."

Can anyone think of any files they have that this command would damage? I wouldn't want to run it on poetry, but other than that, it seems that this script can be safely run on normally-formatted eBooks without changing anything.
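
For anyone who wants to try the idea without Perl, here is a minimal Python sketch of the rule described above. It is only one possible reading of the request, not a tested tool. One assumption is added on top of the stated rule: a newline that directly follows another newline is also left alone, because otherwise the second newline of a blank-line paragraph break would be turned into a space. The names `unwrap` and `unwrap_tree`, the `*.txt` pattern, and the `.unwrapped` output suffix are all made up for this example.

```python
import re
from pathlib import Path

# A "lone" newline: not preceded by a newline, and not followed by a
# newline, a tab, or the end of the file.
LONE_NEWLINE = re.compile(rb"(?<!\n)\n(?![\n\t]|\Z)")


def unwrap(data: bytes) -> bytes:
    """Replace lone newlines with a single space.

    Works on raw bytes so the file's character encoding does not matter.
    Windows CRLF pairs are normalised to LF first (an assumption made to
    keep the pattern simple).
    """
    data = data.replace(b"\r\n", b"\n")
    return LONE_NEWLINE.sub(b" ", data)


def unwrap_tree(root: str, pattern: str = "*.txt") -> int:
    """Process every matching file under root, including subdirectories.

    The result is written next to the original as '<name>.unwrapped', so
    the source file is never overwritten. Returns the number of files
    for which a new copy was written.
    """
    written = 0
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        original = path.read_bytes()
        fixed = unwrap(original)
        if fixed != original:
            path.with_name(path.name + ".unwrapped").write_bytes(fixed)
            written += 1
    return written
```

Calling `unwrap_tree("ebooks")` on a hypothetical `ebooks` folder would leave the originals untouched, which seems prudent given the poetry concern. Text that relies on single line breaks (verse, tables, code listings) would still be flattened by this rule.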

Raventhon
06-04-2007, 01:00 AM
Note to self: Check previous posts before making new post. Your question may already have been discussed.

mogui
06-04-2007, 01:15 AM
Sometimes it is hard to know what to search for. I am familiar with this thread (http://www.mobileread.com/forums/showthread.php?t=10093&highlight=windows+scripting) and this one too (http://www.mobileread.com/forums/showthread.php?t=9487&highlight=windows+scripting) that discuss scripting and handling the text-formatting problem that concerns you.

I hope this helps.

JSWolf
06-05-2007, 12:46 PM
WYSIWYG Editor is broken

JSWolf
06-05-2007, 12:48 PM
Are you talking about purchased, "downloaded", or books from sites like Project Gutenberg?

Patricia
06-21-2007, 12:13 PM
With Project Gutenberg books in text file format, I just paste them into a word document, then run Stingo's Macro. This only takes a couple of minutes and solves the hard carriage breaks.

JSWolf
06-21-2007, 03:59 PM
With Project Gutenberg books in text file format, I just paste them into a word document, then run Stingo's Macro. This only takes a couple of minutes and solves the hard carriage breaks.
If there is an HTML version available, I go for that one. You'll get images if there are any, and italics. It's not hard to work with the HTML in Book Designer. If you use the text file instead, you lose whatever attributes and images there might be. So please use the HTML when one exists.

nekokami
06-21-2007, 05:46 PM
Stingo's macro just looks for double paragraph marks, doesn't it? Won't help if you have a file that doesn't have an extra line between paragraphs (as often happens with files that have been through a PDF stage somewhere in their history). I've been thinking of writing a perl script to make a "best guess" based on line length. I'll be doing some perl work this summer, and may have a chance to slip it in then. I'll post it somewhere on mobileread (in the wiki, maybe) if I get a reasonable version working.
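
To make the "best guess based on line length" idea concrete, here is a rough Python sketch. It is not the script mentioned above, just one common heuristic: treat the longest line as the wrap width, and assume that any line noticeably shorter than that width is the last line of a paragraph. The function name `guess_paragraphs` and the 0.8 threshold are assumptions picked for illustration.

```python
def guess_paragraphs(text: str, short_ratio: float = 0.8) -> list[str]:
    """Rebuild paragraphs from hard-wrapped text using line length only.

    A line shorter than short_ratio * (longest line) is taken to end a
    paragraph. Blank lines, where present, also end a paragraph.
    """
    lines = [ln.rstrip() for ln in text.splitlines()]
    lengths = [len(ln) for ln in lines if ln]
    if not lengths:
        return []
    width = max(lengths)  # assumed wrap width of the source file

    paragraphs: list[str] = []
    current: list[str] = []
    for ln in lines:
        if not ln:
            # An explicit blank line always closes the current paragraph.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(ln.strip())
        if len(ln) < short_ratio * width:
            # Short line: probably the end of a paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

This guess fails in predictable ways: a paragraph whose last line happens to fill the full width gets merged with the next one, and text that has been wrapped twice at different widths gets split far too often. A real script would probably need a few more clues, such as sentence-ending punctuation or indentation.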

JSWolf
06-21-2007, 09:47 PM
With the HTML from PG, there is no need to have to reformat it to remove the extra line spaces. It works just fine in BD as is. And if there are line spaces, they are meant to be there.

mogui
06-22-2007, 12:22 AM
When designing scripts to deal with hard carriage returns, it is good to be able to actually see which character codes are causing the problem. A programmer's editor is the tool to start with for your basic research. You can read more here (http://www.mobileread.com/forums/showpost.php?p=62702&postcount=12).
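
If no programmer's editor is at hand, a few lines of Python can give a similar overview. This is only a sketch: the function name `suspicious_bytes` is invented here, and "suspicious" is assumed to mean control characters and bytes with the high bit set.

```python
from collections import Counter


def suspicious_bytes(data: bytes) -> Counter:
    """Count control characters and high-bit bytes, keyed as '$xx' hex codes."""
    return Counter(f"${b:02x}" for b in data if b < 0x20 or b >= 0x7F)
```

Reading a file with `open(name, "rb").read()` and passing the result to `suspicious_bytes(...).most_common()` lists the most frequent codes first, which usually makes carriage returns, line feeds, tabs, and any odd paragraph markers stand out.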

TadW
06-22-2007, 09:32 AM
If you deal with pre-formatting PG books, also check out their faq (http://www.gutenberg.org/wiki/Gutenberg:Readers'_FAQ) which provides some useful tips.

There are some applications that specifically assist with auto-converting text into HTML:

* GutenMark http://www.sandroid.org/GutenMark was specifically written for the purpose, and knows enough about PG conventions to do a very good job.

* InterParse http://www.interparse.com is a Windows-based generic text parser that is very easy and intuitive to use.

* The World Wide Web Consortium lists some other options at http://www.w3.org/Tools/Misc_filters.html

mogui
06-22-2007, 12:17 PM
Let me give an example:
My favorite file format for the Reader is plain old ASCII text. The title on the Reader turns out to be the same as the filename. I like that. I can make text files from many other formats. The middle font size on the Reader is right for normal reading and then I can go one bigger if the lighting is bad. I don't have to experiment a lot when I am in a hurry to read something.

So I got a book in lit format and converted it to lrf. The resultant line formatting was just plain ugly. There were sentence fragments everywhere and way too many spaces between lines. I decided to tighten it up.

I used Amber lit converter (abclit) to convert the original lit file to text. Then I opened the file in PSPad (see earlier post for source). I used the hex display mode to examine the character structure of the ugliness. I noticed that there were $0d$0a pairs everywhere. That is a carriage return line feed combination.

But at the beginning of every real paragraph there was an $a0 character. That is a space character with the high-order bit set. I don't know why anybody put that character in there; it is not common. But I liked it because it gave me a way to reformat everything easily.

First I used search and replace to find all the $0d$0a pairs and replace them with $20 (space). Then I replaced all the $a0 characters with $0d$0a pairs. The result was pure beauty! The paragraphs all flowed well and there were no unwanted line spaces.

It took five minutes!
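
The same two search-and-replace passes can be sketched in Python for files that follow this pattern. The function name `reflow_on_marker` is made up, and the approach only works when the file really does carry a distinct marker byte at the start of each paragraph. In the Latin-1 and Windows-1252 encodings, $a0 is the non-breaking space, which may be why a converter used it as a paragraph indent.

```python
def reflow_on_marker(data: bytes, marker: bytes = b"\xa0") -> bytes:
    """Join all lines, then start a new line at each paragraph marker."""
    # Pass 1: every $0d$0a pair becomes a plain space ($20).
    joined = data.replace(b"\r\n", b" ")
    # Pass 2: every marker byte becomes a $0d$0a pair.
    return joined.replace(marker, b"\r\n")
```

The order matters: doing pass 2 first would create new $0d$0a pairs that pass 1 would then wipe out again. Each paragraph ends up with a trailing space before its line break, which most readers ignore.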

RWood
06-22-2007, 12:39 PM
While MS Word (and OpenOffice) are nice tools for documents, nothing beats Ultra Edit for major work on raw text files. I use it frequently when preparing the Harvard Classics series of books. It is a commercial product; but, for me it has been worth it.

monkpalmer
08-03-2007, 11:20 PM
eBooks formatted such that they use manual line breaks instead of using the word wrap feature, and when transferred to a mobile device, you end up with fragmented lines

I use a freeware program called 'E-book Tidy' that fixes this. I use it for all my Gutenberg texts. It does more besides:

Join Lines
Join Quotes
Split Lines at Page Width
Remove Blank Lines
Remove Extra Blank Lines
Add Carriage Return after Paragraph.
Trim Right Spaces
Indent Paragraphs
Unindent Paragraphs
Remove Numeric only lines
Delete initial numeric
Delete trailing numeric
Convert extended Ascii
Remove Extra Spaces
To Uppercase
To Lowercase
To Sentence Case
Invert Case
Convert Single to Double Quotes
Spell Check the document

It's available here:
http://www.simtel.net/product.php%5Burl_fb_product_page%5D76296

JSWolf
08-04-2007, 12:15 AM
While MS Word (and OpenOffice) are nice tools for documents, nothing beats Ultra Edit for major work on raw text files. I use it frequently when preparing the Harvard Classics series of books. It is a commercial product; but, for me it has been worth it.
What can Ultra Edit do that Word cannot or what can Ultra Edit do better?

I've been thinking of trying to find a better text editor than Notepad.

mogui
08-04-2007, 12:19 AM
monkpalmer, does this tool operate on the entire text in a "setup and then process" mode, or do you use it to go through the entire file manually correcting it? Thanks for the link and info.

RWood
08-04-2007, 01:08 AM
What can Ultra Edit do that Word cannot or what can Ultra Edit do better?

I've been thinking of trying to find a better text editor than Notepad.
While not wanting to sound like a commercial for UltraEdit, here is a link (http://www.ultraedit.com/index.php?name=UE_MoreFeatures) to the main features of the product.

I spent many years programming (mainframes, minis, and micros) and this is the best editor I have ever used. It is an excellent replacement for Notepad.

How it differs from Word is mainly in its philosophy of operations. Everything is displayed, there are no hidden codes. (This is wonderful when you are setting up hyperlinks.) It also has a fully functional hex editor that I wrongly thought I would never need with the Harvard Classics series.

There is a 30 or 45 days free trial of the product.

kovidgoyal
08-04-2007, 04:03 AM
What can Ultra Edit do that Word cannot or what can Ultra Edit do better?

I've been thinking of trying to find a better text editor than Notepad.

There's notepad++ which is free.

andym
08-04-2007, 05:05 AM
Could I ask if anyone has any tips on the least labour-intensive way of replacing ASCII quotation marks (' ") with proper curly quotes, on either Mac (TextMate?) or Windows? Thanks.

monkpalmer
08-04-2007, 07:30 AM
mogui, all you have to do is
1. open from within "E-Book Tidy" the text doc you want to reformat
2. press a button.
The program does the whole thing for you in a couple of seconds. There's a preview tab as well, so you can check on how your Gutenberg text is shaping up.

nekokami
08-04-2007, 02:35 PM
If you find you need more flexible search/replace and other text processing functionality and you want to use the same rules on a whole batch of files at once, you might want to look at http://www.datamystic.com/textpipe.html. It's not free, but the "lite" version is under US$50. I'm using the "pro" version for another (non ebook related) task and it's quite powerful, and fairly easy to use. (I have no other connection with this company or product.)

mogui
08-04-2007, 11:48 PM
We often forget the old tools. AWK is available for Windows (http://plan9.bell-labs.com/cm/cs/awkbook/), as are mawk (http://gnuwin32.sourceforge.net/packages/mawk.htm) and gawk (http://gnuwin32.sourceforge.net/packages/gawk.htm). This page (http://www.student.northpark.edu/pemente/awk.htm) links to tutorials and contains some scripts. The textpipe folks say textpipe is better than awk -- quicker to program. Awk is free, has useful variants, and lots of free scripts and tutorials. Your choice.

Awk and sed have been around since the beginning of time. There are forums for getting help and getting scripts. Imagine writing an awk script that formats your Gutenberg text files just the way you like them and then running that script in batch mode on entire directories. Explore the world of awk scripts (http://www.google.com/search?q=awk+script&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a).

Or you can use E-Book Tidy. Thanks monkpalmer. It is always good to have a choice of tools.

Gutenmark (http://www.sandroid.org/GutenMark/index.html) takes Gutenberg text files and converts them to nicely formatted HTML (ftp://ftp.sandroid.org/GutenMark/Foundation/browser.html). Oh, how I wish I could use HTML on my Reader!

Replacing ASCII quote marks with the left and right versions ought not be difficult for an awk script writer. If you want a one-click solution, write the script and then upload it here.

nekokami
08-05-2007, 08:18 PM
Oh sure, awk is very powerful. So is perl. I'm just saying that some categories of repetitive tasks have been needed by so many people that other tools have been created that are easier to use -- for those tasks.

DaleDe
08-10-2007, 07:02 PM
What can Ultra Edit do that Word cannot or what can Ultra Edit do better?

I've been thinking of trying to find a better text editor than Notepad.

I use a Windows version of Emacs. It is a super editor and can do most anything you need.

andym
08-23-2007, 04:36 PM
There seems to be a problem with the GutenMark download pages. I don't suppose anyone has a copy of either the OSX compiled tarball or the Windows compiled Zip that they could upload?

My simple primitive approach to the curly quotes problem was to do a replace all on the '."' ',"' '?"' and then when I've eliminated all the right-hand double quotes simply do a replace all on the left hand double quotes. Similar approach to single quotes and apostrophes.
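
For those who prefer a script to repeated replace-all, the two-stage approach can be sketched in Python as follows. The function name `curl_double_quotes` and the `closers` parameter are invented for this example; the default set of punctuation marks is just the three listed above.

```python
LEFT_DOUBLE = "\u201c"   # left (opening) curly double quote
RIGHT_DOUBLE = "\u201d"  # right (closing) curly double quote


def curl_double_quotes(text: str, closers: str = ".,?") -> str:
    """Curl straight double quotes using the two-stage replace-all idea."""
    # Stage 1: a straight quote right after closing punctuation is a
    # right-hand quote.
    for mark in closers:
        text = text.replace(mark + '"', mark + RIGHT_DOUBLE)
    # Stage 2: every straight quote still left is assumed to be a
    # left-hand quote.
    return text.replace('"', LEFT_DOUBLE)
```

Like the manual version, this gets some cases wrong: a closing quote that follows a letter or an exclamation mark (as in `the "word" here`) is left for stage 2 and comes out as an opening quote. Passing a longer `closers` string such as `".,?!;:"` catches more of them, but a final read-through is still needed.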

DMcCunney
08-23-2007, 05:06 PM
What can Ultra Edit do that Word cannot or what can Ultra Edit do better?

I've been thinking of trying to find a better text editor than Notepad.
There are a boatload of Notepad replacements. Look here for a sampling: http://texteditors.org

Notepad replacements have their own category. Look under TextEditorFamilies.

I'm currently using Notepad++, one of a batch of free, open source text editors based on the Scintilla edit control, but I've used a number of others.

If all you want to do is replace Notepad, Ultra Edit is overkill.
______
Dennis
Collector of Text Editors

HarryT
08-27-2007, 02:27 PM
There seems to be a problem with the GutenMark download pages. I don't suppose anyone has a copy of either the OSX compiled tarball or the Windows compiled Zip that they could upload?

My simple primitive approach to the curly quotes problem was to do a replace all on the '."' ',"' '?"' and then when I've eliminated all the right-hand double quotes simply do a replace all on the left hand double quotes. Similar approach to single quotes and apostrophes.

It is worth remembering that "curly quotes" are not a part of the standard ASCII character set, and hence their display is code page dependent (unless one is dealing with Unicode, of course). They will display correctly on the "Latin 1" code page that most English speaking, and Western European, countries use, but not elsewhere.

It's best to stick with straight ASCII quotes if you want everyone to be able to display your book correctly.
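
A quick way to check which encodings can hold a given character is to try encoding it. The sketch below uses Python's built-in codec names; the function name `encodable_in` is made up. One naming caveat: Python's `latin-1` codec is strict ISO-8859-1, which has no curly quotes, whereas the Windows code page often loosely called "Latin 1" is `cp1252`, which does.

```python
def encodable_in(
    text: str,
    codecs: tuple[str, ...] = ("ascii", "latin-1", "cp1252", "utf-8"),
) -> dict[str, bool]:
    """Report, per codec, whether text can be encoded without errors."""
    result = {}
    for name in codecs:
        try:
            text.encode(name)
            result[name] = True
        except UnicodeEncodeError:
            result[name] = False
    return result
```

Running `encodable_in("\u201cHi\u201d")` shows that the curly quotes fit in `cp1252` and `utf-8` but not in `ascii` or strict `latin-1`, which is the code page dependence described above.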

andym
08-31-2007, 03:21 AM
It is worth remembering that "curly quotes" are not a part of the standard ASCII character set, and hence their display is code page dependent (unless one is dealing with Unicode, of course). They will display correctly on the "Latin 1" code page that most English speaking, and Western European, countries use, but not elsewhere.

It's best to stick with straight ASCII quotes if you want everyone to be able to display your book correctly.

I was talking about converting to html where the entities "“", "”" etc. are supported in html as well as by mobipocket reader and in the IDPF standards. I think you can have both universality and retain the benefits of proper typography (except of course proportional spacing).

HarryT
08-31-2007, 01:17 PM
As long as you do use the HTML entities then you're right - there's no problem. Unfortunately it's very common to find them as straight "characters" in files, which then don't display correctly for everyone.
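
When a file already contains the raw curly characters, they can be swapped for the named HTML entities before conversion. A minimal Python sketch follows, assuming the text has already been decoded correctly; the function name `quotes_to_entities` is invented, and the table covers only the four quote characters.

```python
# Raw curly quote characters mapped to their named HTML entities.
QUOTE_ENTITIES = {
    "\u201c": "&ldquo;",  # left double quote
    "\u201d": "&rdquo;",  # right double quote
    "\u2018": "&lsquo;",  # left single quote
    "\u2019": "&rsquo;",  # right single quote or apostrophe
}

_TABLE = str.maketrans(QUOTE_ENTITIES)


def quotes_to_entities(text: str) -> str:
    """Replace raw curly quotes with HTML entities; leave everything else."""
    return text.translate(_TABLE)
```

The resulting file is plain ASCII as far as the quotes are concerned, so it no longer depends on the reader guessing the right code page.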

andym
08-31-2007, 01:39 PM
As long as you do use the HTML entities then you're right - there's no problem. Unfortunately it's very common to find them as straight "characters" in files, which then don't display correctly for everyone.

D'oh! The html entities in my last post showed as quotation marks (doesn't seem to be a way to disable html).

If anyone is interested, there's a list of the html entities supported by mobipocket here (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=supported_char_entities.htm). Unfortunately, the formatting is pretty scrambled. It seems to be the same as the Open eBook list which you can download from the idpf.org site - the link is here: http://www.idpf.org/oebps/oebps1.2/download/oeb12-dtd.zip. The list has the suffix .ent but it should be possible to open it in a decent text editor. [Edit - direct link here: http://openebook.org/dtds/oeb-1.2/oeb12.ent opens in a browser]

Elltrain
09-18-2007, 02:12 AM
No love for vim in here? :angry:

DMcCunney
09-21-2007, 01:49 PM
No love for vim in here? :angry:
There is from me.

But then, I collect text editors, and have a lot installed to look at.
______
Dennis

mogui
09-22-2007, 01:07 AM
Text editors are like religions. Everyone defends their favorite. Mine was emacs for a long time. Friends tried to convert me to vi. I saw its advantages but I had many useful habits developed over many years in emacs.

Any text editor I use now has to have good macro facilities. My current favorite is PSPad (http://www.pspad.com/). It does everything I need, it is free, and it will display and edit in hex mode -- very useful! (forum here (http://forum.pspad.com/))

I know. Now someone will say, "Mine is better!"

jasonkchapman
09-22-2007, 11:55 AM
I get a kick out of text editor brawls. The vi-emacs battles tend to be the most entertaining. They make Windows-Mac dustups look like tea parties.

nekokami
09-22-2007, 03:16 PM
I get a kick out of text editor brawls. The vi-emacs battles tend to be the most entertaining. They make Windows-Mac dustups look like tea parties.

:roflmfao:

(Never had a good excuse to use this smiley before -- thanks!)

mogui
09-22-2007, 11:08 PM
How do you make a vi user laugh on Saturday?

Tell him a joke on Wednesday.

RWood
09-22-2007, 11:26 PM
How do you make a vi user laugh on Saturday?

Tell him a joke on Wednesday.
I remember when that one used "DEC Systems Programmer" rather than "vi user." It was still funny. Then again there was the one about the IBM/360 Internals person with the catch line of "tell him a joke last spring."

brewt
10-02-2007, 11:39 PM
TextSpresso from Taylor Design:
http://www.taylor-design.com/textspresso/overview.htm

Sure, it's $25. And the included editor sucks. But, it will work on batches of text files with filters to change things like this:

"This is a bunch of text serving as example of
incorrect word
wrap due to stupid formatting of eBooks. I really
wish there was
some way to fix it, because it's almost impossible
to read this
terribly formatted text."

to this:

"This is a bunch of text serving as example of incorrect word wrap due to stupid formatting of eBooks. I really wish there was some way to fix it, because it's almost impossible to read this terribly formatted text."

through the clipboard.

About 300 built-in filters, with BASIC programmability, user-defined filters, etc.

Oh, and it wreaks havoc in auto-formatting/stylizing editors, like Book Designer. But it's great in Word, or TED: http://jsimlo.sk/notepad/. Can't keep house without it.

-bjc