MobileRead Forums > E-Book Readers > Sony Reader

Old 12-22-2007, 08:30 AM   #1
dstampe
Posts: 50
Karma: 17
Join Date: Jan 2007
Location: Canada
Device: Sony PRS-500
Cleaning books--Book Designer or other?

I have a number of rather battered e-book files that I need to massage into something useable. The biggest problem with these is that the paragraphs have been broken up into individual lines, in a way that is not easily recoverable into flowable unitary paragraphs.

For many cases, I've managed to create rather convoluted sequences of search-and-replace operations in Word to join the paragraphs back together, provided the original paragraphs were consistently flagged in some way (a blank line after, or indentation with spaces/tabs). However, some documents don't even have these. Plus, the process is rather interactive and time-consuming.

Also, there is the issue of all the chapter headings being wiped out, then having to be searched for and separated from the rest of the text. Sometimes search and replace can be used for this, but this varies from book to book.

Are there any tools that could automate or at least streamline this process?

I've considered Book Designer for this, as a few experiments have shown that it can sometimes recreate the original paragraphs in an acceptable way (breaking dialog, etc). But I'm not sure whether it can do what I want or will just create a new mess to clean up. I've attempted to use it a number of times, but have always been stymied by one or more issues. Maybe someone can clear these up for me. Maybe some of these are obvious, but the sketchy help files, busy interface and tiny text mean that I have trouble seeing some things with my poor vision.

My ideal goal would be to use BookDesigner as a single tool to extract the text from PDF, PDB, LIT, and text files without going through the Word conversion and preclean stage. I would prefer RTF output, as this is the format I read in. It would also be ideal if original styles from DOC, RTF, and PDF files were left intact (italics seem most important). I am not interested in producing LRF eBooks because of the lack of left-justification, which is needed for using large fonts properly. I am also not interested in pretty formatting--page breaks before chapter headings would be nice if BD does not have the flexibility to add 4 blank lines before (which I gather it doesn't).

Here are some of the issues that are keeping this from happening:

- I would prefer to save output in RTF format, but this always has hyphenation and indentation garbage added. I've enabled advanced RTF output, but this doesn't seem to do anything, and no options dialog comes up either from "Save As" or the "Make eBooks" route.

- Identifying chapters by keywords seems fine for "Chapter" and "CHAPTER" keywords, but what about numbered and Roman numeral chapter labels? Does someone have a list of chapter keywords that will work for most cases?

- How well does the "reformat completely" option really work at recombining paragraphs? I read that some users prefer to use Word to recombine Project Gutenberg books before using BD--this would seem to imply it doesn't work too well.

- Does BD preserve italics originally present in PDF and DOC/RTF files?

- Is there an easy way to strip headers/footers from pages during import? I have some text, PDB, and PDF files where it looks like someone has taken a perfectly good text document with flowed paragraphs, paginated it, and added headers and footers, which then have to be stripped again before the book can be adapted to a new reader or font size.


Anyway, all help would be appreciated. I'd like to get my tool set into better order before beginning another round of book cleanup, and a paragraph joining tool and chapter finding tool are the most needed items.
Old 12-22-2007, 02:11 PM   #2
Patricia
Reader
Posts: 11,520
Karma: 2199070
Join Date: May 2007
Location: South Wales, UK
Device: Sony PRS-500, PRS-505, Asus EEEpc 4G
Regarding Chapters:
In BD, go to configuration, then settings. At the top left-hand side of the pane, you can set the words to be chosen as titles. So, you could choose 'Chapter' or whatever. But you can make a list and save it.

Stripping headers and footers can be automated by using regular expressions.

BD does usually preserve italics.
Old 12-22-2007, 04:07 PM   #3
JSWolf
Resident Curmudgeon
Posts: 36,219
Karma: 17169472
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad, nook STR
Yes, BD does preserve italics. But it can't always fix your paragraphs. The problem is that you've downloaded these books instead of purchasing them. Some downloaded books can be a right mess. It would be easier to just see if you can find them at an online ebook shop and purchase them. If any are not available, you'll just have to purchase the pbook so you can use that to fix the downloaded ebooks.
Old 12-23-2007, 08:49 AM   #4
dstampe
I do own all of these books, mostly in paperback. I have a huge book collection, which I can no longer easily read due to vision and other problems (I have become very allergic to newsprint/cheap paper). My goal is to replace at least some of the books I own with large-print electronic versions that are portable. When possible I do look for ebook versions to purchase, but except for Baen books I can't easily read most of these either, due to the fonts the publishers chose. So I have to reformat the books to be able to read them easily. Plus, being in Canada, a lot of the newer sources (Connect/Amazon) are not accessible for purchase.

If I could, I'd still be reading paper books as the selection is far superior. For a while I bought only new hardcovers, but even this no longer works. It's not a cost issue--I used to spend more on books than groceries. Having access to even part of my collection again is such a pleasure.

Anyway, I didn't want to bring all that up, and DEFINITELY not to get sidetracked into (yet another) discussion of legal stuff. Having all the out-of-print books I own scanned is not an option; even the damaged stuff available is better than the results of raw scans. All I'm looking for are some tools to make it easier to read books.

I am not after perfection; I can live with occasional misspellings and formatting errors. Small fonts, or bad serif fonts (like the default Reader font), make it impossible for me to read at all. Some errors, such as missing paragraph breaks or missing quotes, do make reading much slower. Others, such as em-dashes being replaced by en-dashes, missing or added hyphens, chapter headings joined into the text, or full versus left justification for large fonts, are just annoying. (I have seen all of these in purchased e-books; it looks like the publisher just scanned a printed book and then did a quick proofreading.)
Old 12-23-2007, 10:47 AM   #5
JSWolf
Since you have the pbooks, just load into BD and use the pbook to edit the ebook.
Old 12-23-2007, 03:18 PM   #6
RWood
Technogeezer
Posts: 7,233
Karma: 1596436
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
Backing up a minute. For those books that do have a blank line between paragraphs -- like those from Project Gutenberg -- I suggest you try Stingo's Word Macro (available through the MobileRead Wiki, link on the top left of this page.) I use it for preprocessing files before I load them into BD. Many of the other files could be converted to this standard form.

While BD is good, I have found it far better to do major editing outside the program rather than inside the program. Others disagree. Your mileage may vary.
Old 12-23-2007, 04:30 PM   #7
JSWolf
I've found that in most cases, BD doesn't keep the line spaces. So what I do is use the source to view how to format.
Old 12-23-2007, 06:06 PM   #8
dstampe
So the consensus seems to be to avoid using BD for this process if possible. Bummer.

I've seen the macros, and developed similar ones that handle a lot of the reformatting when there are ways to identify paragraph ends (double line breaks or indents). I haven't had as much luck creating "plausible" paragraphs where there are none, or identifying chapters labelled with numbers or Roman numerals.

The idea I had for detecting plausible paragraph ends relies on finding line breaks that are preceded by sentence ends [.!?] and followed by sentence starts [A-Z]. This would also have to handle the case of quote marks (apostrophes for British books) wrapping the sentence start/end. I really don't know how well this would work; I suspect it will be correct in 90% of cases. The 10% of errors will be acceptable, unless the error splits dialog between quotes. The problem is that tagged substitution is not as easy to use for flagging blocks as a real parser would be.
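
For anyone who would rather script this than fight with tagged substitution, here is a rough Python sketch of the heuristic described above: a line break counts as a paragraph end only if the previous line ends a sentence (optionally followed by closing quotes) and the next line starts with a capital (optionally preceded by opening quotes). The function names and the exact sets of quote characters are assumptions made for illustration, and the sketch makes no attempt to keep quoted dialog together.

```python
import re

# Quote characters allowed after sentence-ending punctuation and before a
# sentence start. These sets are assumptions; adjust them per book.
CLOSERS = "\"'\u201d\u2019)"
OPENERS = "\"'\u201c\u2018("

ENDS_SENTENCE = re.compile(r"[.!?][" + re.escape(CLOSERS) + r"]*$")
STARTS_SENTENCE = re.compile(r"^[" + re.escape(OPENERS) + r"]*[A-Z]")


def is_paragraph_break(prev_line, next_line):
    """Guess whether the break between two lines ends a paragraph."""
    return bool(
        ENDS_SENTENCE.search(prev_line.rstrip())
        and STARTS_SENTENCE.match(next_line.lstrip())
    )


def rebuild_paragraphs(lines):
    """Join hard-wrapped lines into paragraphs using the heuristic above."""
    paragraphs = []
    current = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no information in this sketch
        if current and is_paragraph_break(current, line):
            paragraphs.append(current)
            current = line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs
```

A wrapped line that happens to end a sentence, followed by a line that starts with a capital, is split even when both belong to the same paragraph, so the result still needs a read-through.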

Then there's the issue of recognizing chapter headings. It would be best to have a search criterion that matches all possible "hits" at once, but I'm not sure regular expressions can handle numbers, text, and Roman numerals in the same search string.
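
On the regular-expression question: alternation does let a single pattern cover keywords, plain numbers and Roman numerals at once. Below is a rough sketch using Python's `re` module rather than Word or BD. The keyword list, the 60-character limit on a trailing title and all of the names are assumptions for illustration, not anything BD uses.

```python
import re

# Roman numerals up to 3999; the lookahead stops the pattern matching nothing.
ROMAN = r"(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
NUMBER_WORDS = r"(?:one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve)"

# "Chapter 12", "CHAPTER IV: The Storm", "Part Two" (optional short title)
KEYWORD = (
    r"(?:chapter|part|book)\s+(?:\d+|" + ROMAN + "|" + NUMBER_WORDS + r")\b"
    r"[\s.:\-]*.{0,60}"
)
# A line holding nothing but a number: "7" or "7."
BARE_NUMBER = r"\d{1,3}\.?"
# A line holding nothing but an upper-case Roman numeral: "XIV" or "XIV."
# (?-i:...) turns case-insensitivity off so words like "mix" are not matched.
BARE_ROMAN = r"(?-i:" + ROMAN + r")\.?"
NAMED = r"(?:prologue|epilogue)"

CHAPTER_HEADING = re.compile(
    r"^\s*(?:" + "|".join([KEYWORD, BARE_NUMBER, BARE_ROMAN, NAMED]) + r")\s*$",
    re.IGNORECASE,
)


def find_chapter_headings(lines):
    """Return (line_number, text) for every line that looks like a heading."""
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if CHAPTER_HEADING.match(line)
    ]
```

Bare numbers and numerals only count when they are alone on the line, which keeps sentences such as "I went to the store." from matching. An ordinary sentence that begins with "Chapter one ..." can still be a false hit, so the list is best treated as candidates to review.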

Any thoughts on these?
Old 12-30-2007, 07:03 PM   #9
dstampe
I have been working on some macros using Word, and these seem to do a fairly good job of splicing text. Somewhat crippled by the lack of full regular expressions in Word's search and replace, though.

In these examples, I have used "_" for a space and "\" for a backslash. I am just giving the find/replace strings, and am not going to do macro code examples. Someone else can test these on the current version of Word if they want, then summarize the macros. This is just to pass on the ideas.

The basic sequence is:

1) Ensure all paragraph marks are cleaned, so that these can be used in wildcard search (using the ^13 code):
Wildcards OFF:
Find: ^p
Replace: ^p

2) Clean spaces from beginnings and ends of lines:
Wildcards OFF:
Find: ^w^p
Replace: ^p
Find: ^p^w
Replace: ^p
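
Outside Word, the same two clean-up steps can be approximated in a few lines of Python. This is only a sketch of a roughly equivalent operation, not a translation of Word's behaviour: the function name is made up, and treating non-breaking spaces as strippable white space is an assumption.

```python
def normalize_lines(text):
    """Unify line endings, then trim white space at both ends of every line."""
    # Step 1 analogue: make every line break a plain "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 2 analogue: drop spaces, tabs and non-breaking spaces around each line.
    return "\n".join(line.strip(" \t\u00a0") for line in text.split("\n"))
```

Blank lines are left in place, so any later step that relies on them to mark paragraphs still works.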

Then clean up any unwanted blank lines, headers, footers, etc. Some books may also have quotes moved onto separate lines; these need to be merged onto the proper line as well.
It is also a good idea to remove hyphens at the ends of lines (this should be done one by one):

1) Remove hyphens at end of lines (use interactive replace, check that text AFTER hyphen is not a full word)
Wildcards ON
Find: -^13{1,3}([a-z])
Replace: \1

2) Remove any dangling quotes (may be uncommon). Note this is crippled because Word cannot search for "zero or more" of a search item:
Wildcards OFF
Find: ^p"^p
Replace: "^p
or
Wildcards ON
Find: ^13"^13
Replace: "^p

3) Headers and footers can be a problem. One idea is to look for isolated lines with blank lines before and after, with numbers in them. This example looks for a line with a length of up to 60 characters. It uses "[!^13]" rather than "?" to force it to look at a single line. You can add matching for a number "<[0-9]@>" before or after the "[!^13]{1,60}" item. Another alternative is to check for capitalized letters "[A-Z]{5,}" somewhere in the line.
Of course, the replace here needs to be done interactively. It's a pain in Word sometimes, as the top of the found text is usually off the top of the display:

Wildcards ON
Find: ^13{2,6}[!^13]{1,60}^13{2,6}
Replace: <nothing>
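
Since both the hyphen fix and the header/footer removal need a human to confirm each hit, a script can simply list the candidates instead of changing anything. The Python sketch below does that for plain text with "\n" line breaks. The function names are invented for illustration, the 60-character limit and the five-capitals test are taken from the description above, and treating the first and last line of the file as "isolated" is an added assumption.

```python
import re

# A word part, a hyphen at the end of a line, then a lowercase continuation.
HYPHEN_BREAK = re.compile(r"(\w+)-\n{1,3}([a-z]\w*)")


def hyphen_candidates(text):
    """List (first_half, second_half, joined) for each end-of-line hyphen."""
    return [
        (m.group(1), m.group(2), m.group(1) + m.group(2))
        for m in HYPHEN_BREAK.finditer(text)
    ]


def header_footer_candidates(text, max_len=60):
    """List (line_number, line) for short isolated lines that contain a
    number or a run of five or more capitals."""
    lines = text.split("\n")
    hits = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or len(stripped) > max_len:
            continue
        # Assumption: the start and end of the file count as blank neighbours.
        blank_before = i == 0 or not lines[i - 1].strip()
        blank_after = i == len(lines) - 1 or not lines[i + 1].strip()
        has_number = re.search(r"\b\d+\b", stripped)
        has_caps = re.search(r"[A-Z]{5,}", stripped)
        if blank_before and blank_after and (has_number or has_caps):
            hits.append((i + 1, stripped))
    return hits
```

The joined form is returned alongside both halves so that real compounds can be spotted: "well-" followed by "known" would be joined as "wellknown", which is exactly the case the interactive check is there to catch.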


Then the workhorse joining can be done:

1) join line with lowercase at start to previous line:
Wildcards ON
Find: ^13([a-z])
Replace: _\1

2) join line with lowercase at end to next line:
Wildcards ON
Find: ([a-z])^13
Replace: \1_

3) join line with comma at end to next line:
Wildcards ON
Find: ,^13
Replace: ,_

These simple replacements handle most books pretty well. Most other cases are ambiguous unless quotes are taken into account and are rare in practice. The longer the lines of text are, the fewer the errors.
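
For comparison, the three joining rules translate almost directly into Python regular expressions. This is a sketch under a couple of assumptions: the text uses plain "\n" line breaks, and, unlike the Word strings above, the patterns refuse to join across a blank line so that any surviving paragraph gaps are kept. The function name is made up.

```python
import re


def join_lines(text):
    """Apply the three joining rules, leaving blank-line gaps untouched."""
    # 1) A line starting with a lowercase letter joins the previous line.
    text = re.sub(r"(?<!\n)\n([a-z])", r" \1", text)
    # 2) A line ending with a lowercase letter joins the next line.
    text = re.sub(r"([a-z])\n(?!\n)", r"\1 ", text)
    # 3) A line ending with a comma joins the next line.
    text = re.sub(r",\n(?!\n)", ", ", text)
    return text
```

Line breaks after sentence-ending punctuation are left alone, which is where the ambiguous cases mentioned above remain.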