Quote:
Originally Posted by murraypaul
Which is where being a programmer comes in handy. 
I've now got a utility that will scan through my library and promote PalmDoc to Text, PDF to Text, Text to Html, Html to LRF or Html to ePub, combine a series into a single LRF or ePub, and dump the final files into a directory I can import into calibre to load onto the 505.
That way I can keep my books in whatever format I already have them, and get them automatically promoted up to the right end format. I would have thought you could do some of this with calibre's library functions.
Writing a program or script that can re-flow plain text documents so that they sort of work OK on the PRS is not really a problem. Applying such a program to a batch of files is also not a problem.
The problem -- or so it seems to me -- is that it's exceptionally difficult to implement a program or script that will re-flow text documents _well_, in a completely hands-off manner. And that's what you need to batch-convert a whole heap of stuff. Unless you're not very fussy, I suppose.
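The easy case really is easy. A minimal sketch in Python (a hypothetical `naive_reflow` of my own, assuming blank lines separate paragraphs -- which, as the list below shows, is exactly the assumption you can't rely on):

```python
import re

def naive_reflow(text: str) -> str:
    """Re-flow hard-wrapped text, assuming blank lines separate paragraphs.

    This handles only the tidy case; it falls apart on indent-only
    paragraph breaks, double-spaced text, embedded page headers, etc.
    """
    # Split into paragraphs on one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Join each paragraph's wrapped lines and collapse runs of whitespace.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

That's maybe ten minutes' work; the remaining 99% of the effort goes into files that don't fit the assumption.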
Of course, you might be lucky. You might have a heap of text documents that are all very similar in format. They might all have neat right margins, nice spacing between paragraphs, nice spacing under titles, etc. But my experience is the opposite. Browsing through a few of the text ebooks I've got on my PC I find
* Files with average 50-100 characters per line
* Files with no hard line breaks at all, except at the ends of pages
* Files where para breaks are indicated only by space indents
* Files with blank lines between text lines (like double-spacing on a typewriter)
* Files with extra white-space padding at the start of line (so you can't assume that white space at the start is a para break)
* Files with hard page feeds and page numbers embedded
* Files with other headers and footers interspersed with text
* Files with no spacing between titles and text
* Files with weird (Microsoft?) symbols where there should be quote marks and hyphens
* Purported text files that actually have HTML tags and entities in them
* Files with word hyphenation hard-coded (so you can't distinguish between a dash and a word split by a hyphen)
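To give a flavour of what "guessing" would involve, here's a hypothetical Python sketch of a few detection heuristics. The function name and thresholds are my own invention, each check flags just one of the anomalies above, and real files usually mix several -- which is precisely why hands-off detection is so hard:

```python
import re
from statistics import mean

def sniff_layout(text: str) -> dict:
    """Guess which layout quirks a plain-text ebook exhibits.

    Hypothetical heuristics with arbitrary thresholds; a real
    detector would need far more context than line statistics.
    """
    lines = text.splitlines()
    nonblank = [ln for ln in lines if ln.strip()]
    blanks = len(lines) - len(nonblank)
    return {
        # Very long average lines suggest no hard wrapping at all.
        "soft_wrapped": bool(nonblank) and mean(len(ln) for ln in nonblank) > 200,
        # Double spacing: roughly as many blank lines as text lines.
        "double_spaced": len(nonblank) > 1 and blanks >= len(nonblank) // 2,
        # Indent-only paragraph breaks: indented lines but almost no blanks.
        "indent_paragraphs": any(ln.startswith(("    ", "\t")) for ln in nonblank)
                             and blanks <= len(nonblank) // 10,
        # Hard page feeds embedded in the text.
        "page_feeds": "\f" in text,
        # Leftover HTML tags or entities in a "text" file.
        "html_residue": bool(re.search(r"</?[a-zA-Z][^>]*>|&[a-z]+;", text)),
    }
```

Even this toy version shows the trap: a file with blank lines between every text line looks "double-spaced" to the code, but it might just be a short file of one-line paragraphs.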
The code I've written can cope with all these anomalies, but it can't _guess_ which ones are present in a given file. I don't know of any program that can but, if you do, I'd love to know about it. I'd love to be able to batch-process my 9,000 or so text ebooks and be sure of getting a reasonable read on the PRS for every one.
In practice, my experience is that, even with software, I have to manually inspect and tweak pretty much every text document I want to put on the reader.