03-28-2013, 02:34 PM   #13
BobC
Guru
Posts: 691
Karma: 3026110
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro +
Did you download the original from Archive.org?

If so, it's not surprising that there is a lot of poor formatting. The OCR'd text in those books was really intended to provide a text search layer for the DJVU (and perhaps PDF) versions, so it wasn't cleaned up before some process was used to reformat it as EPUB. You sometimes find whole pages that were never OCR'd: the original image shows up in the DJVU but is not mapped to any text.

The point here is that something like OpenOffice/LibreOffice is an ideal way to edit the basic words, sentences, etc. in the document without having to worry about the HTML markup when doing search & replace.

My approach for these books is to download the text file and edit it in OpenOffice using a set of macros for general text tidying, then either import the resulting .odt file into Calibre or use the W2Epub extension to produce the EPUB.
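
For anyone who would rather script the tidying step than write OpenOffice macros, here is a minimal Python sketch of the sort of generic clean-up I mean. It is not my actual macro set. The function name tidy_ocr_text and the particular rules (rejoin words split by a hyphen at a line end, unwrap hard line breaks, collapse runs of spaces) are just assumptions about what "general text tidying" usually involves.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Apply a few generic clean-ups to OCR'd plain text (hypothetical helper)."""
    # Normalise Windows/Mac line endings so the rules below see only "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words hyphenated across a line break: "for-\nmatting" -> "formatting".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for para in paragraphs:
        # Unwrap hard line breaks and collapse runs of whitespace to one space.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)

    # One blank line between paragraphs in the output.
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The for-\nmatting is  poor\nhere.\n\n\nNext   para."
    print(tidy_ocr_text(sample))
```

Be aware that the hyphen rule will also join genuine compounds that happen to break at a line end (well-\nknown becomes wellknown), so the result still needs a read-through against the page images.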

I think the d/l'd text files are in markdown format, so you could do an initial conversion to extract headers etc. using some other s/w, or use macros to translate headings etc. Exporting the text using a DJVU viewer gives simple text without the markup. I'm pretty sure that neither has any italic or bold formatting in the extracted text, and the only way to re-introduce it is by comparing the image version with the corresponding text. It's a painful process, as I know, and one made worse if your book is full of dialect or foreign words, which show up as misspellings.
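
If the headings really are marked markdown-style, translating them could look something like the sketch below. It assumes headings are lines starting with one to six # characters, which I haven't verified for these files. The name headings_to_html and the choice of HTML h1 to h6 tags as the output are only for illustration.

```python
import re
from html import escape

# Assumed heading form: 1-6 leading '#', a space, the title, optional trailing '#'.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def headings_to_html(text: str) -> str:
    """Turn '#'-style heading lines into <h1>..<h6>; leave other lines alone."""
    out = []
    for line in text.split("\n"):
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            title = escape(match.group(2))  # protect &, <, > in the title
            out.append(f"<h{level}>{title}</h{level}>")
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    print(headings_to_html("# Chapter I\nIt was a dark night."))
```

Body text is passed through untouched, so this only helps with the structure. Italics and bold still have to be put back by hand.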

BobC