Old 12-08-2008, 04:37 PM   #5
BobC
Addict
Posts: 339
Karma: 245756
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, Various Android Apps
DJVU files are quite complex, consisting of a number of "layers". The text layer is OCR'd from the original scans and, as noted earlier, is flaky in parts (some files I have downloaded have whole pages of OCR'd text missing). I know there are separate foreground and background image layers, because you can switch off the background layer for easier viewing.

You can extract the text from the DJVU, but you end up with the same text that you can download directly from the Internet Archive.
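
For anyone who wants to try the extraction themselves, here is a rough sketch in Python. It assumes the DjVuLibre command-line tools are installed and that djvutxt is on your PATH; the function names and the file names are just mine for illustration, not anything the Internet Archive uses.

```python
import shutil
import subprocess
from pathlib import Path


def build_djvutxt_command(djvu_path, txt_path, page=None):
    """Build the argument list for DjVuLibre's djvutxt tool."""
    cmd = ["djvutxt"]
    if page is not None:
        # --page restricts extraction to one page (1-based) instead of the whole file.
        cmd.append(f"--page={page}")
    cmd += [str(djvu_path), str(txt_path)]
    return cmd


def extract_text(djvu_path, txt_path, page=None):
    """Write the hidden text layer of a DJVU file to txt_path and return it."""
    # Assumption: DjVuLibre is installed; fail early with a readable message if not.
    if shutil.which("djvutxt") is None:
        raise RuntimeError("djvutxt not found; install DjVuLibre first")
    subprocess.run(build_djvutxt_command(djvu_path, txt_path, page), check=True)
    # OCR output can contain odd bytes, so replace anything that will not decode.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling extract_text("book.djvu", "book.txt") should leave you with much the same text as the .djvu.txt download, flaky pages included.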

As the PDFs are not text searchable, I think they are just image containers with no text layer.

The "text" layer in the DJVU lets you search the text and display the corresponding page image.
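
The same search-then-find-the-page idea can be imitated on the extracted text. This is only a sketch, and it assumes the extracted text separates pages with form feed characters; check your own file before relying on that. The function name find_pages is made up for the example.

```python
def find_pages(text, term):
    """Return the 1-based numbers of the pages whose text contains term."""
    # Assumption: each page of extracted text is separated by a form feed.
    pages = text.split("\f")
    needle = term.lower()
    # Case-insensitive match, since OCR'd text is inconsistent about capitals.
    return [number for number, page in enumerate(pages, start=1)
            if needle in page.lower()]
```

The page numbers it returns are positions in the scan, so they tell you which page image to open for checking the OCR against the original.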

Bottom line: either use the DJVU and extract the text from it, or just grab the .djvu.txt file, depending on whether you want to manually edit the text to align it with the original before converting to eBook format. Both versions suffer from page numbers, headers, etc. being interspersed with the text.
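
A first pass at clearing out the page numbers and running headers might look like the sketch below. It is a crude heuristic of my own, not a proper cleanup tool: it drops lines that are nothing but a short number, and short lines that repeat at least min_repeats times (a threshold I picked arbitrarily).

```python
import re
from collections import Counter

# A line consisting only of one to four digits is treated as a page number.
PAGE_NUMBER = re.compile(r"\d{1,4}")


def strip_page_furniture(text, min_repeats=3, max_header_length=60):
    """Drop bare page numbers and short lines that repeat like running headers."""
    lines = text.splitlines()
    # Count how often each non-blank line occurs across the whole text.
    counts = Counter(line.strip() for line in lines if line.strip())
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if PAGE_NUMBER.fullmatch(stripped):
            continue  # bare page number
        if (stripped and len(stripped) <= max_header_length
                and counts[stripped] >= min_repeats):
            continue  # probably a running header or footer
        cleaned.append(line)
    return "\n".join(cleaned)
```

It will also throw away genuine short lines that happen to repeat, so the result still needs checking by eye against the page images.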

BobC