MobileRead Forums - View Single Post - Confused about DJVU files and converting to LRF

BobC · 12-08-2008, 04:37 PM

DJVU files are quite complex, consisting of a number of "layers". The text contained in them is OCR'd from the original scans and, as noted earlier, is flaky in parts (some files I have downloaded have whole pages of OCR'd text missing). I know there is a foreground and a background image layer, as you can switch off the background layer for easier viewing.

You can extract the text from the DJVU, but you finish up with the same text as you can download direct from the Internet Archive.
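For anyone who wants to try the extraction, here is a rough sketch in Python that calls the djvutxt tool from DjVuLibre. It assumes djvutxt is installed and on your PATH; the file names and the helper function names are made up for illustration, and the page-range option is worth checking against your version's man page.

```python
import subprocess


def build_djvutxt_command(djvu_path, txt_path, pages=None):
    """Build the djvutxt command line (hypothetical helper, not part of DjVuLibre)."""
    cmd = ["djvutxt"]
    if pages is not None:
        # Assumption: a page range such as "1-10" is passed via --page=
        cmd.append(f"--page={pages}")
    cmd += [djvu_path, txt_path]
    return cmd


def extract_text(djvu_path, txt_path, pages=None):
    """Dump the hidden OCR text layer of a DJVU file to a plain text file."""
    # check=True raises CalledProcessError if djvutxt reports a failure
    subprocess.run(build_djvutxt_command(djvu_path, txt_path, pages), check=True)


if __name__ == "__main__":
    # Example file names only; substitute your own download.
    extract_text("book.djvu", "book.txt")
```

Pages with no OCR text layer will simply come out empty, which matches the missing pages mentioned above.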

As the PDFs are not text searchable, I think they are just image containers.

The "text" layer in the DJVU enables you to search the text and display the corresponding image page.

Bottom line is - either use the DJVU and extract the text from it, or just grab the .djvu.txt file, depending on whether you want to manually edit the text to align with the original before converting to eBook format. Both versions suffer from page numbers, headers etc. being interspersed with the text.
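As a starting point for tidying up the page numbers and headers, something like the Python sketch below might help. It is only a heuristic under my own assumptions: a line holding nothing but a short number is treated as a page number, and a short line that repeats several times is treated as a running header. The function name and thresholds are invented for illustration, and you would still need to proofread the result by hand.

```python
import re
from collections import Counter

# Assumption: a page number is a line containing only 1 to 4 digits
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def strip_page_furniture(text, min_repeats=3, max_header_len=60):
    """Drop lone page numbers and short lines that repeat often (likely headers)."""
    lines = text.splitlines()
    # Count non-blank lines, ignoring case and surrounding whitespace
    counts = Counter(line.strip().lower() for line in lines if line.strip())
    kept = []
    for line in lines:
        key = line.strip().lower()
        if PAGE_NUMBER.match(line):
            continue  # looks like a page number
        if key and len(key) <= max_header_len and counts[key] >= min_repeats:
            continue  # looks like a running header or footer
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "MY BOOK\nFirst line.\n12\nMY BOOK\nSecond line.\n13\nMY BOOK\nThird line.\n"
    print(strip_page_furniture(sample))
```

Be aware this will also remove genuine short lines that happen to repeat (a refrain in a poem, say), so raise min_repeats or check the output if the book has that sort of content.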

BobC
