Confused about DJVU files and converting to LRF

BBRags · 12-02-2008, 06:21 AM

I'm trying to convert a book at the Internet Archive into an LRF file for my Sony 505. The text is available in TXT, DJVU and PDF formats. The TXT file isn't pure text, but contains some HTML. The name of the TXT file also contains the letter sequence DJVU, suggesting that it is related to the DJVU format. Meanwhile, the actual DJVU file is about 20 MB, roughly 20x the size of the TXT file, and the PDF file is around 50 MB.

Which is the best file format for converting to LRF? What is the best program for converting one of these formats to compact LRF?

Patricia · 12-04-2008, 07:37 PM

I believe that the text files are automatic conversions from the djvu files, and the file extension is preserved, even though the result is txt.

I convert the text files in exactly the same way as I convert any other text file. I drop the file into a Doc and then edit, before using either Book Designer or Calibre for the final conversion.
Unfortunately, the Internet Archive text files are of very poor quality, and require many hours of proofreading before conversion.
1. You will need to strip out the headers and footers.
2. You need to restore the italics.
3. You need to correct the OCR errors.
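
For step 1, part of the clean-up can be scripted before the manual proofreading starts. The following is only a rough Python sketch, not anyone's tested workflow: it assumes page numbers sit on lines of their own and that a running header is a short line repeated on many pages. The function name, `min_repeats` and `max_len` are made up for illustration. Note that short lines differing only in digits are treated as the same line, so check the result against the scan.

```python
import re
from collections import Counter

# Assumption: a bare page number is a line holding only 1-4 digits.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def _key(line):
    """Normalise a line so 'THE ILIAD 12' and 'the iliad 13' compare equal."""
    return re.sub(r"[\d\s]+", " ", line).strip().lower()


def strip_page_furniture(text, min_repeats=5, max_len=60):
    """Drop bare page numbers and short lines that repeat many times.

    min_repeats and max_len are illustrative guesses, not tuned values.
    """
    lines = text.splitlines()
    # Count how often each short, non-empty line occurs (ignoring digits).
    counts = Counter(
        _key(line) for line in lines if _key(line) and len(line) <= max_len
    )
    kept = []
    for line in lines:
        if PAGE_NUMBER.match(line):
            continue  # bare page number
        key = _key(line)
        if key and len(line) <= max_len and counts[key] >= min_repeats:
            continue  # probable running header or footer
        kept.append(line)
    return "\n".join(kept)
```

Restoring italics and fixing OCR errors (steps 2 and 3) still have to be done by eye against the page images.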

RWood · 12-04-2008, 07:51 PM

I have used pdflrf to convert DJVU files to LRF in the past. The problem you are facing is that it will render a full page of text as a full page graphic image that will not be readable on a 6" screen.

I have converted a number of books from the Internet Archive to LRF (part of the Harvard Classics series) and I had to bypass the provided TXT files due to the quality of their OCR. What I did was to edit the PDFs in Adobe Acrobat to remove the header and footer of each page and then convert the resultant file in ABBYY PDF Transformer 2.0. This yielded a far superior OCR that required perhaps only 10% of the editing time that the Internet Archive TXT files would have required.

BBRags · 12-05-2008, 12:31 PM

Thanks Patricia and RWood. The OCR errors and format irregularities in the TXT files are pretty bad. Plus, the pagination of the original book (from which the scan was taken) has been retained, so I need to go back and excise the footers. This is going to be LOTS of work.

RWood, did you use the Internet Archive's PDFs? Are they text or image-based? The PDF for my book is 50 MB. I didn't download it because my ISP meters bandwidth usage, and my connection is in constant use already.

BobC · 12-08-2008, 04:37 PM

The DJVU format files are quite complex, consisting of a number of "layers". The text contained in them is OCR'd from the original scans and, as noted earlier, is flaky in parts (some files I have downloaded have whole pages of OCR'd text missing). I know there is a foreground and a background image layer, as you can switch off the background layer for easier viewing.

You can extract the text from the DJVU, but you end up with the same text as you can download directly from the Internet Archive.
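
One way to do that extraction, assuming the DjVuLibre command-line tools are installed, is the `djvutxt` utility, which takes an input DJVU file and an output text file. Here is a minimal Python sketch that wraps it; the function names and file names are hypothetical, and the optional `--page=` range is only there in case you want to try a few pages first.

```python
import shutil
import subprocess


def djvutxt_command(djvu_path, txt_path, pages=None):
    """Build the djvutxt command line (hypothetical helper)."""
    cmd = ["djvutxt"]
    if pages:
        # e.g. pages="1-10" to extract only the first ten pages
        cmd.append(f"--page={pages}")
    cmd += [djvu_path, txt_path]
    return cmd


def extract_text(djvu_path, txt_path, pages=None):
    """Run djvutxt to dump the hidden text layer into txt_path."""
    # Assumption: DjVuLibre is installed and djvutxt is on the PATH.
    if shutil.which("djvutxt") is None:
        raise RuntimeError("djvutxt not found; install DjVuLibre first")
    subprocess.run(djvutxt_command(djvu_path, txt_path, pages), check=True)
```

Calling `extract_text("book.djvu", "book.txt")` would write whatever text layer the file carries, OCR errors included.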

As the PDFs are not text-searchable, I think they are just image containers.

The "text" layer in the DJVU enables you to search the text and display the corresponding image page.

Bottom line is - use either the DJVU and extract the text from it or just grab the .djvu.txt file depending on whether you want to manually edit the text to align with the original before converting to eBook format. Both versions suffer from page numbers, headers etc being interspersed with the text.

BobC



