12-02-2008, 06:21 AM
I'm trying to convert a book at the Internet Archive into an LRF file for my Sony 505. The text is available in TXT, DJVU and PDF formats. The TXT file isn't pure text, but contains some HTML. The name of the TXT file also contains the letter sequence DJVU, suggesting that it is related to the DJVU format. Meanwhile, the actual DJVU file is about 20x the size of the TXT file (about 20 MB) and the PDF file is around 50 MB.
Which is the best file format for converting to LRF? What is the best program for converting one of these formats to compact LRF?
12-04-2008, 07:37 PM
I believe that the text files are automatic conversions from the djvu files, and the file extension is preserved, even though the result is txt.
I convert the text files in exactly the same way as I convert any other text file. I drop the file into a Doc and then edit, before using either Book Designer or Calibre for the final conversion.
Unfortunately, the Internet Archive text files are of very poor quality, and require many hours of proofreading before conversion.
1. You will need to strip out the headers and footers.
2. You need to restore the italics.
3. You need to correct the OCR errors.
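Step 1 can be partly automated. Here is a rough Python sketch, not a tested tool: it assumes that page numbers sit on lines of their own and that running headers are short lines that recur many times once digits are ignored. The function name and the thresholds (`min_repeats`, `max_len`) are made up for illustration and will need tuning per book.

```python
import re
from collections import Counter

# A line holding nothing but a 1-4 digit number is treated as a page number.
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def normalize(line):
    """Drop digits and collapse whitespace so '12 THE ILIAD' and
    'THE ILIAD 14' compare as the same running header."""
    without_digits = re.sub(r"\d+", "", line)
    return re.sub(r"\s+", " ", without_digits).strip().upper()


def strip_headers(text, min_repeats=5, max_len=60):
    """Remove bare page numbers and short lines that repeat often.

    Heuristic only: a short line of real text that happens to recur
    min_repeats times (or lines differing only in digits) is removed too.
    """
    lines = text.splitlines()
    counts = Counter(
        normalize(l) for l in lines if l.strip() and len(l) <= max_len
    )
    repeated = {key for key, n in counts.items() if key and n >= min_repeats}

    kept = []
    for line in lines:
        if PAGE_NUMBER.match(line):
            continue  # bare page number
        if len(line) <= max_len and normalize(line) in repeated:
            continue  # running header or footer
        kept.append(line)
    return "\n".join(kept)
```

Compare the output against the original before trusting it, since the heuristic cannot tell a running header from a short line of genuine text that repeats. Italics and OCR errors (steps 2 and 3) still have to be fixed by hand.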
12-04-2008, 07:51 PM
I have used pdflrf (http://www.mobileread.com/forums/showthread.php?t=13135) to convert DJVU files to LRF in the past. The problem you are facing is that it will render a full page of text as a full page graphic image that will not be readable on a 6" screen.
I have converted a number of books from the Internet Archive to LRF (part of the Harvard Classics series) and I had to bypass the provided TXT files due to the quality of their OCR. What I did was to edit the PDFs in Adobe Acrobat to remove the header and footer of each page and then convert the resultant file with ABBYY PDF Transformer 2.0. This yielded a far superior OCR that required perhaps only 10% of the editing time that the Internet Archive TXT files would have required.
12-05-2008, 12:31 PM
Thanks Patricia and RWood. The OCR errors and format irregularities in the TXT files are pretty bad. Plus, the pagination of the original book (from which the scan was taken) has been retained, so I need to go back and excise the footers. This is going to be LOTS of work.
RWood, did you use the Internet Archive PDFs? Are they text or image-based? The PDF for my book is 50 MB. I didn't download it because my ISP meters bandwidth usage, and my connection is in constant use already.
The DJVU format files are quite complex, consisting of a number of "layers". The text contained in them is OCR'd from the original scans and, as noted earlier, is flaky in parts (some files I have downloaded have whole pages of OCR'd text missing). I know there are foreground and background image layers, as you can switch off the background layer for easier viewing.
You can extract the text from the DJVU, but you end up with the same text that you can download directly from the Internet Archive.
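For anyone who wants to try the extraction, one way is the `djvutxt` command-line tool that ships with DjVuLibre. The Python wrapper below is only a sketch: it assumes DjVuLibre is installed and `djvutxt` is on your PATH, and the function names and file names are placeholders of mine.

```python
import shutil
import subprocess


def djvutxt_command(djvu_path, txt_path):
    """Build the argument list: djvutxt <input.djvu> <output.txt>."""
    return ["djvutxt", djvu_path, txt_path]


def extract_text(djvu_path, txt_path):
    """Dump the hidden OCR text layer of a DJVU file to a text file.

    Assumes DjVuLibre is installed; raises if djvutxt cannot be found
    or if the tool exits with an error.
    """
    if shutil.which("djvutxt") is None:
        raise RuntimeError("djvutxt not found; install DjVuLibre first")
    subprocess.run(djvutxt_command(djvu_path, txt_path), check=True)


# Example (placeholder file names):
# extract_text("book.djvu", "book.txt")
```

The result has the same OCR quality as the hosted text, so this only saves anything if you already have the DJVU file on disk.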
As the PDFs are not text searchable, I think they are just image containers.
The "text" layer in the DJVU enables you to search the text and display the corresponding image page.
Bottom line: either extract the text from the DJVU or just grab the .djvu.txt file, depending on whether you want to manually edit the text to align with the original before converting to eBook format. Both versions suffer from page numbers, headers, etc. being interspersed with the text.