10-25-2011, 07:29 AM   #1
BranMakMorn
Question about OCRd djvu and pdf with ABBYY

Hello everyone,

I've got a nagging problem that I couldn't solve by browsing this section of the forum. So here it is: I have some books in .djvu format that I want to convert to .pdf PRESERVING THE OCR so that I can read and annotate them on an iPad.

Now, I can of course open the djvu with ABBYY FineReader: it will scan the whole document and read the text, usually doing a very good job.

BUT. When I produce the OCRd .pdf, it will be a 'copy' of the original text, not the page-as-it-was. In other words: I don't want a 're-typed' copy of the book (also because ABBYY does an awful job with numbered footnotes), I want to keep the EXACT same look of the printed book (font, spacing... everything).

I can achieve this if I simply 'print' the djvu file as a .pdf of course. But if I do this, I lose the searchable text, it will just be an image.

So the question would be: Is there any way to convert a djvu file, preserving BOTH the OCRd text (searchability) AND the original page appearance?

Thank you!
10-25-2011, 10:00 AM   #2
DSpider
AFAIK, FineReader opens DjVu files as JPG images (well, perhaps not JPG, but something close, maybe a compressed TIFF, I'm not really sure).


The quick and dirty method would be to have the OCR text under the image. In ABBYY FineReader 10 (haven't tried 11 yet), you can export PDFs with Text under the page image.

However, this wouldn't be any better than your average DjVu file with text underneath. The quality method would be to proofread properly after OCR-ing, which takes time and patience: it basically means you read the whole book once in ABBYY FineReader, and then read the final version once more in Foxit Reader, Adobe Reader, Apple's iBooks, etc.
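
For anyone who wants the text-under-image route without FineReader, here is a rough sketch of the same idea with command-line tools. This is an assumption on top of the thread, not something anyone here has tested: it supposes DjVuLibre's ddjvu and OCRmyPDF are installed, and the function name and file names are made up for illustration. Note that it runs a fresh OCR pass on the rendered pages rather than carrying over any text already stored in the DjVu. The script only builds and prints the two commands, it does not run them.

```python
import shlex


def build_commands(djvu_path, pdf_path, tmp_pdf="pages.pdf"):
    """Return the two command lines as argument lists (hypothetical helper).

    Nothing is executed here; the caller decides whether to run them.
    """
    # Step 1: render the DjVu pages into an image-only PDF.
    # ddjvu ships with DjVuLibre; -format=pdf selects PDF output.
    render = ["ddjvu", "-format=pdf", djvu_path, tmp_pdf]

    # Step 2: OCR that PDF; OCRmyPDF keeps the page image and adds
    # an invisible text layer so the result is searchable.
    ocr = ["ocrmypdf", tmp_pdf, pdf_path]

    return [render, ocr]


if __name__ == "__main__":
    # Example file names are placeholders.
    for cmd in build_commands("book.djvu", "book_searchable.pdf"):
        print(shlex.join(cmd))
```

Run the printed commands in order from a shell. As with the FineReader export, the OCR text is only as good as the recognition, so the proofreading caveat still applies.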

I think retouching should be done using either Word 2010 SP1 (for .docx) or LibreOffice 3.4.3 (for .rtf), the latest versions right now. Not some shoddy/half-assed text processing program.


If you want to preserve "the EXACT same look of the printed book", font matching can be a pain sometimes, especially since most publishers use commercial fonts. But it's totally worth it if you do it right.

I mean, sure, you could find a close match using one of the websites below. Or, if you're willing to go the extra mile, track down the commercial variants (which I think is probably called piracy - but hey, you're not making any money off of it... are you?).

http://www.identifont.com
http://www.whatthefont.com
http://www.whatfontis.com

As a last resort, ask for someone's help: http://typophile.com/typeid

Hope this helps! Some Word 2010 training videos wouldn't hurt either. Learn to make use of macros instead of editing character spacing by hand. It would save a lot (and I mean A LOT) of time. Hook them up to hotkeys instead of right clicking - Font - Advanced... etc. etc.
10-25-2011, 03:56 PM   #3
BranMakMorn
Thanks a lot man, this was one thorough reply. I don't have time to try it out right now, but I certainly will tomorrow.

Thanks a lot!
10-26-2011, 01:17 AM   #4
Toxaris
It is even easier. Save the file in ABBYY as PDF/A.