Question about OCRd djvu and pdf with ABBYY

BranMakMorn · 10-25-2011, 07:29 AM

Hello everyone,

I've got a nagging problem which I didn't manage to solve browsing this section of the forum. So here it is: I have some books in .djvu format that I want to convert to .pdf PRESERVING THE OCR so that I can read and annotate them on iPad.

Now, I can of course open the djvu with ABBYY FineReader: it will scan the whole document and read the text, usually doing a very good job.

BUT. When I produce the OCRd .pdf, it will be a 'copy' of the original text, not the page-as-it-was. In other words: I don't want to have a 're-typed' copy of the book (also because ABBYY does an awful job with numbered footnotes), I want to keep the EXACT same look of the printed book (font, spacing... everything).

I can achieve this if I simply 'print' the djvu file as a .pdf of course. But if I do this, I lose the searchable text, it will just be an image.

So the question would be: Is there any way to convert a djvu file, preserving BOTH OCRd text (searchability) AND general look?

Thank you!

DSpider · 10-25-2011, 10:00 AM

AFAIK, FineReader opens DjVu files as JPG images (well, perhaps not JPG, but something close, maybe a compressed TIFF, I'm not really sure).

The quick and dirty method would be to have the OCR text under the image. In ABBYY FineReader 10 (haven't tried 11 yet), you can export PDFs with the "Text under the page image" option.

However, this wouldn't be any better than your average DjVu file with text underneath. The quality method would be to properly proof-read after OCR-ing, which takes time and patience - it basically means you read the whole book once in ABBYY FineReader, and then the final version once more in Foxit Reader, Adobe Reader, Apple's iBooks, etc.

I think retouching should be done either using Word 2010 SP1 (for .docx), or LibreOffice 3.4.3 (for .rtf), the latest right now. Not some shoddy/half-assed text processing program.

If you want to preserve "the EXACT same look of the printed book", font matching can be a pain sometimes, especially since most publishers use commercial fonts. But it's totally worth it if you do it right.

I mean, sure, you could find a close match using one of the websites below. Or, if you're willing to go the extra mile, track down the commercial variants (which I think is probably called piracy - but hey, you're not making any money off of it... are you?).

http://www.identifont.com
http://www.whatthefont.com
http://www.whatfontis.com

As a last resort, ask for someone's help: http://typophile.com/typeid

Hope this helps! Some Word 2010 training videos wouldn't hurt either. Learn to make use of macros instead of editing character spacing by hand. It would save a lot (and I mean A LOT) of time. Hook them up to hotkeys instead of right clicking - Font - Advanced... etc. etc.

BranMakMorn · 10-25-2011, 03:56 PM

Thanks a lot man, this was one thorough reply. I don't have time to try it out right now, but I certainly will tomorrow.

Thanks a lot!

Toxaris · 10-26-2011, 01:17 AM

It is even easier. Save the file in ABBYY as PDF/A.
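
For anyone who wants to script the "text under the page image" idea instead of clicking through FineReader, here is a rough sketch. It is not what anyone in this thread used: it swaps ABBYY for two free command-line tools, DjVuLibre's ddjvu (renders the DjVu pages into an image-only PDF) and OCRmyPDF (adds an invisible text layer on top of the page images). Both are assumed to be installed and on the PATH. Note that this re-runs OCR on the page images rather than carrying over any text layer the DjVu already has, and the helper names and the "_image.pdf" suffix are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(djvu_path, out_pdf):
    """Return the two command lines as argv lists (nothing is executed here)."""
    djvu = Path(djvu_path)
    # Hypothetical naming choice: put the intermediate file next to the source.
    image_pdf = djvu.with_name(djvu.stem + "_image.pdf")
    return [
        # Step 1: render each DjVu page into an image-only PDF.
        ["ddjvu", "-format=pdf", str(djvu), str(image_pdf)],
        # Step 2: OCR the page images and add an invisible text layer.
        ["ocrmypdf", str(image_pdf), str(out_pdf)],
    ]


def convert(djvu_path, out_pdf):
    """Run both steps; raises if a tool is missing or a step fails."""
    for cmd in build_commands(djvu_path, out_pdf):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)
```

Calling convert("book.djvu", "book_searchable.pdf") should leave you with a PDF that looks like the scanned pages but can still be searched, which is the same trade-off as the FineReader export: the hidden text is only as good as the OCR.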
