10-25-2011, 07:29 AM | #1 |
Enthusiast
Posts: 30
Karma: 10
Join Date: Jan 2010
Device: none
|
Question about OCRd djvu and pdf with ABBYY
Hello everyone,
I've got a nagging problem that I couldn't solve by browsing this section of the forum. So here it is: I have some books in .djvu format that I want to convert to .pdf PRESERVING THE OCR so that I can read and annotate them on iPad. Now, I can of course open the djvu with ABBYY FineReader: it will scan the whole document and read the text, usually doing a very good job. BUT. When I produce the OCRd .pdf, it will be a 'copy' of the original text, not the page-as-it-was. In other words: I don't want a 're-typed' copy of the book (also because ABBYY does an awful job with numbered footnotes), I want to keep the EXACT same look of the printed book (font, spacing... everything). I can achieve this if I simply 'print' the djvu file as a .pdf, of course. But if I do this, I lose the searchable text; it will just be an image.
So the question is: Is there any way to convert a djvu file, preserving BOTH the OCRd text (searchability) AND the overall look of the page? Thank you! |
10-25-2011, 10:00 AM | #2 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
AFAIK, FineReader opens DjVu files as JPG images (well, perhaps not JPG, but something close, maybe a compressed TIFF, I'm not really sure).
The quick and dirty method would be to have the OCR text under the image. In ABBYY FineReader 10 (haven't tried 11 yet), you can export PDFs with "Text under the page image". However, this wouldn't be any better than your average DjVu file with text underneath.
The quality method would be to properly proof-read after OCR-ing, which takes time and patience - it basically means you read the whole book once in ABBYY FineReader, and then read the final version once more using Foxit Reader, Adobe Reader, Apple's iBooks, etc. I think retouching should be done using either Word 2010 SP1 (for .docx) or LibreOffice 3.4.3 (for .rtf), the latest versions right now. Not some shoddy/half-assed text processing program.
If you want to preserve "the EXACT same look of the printed book", font matching can be a pain sometimes, especially since most publishers use commercial fonts. But it's totally worth it if you do it right. I mean, sure, you could settle for a close match using one of the websites below. Or, if you're willing to go the extra mile, track down the commercial variants (which I think is probably called piracy - but hey, you're not making any money off of it... are you?).
http://www.identifont.com
http://www.whatthefont.com
http://www.whatfontis.com
As a last resort, ask for someone's help:
http://typophile.com/typeid
Hope this helps! Some Word 2010 training videos wouldn't hurt either. Learn to make use of macros instead of editing character spacing by hand. It would save a lot (and I mean A LOT) of time. Hook them up to hotkeys instead of right clicking - Font - Advanced... etc. etc. |
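If you go the "text under the page image" route, it's worth checking that the exported PDF really kept a text layer before you copy it to the iPad. Here's a rough sketch of how that check could look in Python. It assumes poppler's `pdftotext` command-line tool is installed and on your PATH; the function names and the 50-letter threshold are just made up for illustration, not anything ABBYY provides.

```python
import shutil
import subprocess


def extract_text(pdf_path: str, last_page: int = 5) -> str:
    """Return the text of the first few pages, using poppler's pdftotext.

    Assumption: pdftotext is installed; "-l N" stops after page N and
    "-" as the output file sends the text to stdout.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler-utils) was not found on PATH")
    result = subprocess.run(
        ["pdftotext", "-l", str(last_page), pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def looks_searchable(text: str, min_letters: int = 50) -> bool:
    """Heuristic: an image-only PDF yields (almost) no letters at all.

    The threshold is arbitrary (hypothetical); tune it for your books.
    """
    letters = sum(1 for ch in text if ch.isalpha())
    return letters >= min_letters
```

You'd call it as `looks_searchable(extract_text("book.pdf"))`. A 'printed' image-only PDF should come back `False`, while an export with the text under the image should come back `True`. It only tells you that text is there, not that the OCR is any good.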
10-25-2011, 03:56 PM | #3 |
Enthusiast
Posts: 30
Karma: 10
Join Date: Jan 2010
Device: none
|
Thanks a lot, man, this was one thorough reply. I don't have time to try it out right now, but I certainly will tomorrow.
Thanks a lot! |
10-26-2011, 01:17 AM | #4 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
It is even easier. Save the file in ABBYY as PDF/A.
|
Tags |
abbyy, conversion, dvju, ocr, pdf |
|