08-10-2012, 11:36 PM | #1 |
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
|
Remove OCR background
Hi,
I recently OCR'd a scanned PDF book using the OCR function of the PDF-XChange Viewer. However, it works by layering the recognized text over the original document, which displays quite awkwardly on my Sony 650. I am therefore looking for a way to remove the background layer/images from my document while preserving the formatting and TOC of the original file. Thanks in advance! Schauberger |
08-11-2012, 09:14 AM | #2 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Never heard of "PDF-XChange Viewer" (for Windows I would recommend Adobe Reader or Foxit Reader), but for OCR-ing purposes you really should use something else. Try ABBYY FineReader.
https://www.mobileread.com/forums/sho...php?t=154638#2 https://www.mobileread.com/forums/sho...d.php?t=149448 Formatting (as in italics, bolds, etc) will be preserved, but the layout and TOC you'll probably have to do yourself. |
08-11-2012, 10:08 AM | #3 |
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
|
Unfortunately, FineReader is quite costly, and the PDF-XChange Viewer OCR'd my document quite well (not to mention that the PDF-XChange Viewer is free!). However, the only problem I have with this software (and this seems to be the case with all OCR software) is that it overlays the recognized text over the original page, much like in the thread you referenced: https://www.mobileread.com/forums/sho...d.php?t=149448. What I am basically trying to do is copy the text from my document and preserve the page format.
Anyway, Thanks for the input! |
08-11-2012, 10:38 AM | #4 |
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
|
Can you upload an example document somewhere? Unfortunately, with PDF there are a thousand ways this could (or could not) be done.
Best option would be of course if your software had a setting somewhere to turn off this unwanted behaviour. Last edited by frostschutz; 08-11-2012 at 10:40 AM. |
08-11-2012, 12:16 PM | #5 |
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
|
Here is an example: http://www42.zippyshare.com/v/30355722/file.html
|
08-11-2012, 01:03 PM | #6 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Have you tried copy-pasting it? ... You really wanna work with that garbage? Granted, the font looks a bit condensed and some level of difficulty in OCR-ing may be there, but are you SURE you want to get rid of the image? It will reduce the filesize considerably, yes, but you also need to prepare for an awful lot of proofreading and the possibility that you may miss a comma or two and merge sentences together by accident.
|
08-11-2012, 01:16 PM | #7 |
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
|
What I have uploaded is merely a generic example. My original multipage file was way too large to upload. And to answer your question, yes, I have tried copy/paste, but that method fails to preserve the page layout. And yes, I am sure that I want to get rid of the image.
Last edited by Schauberger; 08-11-2012 at 03:26 PM. |
08-11-2012, 04:29 PM | #8 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
I meant copy-pasting in order to check the quality of the OCR. Don't copy-paste; that's just wrong. It's like saving an e-book in .txt format - there's pretty much no formatting/styling. Anyway, I meant that the OCR is pure garbage (which I guess is to be expected from a PDF viewer). Editing PDFs is also a bad idea. PDF is intended as a final destination, not something that can easily be converted - unless, of course, it's a tagged PDF. But that's a different story.
|
08-11-2012, 05:07 PM | #9 | |
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
|
Quote:
http://www48.zippyshare.com/v/33850336/file.html The OCR quality of this file is pretty good. Schauberger PS - I am forced to edit my original file, because my ereader alternates between displaying the original pages, and the OCR'd text. |
|
08-12-2012, 04:05 AM | #10 |
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
You could try one of these:
http://www.nitropdf.com/help/delete_pdf_images.htm
http://stackoverflow.com/questions/6...t-only-in-java
http://stackoverflow.com/questions/6...mages-from-pdf
http://www.aspose.com/docs/display/p...m+the+PDF+File
With pdftk you can uncompress the pdf, leaving you with a text file that you can edit if you can understand the language... it should be possible to remove the images there, or make them transparent, or move them out of the page... |
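The pdftk route can be sketched roughly as follows. This is only a sketch, not a tested recipe for this particular file: the file names are placeholders, and `pdftk_uncompress_cmd`, `uncompress` and `count_image_xobjects` are hypothetical helper names. The only pdftk feature relied on is its `uncompress` output option; the last helper simply counts image XObject dictionaries in the uncompressed file, so you know how many images you would have to deal with by hand.

```python
import re
import shutil
import subprocess


def pdftk_uncompress_cmd(src, dst):
    """Build the pdftk command that writes an uncompressed copy of src to dst."""
    return ["pdftk", src, "output", dst, "uncompress"]


def uncompress(src, dst):
    """Run pdftk, assuming it is installed and on PATH."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk not found on PATH")
    subprocess.run(pdftk_uncompress_cmd(src, dst), check=True)


def count_image_xobjects(pdf_bytes):
    """Count dictionaries declaring /Subtype /Image in an uncompressed PDF.

    This is a plain byte search, so treat the number as a hint only.
    """
    return len(re.findall(rb"/Subtype\s*/Image\b", pdf_bytes))


if __name__ == "__main__":
    # Placeholder file names; replace with your own.
    uncompress("book.pdf", "book.unc.pdf")
    with open("book.unc.pdf", "rb") as fh:
        print(count_image_xobjects(fh.read()), "image object(s) found")
```

Once you know where the images are, the uncompressed file can be opened in a text editor; how safe it is to edit by hand depends on the file.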
08-24-2012, 03:24 PM | #11 | |
Award-Winning Participant
Posts: 7,316
Karma: 67862884
Join Date: Feb 2010
Location: NJ, USA
Device: Kindle
|
Quote:
PDFs can be created as either text-over-image, or image-over-text, so there might be an option somewhere? Also, in that file you linked, I did not see the text over the image (was I supposed to?) so perhaps it's just a viewer setting that is showing it on the reader? Or am I misunderstanding the issue? |
|
08-24-2012, 07:37 PM | #12 |
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
|
You only need the latest version of the PDF-XChange Viewer for OCRing; you don't need the PRO version.
The file I linked to is formatted as text-over-image, and for some reason the program renders the text invisible. I know that text is there because I can copy/paste, although it doesn't preserve the formatting. What I am trying to do is either extract the text and preserve formatting, or remove the image layer of my document. Schauberger Last edited by Schauberger; 08-24-2012 at 07:40 PM. |
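For the "extract the text and preserve formatting" half of the question, one option not mentioned in the thread so far is the `pdftotext` command-line tool (shipped with Poppler and Xpdf), whose `-layout` flag tries to keep the physical layout of the page in the text output. A minimal sketch, assuming `pdftotext` is installed; the helper names and file names are placeholders. It keeps line breaks and column positions at best, not fonts, italics or the TOC.

```python
import shutil
import subprocess


def pdftotext_layout_cmd(src, dst):
    """Build a pdftotext command that tries to keep the page layout."""
    return ["pdftotext", "-layout", src, dst]


def extract_text(src, dst):
    """Run pdftotext, assuming it is installed and on PATH."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_layout_cmd(src, dst), check=True)


if __name__ == "__main__":
    # Placeholder file names; replace with your own.
    extract_text("sample.pdf", "sample.txt")
```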
09-15-2012, 11:07 AM | #13 |
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
Did you ever figure this out? What system are you running on? Mac? PC? I was able to make the OCR'd text visible and remove the bitmap from your PDF file using a couple tools that I have (see attached). The OCR is excellent. PDF X-change does a nice job.
Last edited by willus; 09-16-2012 at 12:57 AM. Reason: Used the more recent sample |
09-15-2012, 12:10 PM | #14 |
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
|
Sorry for my lack of reply. This forum makes it hard to follow threads you've replied to... (unless you want to be bombarded with notification mails)
Here's what you end up with when you remove the image from your sample.pdf. It's a blank page. But you can still select and copy text out of it. I haven't tested it on the eReader, but with some luck the text will show up when you use Reflow.
The original PDF is really just an image (2079x2840 px) with a text layer on top that uses an "invisible font". Not sure if it would be possible to make the font visible to get somewhat of an image in the original layout back - the result would not look good though.
What could be done is a resized image, since the current one is too large for eReaders. I attached that too. Of course the quality is horrible.
Resizing was done with Ghostscript; removal of the image with qpdf (convert pdf to qdf) and vim. I'm sure there are better tools... but is this result useful at all? If the goal is reflow you could just as well convert it to txt in the first place, as that's really all there is once you remove the image. |
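The qpdf-and-editor step could also be scripted instead of done by hand in vim. The sketch below is an illustration under assumptions, not the exact procedure used for the attachments: it assumes the page content stream has one operator per line (as in qpdf's QDF output), that the image is painted by a line of the form `/Name Do`, and that the text is hidden via text rendering mode 3 (`3 Tr`). If the invisibility comes from a glyphless font instead, changing the mode will not make anything appear. `strip_images_and_show_text` is a hypothetical helper name.

```python
import re

# A line that only paints an XObject, e.g. b"/Im0 Do".
# Note: Do also paints form XObjects, not just images.
PAINT_XOBJECT = re.compile(rb"^\s*/[^\s/]+\s+Do\s*$")
# "3 Tr" selects text rendering mode 3 (invisible text).
INVISIBLE_MODE = re.compile(rb"(?<![\d.])3(\s+)Tr\b")


def strip_images_and_show_text(content):
    """Drop XObject paint lines and switch invisible text to fill mode (0 Tr)."""
    kept = []
    for line in content.split(b"\n"):
        if PAINT_XOBJECT.match(line):
            continue  # skip the line that draws the scanned page image
        kept.append(INVISIBLE_MODE.sub(rb"0\g<1>Tr", line))
    return b"\n".join(kept)


if __name__ == "__main__":
    stream = b"q\n2079 0 0 2840 0 0 cm\n/Im0 Do\nQ\nBT\n3 Tr\n/F1 12 Tf\n(Hello) Tj\nET"
    print(strip_images_and_show_text(stream).decode("ascii"))
```

To get an editable file in the first place, `qpdf --qdf --object-streams=disable sample.pdf sample.qdf` writes the QDF form; after editing, qpdf's `fix-qdf sample.qdf > fixed.pdf` repairs stream lengths and offsets. Apply the function to page content streams only, not to the whole file.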
09-15-2012, 12:36 PM | #15 | |
Wizard
Posts: 1,814
Karma: 4985051
Join Date: Sep 2010
Location: Maryland
Device: ...lots! ;) mostly reading on a Kindle Voyage
|
Quote:
Last edited by copyrite; 09-15-2012 at 12:40 PM. Reason: 'tis a work in progress LOL |
|