#1
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
Remove OCR background
I recently OCR'd a scanned PDF book using the OCR function of PDF-XChange Viewer. However, it works by layering the recognized text over the original document, which displays quite awkwardly on my Sony 650. I am therefore looking for a way to remove the background layer/images from my document while preserving the formatting and TOC of the original file.
Thanks in advance!
Schauberger
#2
Addict
Posts: 368
Karma: 298951
Join Date: Nov 2009
Location: Romania
Device: iPod touch 2G (16 GB)
Never heard of "PDF-XChange Viewer" (for Windows I would recommend Adobe Reader or Foxit Reader), but for OCR-ing purposes you really should use something else. Try ABBYY FineReader.
http://www.mobileread.com/forums/sho...php?t=154638#2
http://www.mobileread.com/forums/sho...d.php?t=149448
Formatting (as in italics, bold, etc.) will be preserved, but the layout and TOC you'll probably have to do yourself.
#3
Biotechnologist
Unfortunately, FineReader is quite costly, and PDF-XChange Viewer OCR'd my document quite well (not to mention that it is free!). The only problem I have with this software (and this seems to be the case with all OCR software) is that it overlays the recognized text on the original page, much as in the thread you referenced: http://www.mobileread.com/forums/sho...d.php?t=149448. What I am basically trying to do is copy the text out of my document while preserving the page format.
Anyway, thanks for the input!
#4
Linux User
Posts: 689
Karma: 1861395
Join Date: Sep 2010
Device: iriver Story HD
Can you upload an example document somewhere? Unfortunately, with PDF there are a thousand ways this could (or could not) be done.
The best option would of course be if your software had a setting somewhere to turn off this unwanted behaviour.
__________________
addicted to Fantasy
Last edited by frostschutz; 08-11-2012 at 10:40 AM.
#5
Biotechnologist
Here is an example: http://www42.zippyshare.com/v/30355722/file.html
#6
Addict
Have you tried copy-pasting it? ... You really wanna work with that garbage? Granted, the font looks a bit condensed and some level of difficulty in OCR-ing may be there, but are you SURE you want to get rid of the image? It will reduce the filesize considerably, yes, but you also need to prepare for an awful lot of proofreading and the possibility that you may miss a comma or two and merge sentences together by accident.
#7
Biotechnologist
What I have uploaded is merely a generic example. My original, multipage file was way too large to upload. And to answer your question: yes, I have tried copy/paste, but that method fails to preserve the page layout. And yes, I am sure that I want to get rid of the image.
Last edited by Schauberger; 08-11-2012 at 03:26 PM.
#8
Addict
I meant copy-pasting in order to check the quality of the OCR. Don't copy-paste; that's just wrong. It's like saving an e-book in .txt format - there's pretty much no formatting/styling. Anyway, I meant that the OCR is pure garbage (which I guess is to be expected from a PDF viewer). Editing PDFs is also a bad idea. PDF is intended as a final destination, not something that can easily be converted - unless, of course, it's a tagged PDF. But that's a different story.
#9
Biotechnologist
Quote:
http://www48.zippyshare.com/v/33850336/file.html
The OCR quality of this file is pretty good.
Schauberger
PS: I am forced to edit my original file because my e-reader alternates between displaying the original pages and the OCR'd text.
#10
frumious Bandersnatch
Posts: 5,141
Karma: 2474345
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon
You could try one of these:
http://www.nitropdf.com/help/delete_pdf_images.htm
http://stackoverflow.com/questions/6...t-only-in-java
http://stackoverflow.com/questions/6...mages-from-pdf
http://www.aspose.com/docs/display/p...m+the+PDF+File
With pdftk you can uncompress the PDF, leaving you with a text file that you can edit if you understand the language. It should be possible to remove the images there, make them transparent, or move them off the page.
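As a rough illustration of the "remove the images there" idea: once the PDF is uncompressed (for example with `pdftk in.pdf output out.pdf uncompress`), each page's content stream is readable text, and an image is painted by an operation of the form `/Name Do`. The Python sketch below drops those operations for a given set of image names. The function name, the sample stream, and the name `Im0` are made up for illustration; the real names would come from the page's /XObject resources, and a hand-edited file may still need its stream lengths and offsets repaired afterwards.

```python
import re

# Matches a "paint XObject" operation such as "/Im0 Do" in a content stream.
# Assumption: the stream is already uncompressed, and the pattern does not
# occur inside a text string (this sketch does not parse PDF syntax fully).
DO_OP = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def strip_image_draws(content: bytes, image_names: set) -> bytes:
    """Remove `/Name Do` operations whose Name is listed in image_names."""

    def drop_if_image(match):
        # Keep the operation untouched when it paints something else,
        # for example a form XObject.
        return b"" if match.group(1) in image_names else match.group(0)

    return DO_OP.sub(drop_if_image, content)


# Hypothetical page content: one full-page image plus an OCR text layer.
sample = b"q 2079 0 0 2840 0 0 cm /Im0 Do Q\nBT /F1 12 Tf 3 Tr (Hello) Tj ET\n"
print(strip_image_draws(sample, {b"Im0"}))
```

Only the paint operation is removed here; the image data itself would still sit in the file unless the object is deleted as well, so this alone would not shrink the file.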
#11
What did you call me?
Posts: 3,158
Karma: 15138003
Join Date: Feb 2010
Location: NJ, USA
Device: Kindle
Quote:
PDFs can be created as either text-over-image, or image-over-text, so there might be an option somewhere? Also, in that file you linked, I did not see the text over the image (was I supposed to?) so perhaps it's just a viewer setting that is showing it on the reader? Or am I misunderstanding the issue?
#12
Biotechnologist
You only need the latest version of PDF-XChange Viewer for OCRing; you don't need the PRO version.
The file I linked to is formatted as text-over-image, and for some reason the program renders the text invisible. I know the text is there because I can copy/paste it, although that doesn't preserve the formatting. What I am trying to do is either extract the text while preserving the formatting, or remove the image layer of my document.
Schauberger
Last edited by Schauberger; 08-24-2012 at 07:40 PM.
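For the "extract the text and preserve formatting" route, one option not mentioned in the thread is the `pdftotext` command-line tool (from Poppler/Xpdf), whose `-layout` option tries to keep the physical position of the text by padding with spaces. The Python sketch below only wraps that call; the function names and the file names `book.pdf` and `book.txt` are placeholders. The result is plain text, so fonts, bold/italics, and the TOC would not survive.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str, txt_path: str) -> list:
    """Build a pdftotext call; -layout keeps the physical page layout."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_with_layout(pdf_path: str, txt_path: str) -> bool:
    """Run pdftotext if it is installed; return False when it is missing."""
    if shutil.which("pdftotext") is None:
        return False
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return True


# Show the command that would be run for a hypothetical file.
print(" ".join(pdftotext_command("book.pdf", "book.txt")))
```

Because it reads the hidden text layer directly, this works whether or not the viewer shows the text on screen.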
#13
.
Posts: 255
Karma: 1191602
Join Date: Jun 2011
Location: California
Device: Kindle 2, iPad
Did you ever figure this out? What system are you running on? Mac? PC? I was able to make the OCR'd text visible and remove the bitmap from your PDF file using a couple of tools that I have (see attached). The OCR is excellent. PDF-XChange does a nice job.
Last edited by willus; 09-16-2012 at 12:57 AM. Reason: Used the more recent sample
#14
Linux User
Sorry for my lack of reply. This forum makes it hard to follow threads you've replied to (unless you want to be bombarded with notification mails).
Here's what you end up with when you remove the image from your sample.pdf. It's a blank page, but you can still select and copy text out of it. I haven't tested it on the eReader, but with some luck the text will show up when you use Reflow.
The original PDF is really just an image (2079x2840 px) with a text layer on top that uses an "invisible font". I'm not sure whether it would be possible to make the font visible to get something like the original layout back; the result would not look good, though.
What could be done is a resized image, since the current one is too large for eReaders. I attached that too. Of course the quality is horrible.
Resizing was done with GhostScript; the image was removed with qpdf (convert PDF to QDF) and vim. I'm sure there are better tools... but is this result useful at all? If the goal is reflow, you could just as well convert it to txt in the first place, as that's really all there is once you remove the image.
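On the question of making the text visible: OCR text layers are often hidden by setting text rendering mode 3 (the `3 Tr` operation, meaning neither fill nor stroke) rather than by the font itself. Whether this particular file does that is not confirmed in the thread, so the Python sketch below is only an assumption-laden illustration of the edit one might make by hand in a QDF file (for example one produced with `qpdf --qdf --object-streams=disable in.pdf out.qdf`). The function name and sample stream are hypothetical. If the file instead uses a font with empty glyphs, this change would have no visible effect.

```python
import re

# "3 Tr" selects text rendering mode 3 (invisible text).
# The lookbehind avoids touching operands such as "13 Tr" or "0.3 Tr".
INVISIBLE_TR = re.compile(rb"(?<![\d.])3(\s+)Tr\b")


def make_text_visible(content: bytes) -> bytes:
    """Switch rendering mode 3 (invisible) to mode 0 (fill) in a stream."""
    return INVISIBLE_TR.sub(rb"0\g<1>Tr", content)


# Hypothetical text block from an uncompressed content stream.
sample = b"BT /F1 12 Tf 3 Tr (Hello) Tj ET"
print(make_text_visible(sample))
```

The replacement keeps the byte count unchanged, which is convenient when editing by hand; for edits that do change lengths, qpdf ships a `fix-qdf` helper that repairs a hand-edited QDF file.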
#15
Guru
Posts: 637
Karma: 1167579
Join Date: Sep 2010
Location: Maryland
Device: iTouch, HTC Incredible 2, K3K, PW, Kindle Fire 2
Quote:
__________________
Outside of a dog, a book is a man's best friend. Inside of a dog it's too dark to read. - Groucho Marx
Last edited by copyrite; 09-15-2012 at 12:40 PM. Reason: 'tis a work in progress LOL