Remove OCR background

Schauberger · 08-10-2012, 11:36 PM

Hi,

I recently OCR'd a scanned PDF book using the OCR function of PDF-XChange Viewer. However, it works by layering the recognized text over the original document, which displays quite awkwardly on my Sony 650. Therefore, I am searching for a way to remove the background layer/images of my document while preserving the formatting and TOC of my original file.
Thanks in advance!

Schauberger

DSpider · 08-11-2012, 09:14 AM

Never heard of "PDF-XChange Viewer" (for Windows I would recommend Adobe Reader or Foxit Reader), but for OCR-ing purposes you really should use something else. Try ABBYY FineReader.

https://www.mobileread.com/forums/sho...php?t=154638#2
https://www.mobileread.com/forums/sho...d.php?t=149448

Formatting (as in italics, bolds, etc) will be preserved, but the layout and TOC you'll probably have to do yourself.

Schauberger · 08-11-2012, 10:08 AM

Unfortunately, FineReader is quite costly, and PDF-XChange Viewer OCR'd my document quite well (not to mention that PDF-XChange Viewer is free!). However, the only problem I have with this software (and this seems to be the case with all OCR software) is that it overlays the recognized text on the original page, much like in the thread you referenced: https://www.mobileread.com/forums/sho...d.php?t=149448. What I am basically trying to do is copy the text from my document and preserve the page format.
Anyway, thanks for the input!

frostschutz · 08-11-2012, 10:38 AM

Can you upload an example document somewhere? Unfortunately, with PDF there are a thousand ways this could be (or could not be) done.

The best option, of course, would be if your software had a setting somewhere to turn off this unwanted behaviour.

Schauberger · 08-11-2012, 12:16 PM

Here is an example: http://www42.zippyshare.com/v/30355722/file.html

DSpider · 08-11-2012, 01:03 PM

Have you tried copy-pasting it? ... You really wanna work with that garbage? Granted, the font looks a bit condensed and some level of difficulty in OCR-ing may be there, but are you SURE you want to get rid of the image? It will reduce the filesize considerably, yes, but you also need to prepare for an awful lot of proofreading and the possibility that you may miss a comma or two and merge sentences together by accident.

Schauberger · 08-11-2012, 01:16 PM

What I uploaded is merely a generic example. My original multipage file was way too large to upload. And to answer your question: yes, I have tried copy/paste, but that method fails to preserve the page layout. And yes, I am sure that I want to get rid of the image.

DSpider · 08-11-2012, 04:29 PM

I meant copy-pasting in order to check the quality of the OCR. Don't copy-paste; that's just wrong. It's like saving an e-book in .txt format - there's pretty much no formatting/styling. Anyway, I meant that the OCR is pure garbage (which I guess is to be expected from a PDF viewer). Editing PDFs is also a bad idea. PDF is intended as a final destination, not something that can easily be converted - unless, of course, it's a tagged PDF. But that's a different story.

Schauberger · 08-11-2012, 05:07 PM

Quote:

Originally Posted by DSpider

I meant copy-pasting in order to check the quality of the OCR. Don't copy-paste; that's just wrong. It's like saving an e-book in .txt format - there's pretty much no formatting/styling. Anyway, I meant that the OCR is pure garbage (which I guess is to be expected from a PDF viewer). Editing PDFs is also a bad idea. PDF is intended as a final destination, not something that can easily be converted - unless, of course, it's a tagged PDF. But that's a different story.

The file I previously uploaded was NOT my original file. I'm sorry I uploaded such a bad example. Here is a sample of my ORIGINAL file.

http://www48.zippyshare.com/v/33850336/file.html

The OCR quality of this file is pretty good.

Schauberger

PS - I am forced to edit my original file, because my ereader alternates between displaying the original pages, and the OCR'd text.

Jellby · 08-12-2012, 04:05 AM

You could try one of these:

http://www.nitropdf.com/help/delete_pdf_images.htm
http://stackoverflow.com/questions/6...t-only-in-java
http://stackoverflow.com/questions/6...mages-from-pdf
http://www.aspose.com/docs/display/p...m+the+PDF+File

With pdftk you can uncompress the PDF, leaving you with a file whose page contents are plain text that you can edit, if you understand the PDF syntax... it should be possible to remove the images there, or make them transparent, or move them off the page...
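
As a rough illustration of the hand edit described above (a sketch only, not a tested recipe for this particular file): in an uncompressed page content stream, an image is painted by an operation of the form `/Name Do`. The Python below overwrites such operations with spaces of equal length, so stream lengths and cross-reference offsets stay valid. The function name `blank_image_draws` and the resource name `Im0` are hypothetical; the real names have to be read from the page's /XObject dictionary, and the code assumes every name you pass in refers to an image rather than a form XObject.

```python
import re
from typing import Iterable

# The input is assumed to come from an uncompressed PDF, e.g. produced by:
#   pdftk in.pdf output out.pdf uncompress

# Matches a paint-XObject operation such as b"/Im0 Do" in a content stream.
PAINT_XOBJECT = re.compile(rb"/([^\s/\[\]<>()]+)\s+Do\b")


def blank_image_draws(stream: bytes, image_names: Iterable[bytes]) -> bytes:
    """Overwrite '/Name Do' with spaces for every name in image_names."""
    names = set(image_names)

    def blank(match):
        if match.group(1) in names:
            # Same number of bytes, so offsets elsewhere in the file survive.
            return b" " * len(match.group(0))
        return match.group(0)

    return PAINT_XOBJECT.sub(blank, stream)


# Hypothetical page content: one full-page image plus one line of OCR text.
page = b"q\n2079 0 0 2840 0 0 cm\n/Im0 Do\nQ\nBT\n3 Tr\n/F1 12 Tf\n(Hello) Tj\nET\n"
cleaned = blank_image_draws(page, [b"Im0"])
```

The image data itself stays in the file, so the file size does not shrink; this only stops the page from drawing it. Run it on content streams only, not on the whole file, since binary image data could contain a matching byte sequence by chance.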

ApK · 08-24-2012, 03:24 PM

Quote:

Originally Posted by Schauberger

Hi,

I recently OCR'd a scanned PDF book using the OCR function of PDF-XChange Viewer. However, it works by layering the recognized text over the original document, which displays quite awkwardly on my Sony 650. Therefore, I am searching for a way to remove the background layer/images of my document while preserving the formatting and TOC of my original file.
Thanks in advance!

Schauberger

I use PDF-XChange free version as my reader but have never tried it for OCRing. I guess that's a PRO feature?
PDFs can be created as either text-over-image, or image-over-text, so there might be an option somewhere?

Also, in that file you linked, I did not see the text over the image (was I supposed to?) so perhaps it's just a viewer setting that is showing it on the reader?

Or am I misunderstanding the issue?

Schauberger · 08-24-2012, 07:37 PM

You only need the latest version of PDF-XChange Viewer for OCRing; you don't need the PRO version.

The file I linked to is formatted as text-over-image, and for some reason the program renders the text invisible. I know that text is there because I can copy/paste, although it doesn't preserve the formatting.

What I am trying to do is either extract the text and preserve formatting, or remove the image layer of my document.

Schauberger

willus · 09-15-2012, 11:07 AM

Quote:

Originally Posted by Schauberger

What I am trying to do is either extract the text and preserve formatting, or remove the image layer of my document.

Schauberger

Did you ever figure this out? What system are you running on? Mac? PC? I was able to make the OCR'd text visible and remove the bitmap from your PDF file using a couple of tools that I have (see attached). The OCR is excellent. PDF-XChange does a nice job.

frostschutz · 09-15-2012, 12:10 PM

Sorry for my lack of reply. This forum makes it hard to follow threads you've replied to...

(unless you want to be bombarded with notification mails)

Here's what you end up with when you remove the image from your sample.pdf. It's a blank page. But you can still select and copy text out of it. I haven't tested it on the eReader, but with some luck the text will show up when you use Reflow.

The original PDF is really just an image (2079x2840 px) with a text layer on top that uses an "invisible font". Not sure if it would be possible to make the font visible to get somewhat of an image in the original layout back - the result would not look good though.
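
To illustrate the "make the font visible" idea: one common way OCR tools hide a text layer is text rendering mode 3 (the `3 Tr` operator), which paints nothing. Whether PDF-XChange Viewer does it this way is an assumption here, not something confirmed in this thread; if the file relies on a glyphless font instead, this will not help. Under that assumption, a minimal sketch that flips mode 3 to mode 0 (normal fill) in an uncompressed content stream might look like this (`make_text_visible` is a hypothetical name):

```python
import re

# "3 Tr" sets text rendering mode 3 (invisible); mode 0 is ordinary filled text.
# The lookbehind keeps operands such as "13" or "0.3" from matching.
INVISIBLE_MODE = re.compile(rb"(?<![\d.+-])3(\s+Tr\b)")


def make_text_visible(stream: bytes) -> bytes:
    """Replace every '3 Tr' with '0 Tr', leaving the byte length unchanged."""
    return INVISIBLE_MODE.sub(rb"0\1", stream)


# Hypothetical text object as it might appear in an OCR'd page.
hidden = b"BT\n3 Tr\n/F1 12 Tf\n(Hello) Tj\nET\n"
shown = make_text_visible(hidden)
```

Even when this works, the text is drawn in whatever font the OCR layer references, positioned to match the scan, so the page is unlikely to look like the original.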

What could be done is a resized image, since the current one is too large for eReaders. I attached that too. Of course the quality is horrible.

Resizing was done with Ghostscript; removal of the image with qpdf (convert PDF to QDF) and vim. I'm sure there are better tools... but is this result useful at all?
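
For anyone wanting to repeat those two steps, here is a hedged sketch of how the tools might be invoked from Python. The exact options used for the attachments are not stated in this thread, so the flags below are one plausible choice, not a record of what was run; the function names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def ghostscript_downsample_cmd(src: str, dst: str) -> List[str]:
    # The pdfwrite device with the /ebook preset downsamples embedded images.
    # On Windows the executable is usually gswin64c rather than gs.
    return ["gs", "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/ebook",
            "-dNOPAUSE", "-dBATCH", "-dQUIET", f"-sOutputFile={dst}", src]


def qpdf_to_qdf_cmd(src: str, dst: str) -> List[str]:
    # QDF mode writes an uncompressed, normalized PDF meant for text editors.
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


def run_if_available(cmd: List[str]) -> bool:
    """Run cmd if its executable is on PATH; return True if it exited cleanly."""
    if shutil.which(cmd[0]) is None:
        return False
    return subprocess.run(cmd, check=False).returncode == 0


resize_cmd = ghostscript_downsample_cmd("sample.pdf", "sample-small.pdf")
qdf_cmd = qpdf_to_qdf_cmd("sample.pdf", "sample.qdf")
```

After hand-editing the QDF file, qpdf's companion tool `fix-qdf` can regenerate stream lengths and the cross-reference table (`fix-qdf sample.qdf > fixed.pdf`).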

If the goal is reflow you could just as well convert it to txt in the first place, as that's really all there is once you remove the image.
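
As a sketch of that plain-text route: the thread does not name a tool for it, so the example below swaps in `pdftotext` from Poppler/Xpdf as an assumption. Its `-layout` option tries to keep the physical arrangement of text on the page, which is about as much "formatting" as plain text can carry. The function name is hypothetical.

```python
from typing import List


def pdftotext_cmd(src: str, dst: str, keep_layout: bool = True) -> List[str]:
    """Build a pdftotext command line; -layout keeps the on-page arrangement."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")
    return cmd + [src, dst]


layout_cmd = pdftotext_cmd("sample.pdf", "sample.txt")
plain_cmd = pdftotext_cmd("sample.pdf", "sample.txt", keep_layout=False)
# To execute: subprocess.run(layout_cmd, check=True), if pdftotext is installed.
```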

copyrite · 09-15-2012, 12:36 PM

Quote:

Originally Posted by frostschutz

Sorry for my lack of reply. This forum makes it hard to follow threads you've replied to...

(unless you want to be bombarded with notification mails)

Easy peasy... click on User CP in the blue menu (just under the graphic of the forum name) or favorite this link; all of your subscribed threads are listed. You can subscribe without receiving notifications. To make that your default, click here and look for the Default Thread Subscription Mode section.
