Old 08-10-2012, 11:36 PM   #1
Schauberger
Biotechnologist
Posts: 38
Karma: 499330
Join Date: Jun 2009
Device: 1st Gen Kindle; Sony PRS-T1
Remove OCR background

Hi,

I recently OCR'd a scanned PDF book using the OCR function of PDF-XChange Viewer. However, it works by layering the recognized text over the original document, which displays quite awkwardly on my Sony 650. I am therefore looking for a way to remove the background layer/images from my document while preserving the formatting and TOC of my original file.
Thanks in advance!


Schauberger
Old 08-11-2012, 09:14 AM   #2
DSpider
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
Never heard of "PDF-XChange Viewer" (for Windows I would recommend Adobe Reader or Foxit Reader), but for OCR-ing purposes you really should use something else. Try ABBYY FineReader.

https://www.mobileread.com/forums/sho...php?t=154638#2
https://www.mobileread.com/forums/sho...d.php?t=149448


Formatting (as in italics, bolds, etc) will be preserved, but the layout and TOC you'll probably have to do yourself.
Old 08-11-2012, 10:08 AM   #3
Schauberger
Biotechnologist
Unfortunately, FineReader is quite costly, and PDF-XChange Viewer OCR'd my document quite well (not to mention that PDF-XChange Viewer is free!). However, the only problem I have with this software (and this seems to be the case with all OCR software) is that it overlays the recognized text on the original page, much as in the thread you referenced: https://www.mobileread.com/forums/sho...d.php?t=149448. What I am basically trying to do is copy the text from my document while preserving the page format.
Anyway, thanks for the input!
Old 08-11-2012, 10:38 AM   #4
frostschutz
Linux User
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
Can you upload an example document somewhere? Unfortunately, with PDF there are a thousand ways this could be (or could not be) done.

Best option would be of course if your software had a setting somewhere to turn off this unwanted behaviour.

Last edited by frostschutz; 08-11-2012 at 10:40 AM.
Old 08-11-2012, 12:16 PM   #5
Schauberger
Biotechnologist
Here is an example: http://www42.zippyshare.com/v/30355722/file.html
Old 08-11-2012, 01:03 PM   #6
DSpider
Evangelist
Have you tried copy-pasting it? ... You really wanna work with that garbage? Granted, the font looks a bit condensed and some level of difficulty in OCR-ing may be there, but are you SURE you want to get rid of the image? It will reduce the filesize considerably, yes, but you also need to prepare for an awful lot of proofreading and the possibility that you may miss a comma or two and merge sentences together by accident.
Old 08-11-2012, 01:16 PM   #7
Schauberger
Biotechnologist
What I uploaded is merely a generic example. My original multipage file was way too large to upload. And to answer your question: yes, I have tried copy/paste, but that method fails to preserve the page layout. And yes, I am sure that I want to get rid of the image.


Last edited by Schauberger; 08-11-2012 at 03:26 PM.
Old 08-11-2012, 04:29 PM   #8
DSpider
Evangelist
I meant copy-pasting in order to check the quality of the OCR. Don't copy-paste; that's just wrong. It's like saving an e-book in .txt format - there's pretty much no formatting/styling. Anyway, I meant that the OCR is pure garbage (which I guess is to be expected from a PDF viewer). Editing PDFs is also a bad idea. PDF is intended as a final destination, not something that can easily be converted - unless, of course, it's a tagged PDF. But that's a different story.
Old 08-11-2012, 05:07 PM   #9
Schauberger
Biotechnologist
Quote:
Originally Posted by DSpider View Post
I meant copy-pasting in order to check the quality of the OCR. Don't copy-paste; that's just wrong. It's like saving an e-book in .txt format - there's pretty much no formatting/styling. Anyway, I meant that the OCR is pure garbage (which I guess is to be expected from a PDF viewer). Editing PDFs is also a bad idea. PDF is intended as a final destination, not something that can easily be converted - unless, of course, it's a tagged PDF. But that's a different story.
The file I previously uploaded was NOT my original file. I'm sorry I uploaded such a bad example. Here is a sample of my ORIGINAL file.

http://www48.zippyshare.com/v/33850336/file.html

The OCR quality of this file is pretty good.


Schauberger


PS - I am forced to edit my original file because my e-reader alternates between displaying the original pages and the OCR'd text.
Old 08-12-2012, 04:05 AM   #10
Jellby
frumious Bandersnatch
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
You could try one of these:

http://www.nitropdf.com/help/delete_pdf_images.htm
http://stackoverflow.com/questions/6...t-only-in-java
http://stackoverflow.com/questions/6...mages-from-pdf
http://www.aspose.com/docs/display/p...m+the+PDF+File

With pdftk you can uncompress the pdf, leaving you with a text file that you can edit if you can understand the language... it should be possible to remove the images there, or make them transparent, or move them out of the page...
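As a rough illustration of the hand edit described above, here is a minimal Python sketch. It assumes the PDF has already been uncompressed (for example with `pdftk in.pdf output out.pdf uncompress`) and that the page image is painted by an operation of the form `/Im0 Do` in the page's content stream. The function name `strip_xobject_paints` and the XObject name `Im0` are made up for the example; the real name has to be looked up in the page's resources. Note that `Do` also paints form XObjects, and a plain regex cannot tell operators from text inside string literals, so this is a sketch and not a general solution.

```python
import re

# Matches an XObject paint operation such as "/Im0 Do" in an uncompressed
# content stream. The name pattern is a simplification of PDF name syntax.
PAINT_XOBJECT = re.compile(r"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def strip_xobject_paints(content: str, names: set[str]) -> str:
    """Remove '/Name Do' operations for the given (hypothetical) XObject names."""
    def replace(match: re.Match) -> str:
        # Drop the operation only if it paints one of the requested XObjects.
        return "" if match.group(1) in names else match.group(0)

    return PAINT_XOBJECT.sub(replace, content)


# Made-up content stream: one image paint, then one line of text.
sample = "q 499 0 0 682 0 0 cm /Im0 Do Q\nBT /F1 10 Tf 72 700 Td (Hello) Tj ET\n"
cleaned = strip_xobject_paints(sample, {"Im0"})
print(cleaned)
```

Editing a stream by hand changes its length, so the stored `/Length` values and the cross-reference table will probably no longer match, and the edited file may need to go through a tool that rebuilds them before every viewer accepts it.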
Old 08-24-2012, 03:24 PM   #11
ApK
Award-Winning Participant
Posts: 7,316
Karma: 67862884
Join Date: Feb 2010
Location: NJ, USA
Device: Kindle
Quote:
Originally Posted by Schauberger View Post
Hi,

I recently OCR'd a scanned PDF book using the OCR function of PDF-XChange Viewer. However, it works by layering the recognized text over the original document, which displays quite awkwardly on my Sony 650. I am therefore looking for a way to remove the background layer/images from my document while preserving the formatting and TOC of my original file.
Thanks in advance!


Schauberger
I use PDF-XChange free version as my reader but have never tried it for OCRing. I guess that's a PRO feature?
PDFs can be created as either text-over-image, or image-over-text, so there might be an option somewhere?

Also, in that file you linked, I did not see the text over the image (was I supposed to?) so perhaps it's just a viewer setting that is showing it on the reader?

Or am I misunderstanding the issue?
Old 08-24-2012, 07:37 PM   #12
Schauberger
Biotechnologist
You only need the latest version of PDF-XChange Viewer for OCRing; you don't need the PRO version.

The file I linked to is formatted as text-over-image, and for some reason the program renders the text invisible. I know that text is there because I can copy/paste, although it doesn't preserve the formatting.

What I am trying to do is either extract the text and preserve formatting, or remove the image layer of my document.

Schauberger

Last edited by Schauberger; 08-24-2012 at 07:40 PM.
Old 09-15-2012, 11:07 AM   #13
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
Quote:
Originally Posted by Schauberger View Post
What I am trying to do is either extract the text and preserve formatting, or remove the image layer of my document.

Schauberger
Did you ever figure this out? What system are you running on? Mac? PC? I was able to make the OCR'd text visible and remove the bitmap from your PDF file using a couple tools that I have (see attached). The OCR is excellent. PDF X-change does a nice job.
Attached Files
File Type: pdf sample_visible_ocr_text_only.pdf (90.4 KB, 417 views)

Last edited by willus; 09-16-2012 at 12:57 AM. Reason: Used the more recent sample
Old 09-15-2012, 12:10 PM   #14
frostschutz
Linux User
Sorry for my lack of reply. This forum makes it hard to follow threads you've replied to... (unless you want to be bombarded with notification mails)

Here's what you end up with when you remove the image from your sample.pdf. It's a blank page. But you can still select and copy text out of it. I haven't tested it on the eReader, but with some luck the text will show up when you use Reflow.

The original PDF is really just an image (2079x2840 px) with a text layer on top that uses an "invisible font". Not sure if it would be possible to make the font visible to get somewhat of an image in the original layout back - the result would not look good though.
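To illustrate what making the hidden text visible could look like, here is a minimal Python sketch. It assumes the OCR layer is hidden through text rendering mode 3 (`3 Tr`, which the PDF specification defines as neither filled nor stroked), which is one common way to do it. If the "invisible font" above is instead a font without visible glyphs, changing the mode would not help. The function name `make_text_visible` and the sample stream are made up for the example, and the input is assumed to be an uncompressed content stream.

```python
import re

# "3 Tr" selects text rendering mode 3 (neither fill nor stroke: invisible).
# The lookbehind avoids touching numbers that merely end in 3, e.g. "13 Tr".
INVISIBLE_MODE = re.compile(r"(?<![\d.])3(\s+)Tr\b")


def make_text_visible(content: str) -> str:
    """Switch invisible text (mode 3) to filled text (mode 0)."""
    return INVISIBLE_MODE.sub(r"0\g<1>Tr", content)


# Made-up text object as it might appear in an OCR layer.
hidden = "BT 3 Tr /F1 10 Tf 72 700 Td (Hello) Tj ET"
visible = make_text_visible(hidden)
print(visible)
```

Even where this works, the visible text is drawn with whatever font and glyph widths the OCR tool chose for alignment, so the result would likely look rough, as noted above.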

What could be done is a resized image, since the current one is too large for eReaders. I attached that too. Of course the quality is horrible.

Resizing was done with GhostScript; removal of the image with qpdf (convert pdf to qdf) and vim. I'm sure there are better tools... but is this result useful at all?
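The exact commands are not given in this thread, so the following Python sketch only shows one plausible way to invoke the two tools. The option choices are assumptions: `-dPDFSETTINGS=/ebook` is a Ghostscript preset that downsamples images, and `--qdf --object-streams=disable` asks qpdf for an editable QDF file. The helper names and file names are hypothetical. The script only builds and prints the command lines; it does not run them.

```python
import shlex


def ghostscript_downsample_cmd(src: str, dst: str, preset: str = "/ebook") -> list[str]:
    """Build a Ghostscript command that rewrites a PDF with downsampled images."""
    # On Windows the executable is usually gswin64c or gswin32c instead of gs.
    return ["gs", "-sDEVICE=pdfwrite", "-dNOPAUSE", "-dBATCH",
            f"-dPDFSETTINGS={preset}", f"-sOutputFile={dst}", src]


def qpdf_to_qdf_cmd(src: str, dst: str) -> list[str]:
    """Build a qpdf command that writes an uncompressed, text-editable QDF file."""
    return ["qpdf", "--qdf", "--object-streams=disable", src, dst]


for cmd in (ghostscript_downsample_cmd("sample.pdf", "sample-resized-image.pdf"),
            qpdf_to_qdf_cmd("sample.pdf", "sample.qdf")):
    print(shlex.join(cmd))
```

After a QDF file has been edited in a text editor, qpdf's `fix-qdf` helper can regenerate the stream lengths and cross-reference table (`fix-qdf sample.qdf > fixed.pdf`).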

If the goal is reflow you could just as well convert it to txt in the first place, as that's really all there is once you remove the image.
Attached Files
File Type: pdf sample-without-image.pdf (34.1 KB, 401 views)
File Type: pdf sample-resized-image.pdf (106.6 KB, 369 views)
Old 09-15-2012, 12:36 PM   #15
copyrite
Wizard
Posts: 1,814
Karma: 4985051
Join Date: Sep 2010
Location: Maryland
Device: ...lots! ;) mostly reading on a Kindle Voyage
Quote:
Originally Posted by frostschutz View Post
Sorry for my lack of reply. This forum makes it hard to follow threads you've replied to... (unless you want to be bombarded with notification mails)
Easy peasy... click on User CP in the blue menu (just under the graphic of the forum name) or favorite this link; all of your subscribed threads are listed. You can subscribe without receiving notifications, to make that your default click here, look for the Default Thread Subscription Mode section.


Last edited by copyrite; 09-15-2012 at 12:40 PM. Reason: 'tis a work in progress LOL
All times are GMT -4.