#1 |
Guru
Posts: 672
Karma: 1109784
Join Date: Aug 2010
Device: Paperwhite
|
remove OCR from a PDF?
Is there any tool for removing the OCR element from PDFs?
I have a few scanned books with it, and while it's great for reading on the PC, these files tend to be very large and often cannot be cropped to fit an e-reader. |
#2 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
The question is a little difficult to understand.
Do you mean that it's a "Searchable Image" PDF with both a text layer and an image layer, where the text layer was generated by OCRing the image layer, and you want to remove the text layer? I'm sure there are ways of doing that (e.g., converting the PDF to some image format and then converting back), but I really can't see the point. The image layer is almost certainly responsible for nearly all of the file size, and I don't see how the text layer could interfere with cropping either.
Or did you want to remove the image layer instead? That would only be worth it if the OCR was near perfect, or if you were planning on cleaning it up manually, which is a huge time commitment. |
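To put rough numbers on the file-size point, here is a back-of-the-envelope sketch. The page size, scan resolution, compression ratio, and character counts are all assumed illustrative values, not measurements from any particular book.

```python
# Rough per-page size estimate for a "searchable image" PDF.
# Every number below is an assumption chosen for illustration.

PAGE_WIDTH_IN = 6.0      # assumed page width of the scanned book, inches
PAGE_HEIGHT_IN = 9.0     # assumed page height, inches
DPI = 300                # assumed scan resolution
BYTES_PER_PIXEL = 1      # 8-bit grayscale
IMAGE_COMPRESSION = 10   # assumed 10:1 compression of the page image
CHARS_PER_PAGE = 2500    # assumed amount of OCR text on one page
BYTES_PER_CHAR = 10      # assumed generous overhead for positioning data


def image_layer_bytes() -> int:
    """Estimated compressed size of one scanned page image."""
    pixels = int(PAGE_WIDTH_IN * DPI) * int(PAGE_HEIGHT_IN * DPI)
    return pixels * BYTES_PER_PIXEL // IMAGE_COMPRESSION


def text_layer_bytes() -> int:
    """Estimated uncompressed size of the hidden OCR text for one page."""
    return CHARS_PER_PAGE * BYTES_PER_CHAR


if __name__ == "__main__":
    image = image_layer_bytes()
    text = text_layer_bytes()
    share = 100 * text / (image + text)
    print(f"image layer: ~{image:,} bytes")
    print(f"text layer:  ~{text:,} bytes ({share:.1f}% of the page)")
```

Under these assumptions the text layer is only about 5% of each page, so stripping it would barely change the file size. Downsampling or recompressing the images is where any real savings would come from.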
#3 |
Guru
Posts: 672
Karma: 1109784
Join Date: Aug 2010
Device: Paperwhite
|
I assume it's the text layer keeping soPDF from working with it.
I probably just need to hold off reading my PDF books until I have a better machine for it. |
#4 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
As far as I know, SoPDF cannot crop scanned margins at all, text layer or no text layer. In general, it's not an ideal tool for scanned PDFs.
I'd try BRISS instead. |
#5 |
Guru
Posts: 672
Karma: 1109784
Join Date: Aug 2010
Device: Paperwhite
|
Ha, should have known. I was trying soPDF because I wanted to rotate it as well. Thanks for the tip. |
#6 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
You could use BRISS to crop it, and then run it through SoPDF afterward to rotate it. Good luck.
|
#7 |
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2010
Device: none
|
I don't know if this is exactly what you're looking for, but I often use online OCR conversion software to convert my PDF documents to MS Word or other text-style documents. My favorite converter is the one by Ricoh Innovations. You can try it, for free, at: http://beta.rii.ricoh.com/betalabs/c...ent-conversion
|
#8 |
Junior Member
Posts: 1
Karma: 10
Join Date: Oct 2011
Device: Booq
|
Not the OCR but the image
Hi,
I'd like to accomplish just the opposite of what soondai asks: to get rid of the image and retain just the plain text. Is that possible with some tool? And if not, does anybody know the structural details of the pages in scanned PDFs? I think it would be possible to write a small app using itextpdf. With kind regards, Alfred D. |
#9 |
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
|
Should've made a new topic instead of bumping a 2010 thread but whatever, I'll try to answer.
Editing PDFs is never a good idea. It is best to go back to the original format, make the changes, and export a fresh PDF.
That said, Adobe Acrobat, Foxit Phantom and similar editors can edit PDFs if you wish to get rid of the images. Or you could just copy-paste the text (right click - "Copy Text to Clipboard" or something like that) into a Word/LibreOffice document.
For extracting text from images or protected PDFs you can use ABBYY FineReader 11. It will load the PDF as a bunch of JPG images and OCR it. For best results you'll have to proofread the output, since it's not 100% accurate. There's also the issue of fonts: you can either match them with something similar or extract them from the PDF with FontForge or a similar tool.
Regarding the "structural details" of PDFs: there are two types of PDF files, plain and tagged. You'll find that the plain format is used in over 90% of PDFs. It is a real PITA to convert, since the content (text, images) is just a set of floating objects on a blank page. You can usually spot these right away: highlight the text and it selects as separate letters/numbers (or groups of them). Tagged PDFs, on the other hand, use formatting tags, so they usually convert more accurately, because the text is kept together as a line instead of as individual glyphs (or groups of glyphs), each with its own position (coordinates) on the page. |
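On the "structural details" question: in a typical scanned PDF with an OCR layer, each page's content stream paints the page image with a `Do` operator and then writes the recognized text inside `BT`/`ET` blocks, often in text rendering mode 3 (`3 Tr`, invisible text). Below is a minimal sketch of what removing one layer or the other amounts to. It works on a hand-written, already-decompressed stream. The sample stream, the helper names (`drop_images`, `drop_text`, `show_text`) and the line-based regexes are all assumptions for illustration. A real tool would need a PDF library (such as the iText mentioned above) to decode and rewrite the streams, and a proper tokenizer instead of regexes.

```python
import re

# A made-up, already-decompressed page content stream for illustration.
# Real files usually store this stream compressed.
SAMPLE_STREAM = """\
q
612 0 0 792 0 0 cm
/Im0 Do
Q
BT
3 Tr
/F1 10 Tf
72 720 Td
(Chapter One) Tj
ET
"""

# "/Name Do" paints an XObject (here, the scanned page image).
# Caveat: Do can also paint form XObjects, not only images.
PAINT_XOBJECT = re.compile(r"^/\S+\s+Do\s*\n", re.MULTILINE)
# "BT ... ET" brackets a text object.
TEXT_OBJECT = re.compile(r"^BT\n.*?^ET\n", re.MULTILINE | re.DOTALL)
# "3 Tr" selects text rendering mode 3: neither filled nor stroked.
INVISIBLE_MODE = re.compile(r"^3\s+Tr$", re.MULTILINE)


def drop_images(stream: str) -> str:
    """Remove every XObject paint operator, leaving the text objects."""
    return PAINT_XOBJECT.sub("", stream)


def drop_text(stream: str) -> str:
    """Remove every BT...ET text object, leaving the page image."""
    return TEXT_OBJECT.sub("", stream)


def show_text(stream: str) -> str:
    """Switch invisible text (mode 3) to ordinary filled text (mode 0)."""
    return INVISIBLE_MODE.sub("0 Tr", stream)


if __name__ == "__main__":
    print("--- text only ---")
    print(show_text(drop_images(SAMPLE_STREAM)))
    print("--- image only ---")
    print(drop_text(SAMPLE_STREAM))
```

Note the `72 720 Td` line before the text: that is the "position (coordinates) on the page" described above. Even with the image gone, the OCR text remains a set of positioned runs rather than flowing paragraphs, so a text-only PDF will not reflow on an e-reader.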