remove OCR from a PDF?

soondai · 11-14-2010, 02:05 AM

Is there any tool for removing the OCR element from PDFs?

I have a few scanned books with it, and while it's great for reading on the PC, these files tend to be very large and often cannot be cropped to fit an e-reader.

frabjous · 11-14-2010, 11:07 PM

The question is a little difficult to understand.

Do you mean that it's a "Searchable Image" PDF with both a text layer and an image layer, where the text layer was generated by OCRing the image layer, and you want to remove the text layer?

I'm sure there are ways of doing that (e.g., converting the PDF to some image format and then converting it back), but I really can't see the point. The image layer is almost certainly responsible for nearly all of the file size. And I don't see how the text layer could interfere with cropping either.

Or did you want to remove the image layer instead? That would only be worth it if the OCR was near perfect, or you were planning on cleaning it up manually, which is a huge time commitment.
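For anyone who does want to drop the text layer, here is a minimal sketch of the idea rather than a finished tool. It assumes the page's content stream has already been decompressed into raw bytes, and that the OCR text sits in ordinary `BT ... ET` text objects (often drawn with `3 Tr`, the invisible rendering mode). The sample stream and the function name are made up for illustration, and the regex would be fooled by `BT` or `ET` appearing inside a string literal.

```python
import re

# One text object: a BT operator, anything up to the next ET operator.
# Assumption: the stream is already decompressed, and BT/ET never appear
# inside string literals (real files do not guarantee this).
TEXT_OBJECT = re.compile(rb"(?<![^\s])BT(?:\s.*?)?\sET(?![^\s])", re.DOTALL)


def strip_text_objects(content: bytes) -> bytes:
    """Return the content stream with every BT ... ET block removed."""
    return TEXT_OBJECT.sub(b"", content)


# Hypothetical page: one full-page scan plus one invisible OCR text run.
sample = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (Chapter One) Tj ET\n"
)

cleaned = strip_text_objects(sample)
print(cleaned)
```

The scan is still painted by `/Im0 Do`, so the page looks the same afterward, and the saving is only the few bytes of text operators, which is why this is unlikely to help with file size.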

soondai · 11-15-2010, 12:23 AM

I assume it's the text layer keeping soPDF from working with it.

I probably need to just hold off reading my PDF books until I have a better machine for it.

frabjous · 11-15-2010, 01:35 AM

As far as I know, SoPDF cannot crop scanned margins at all, text layer or no text layer. In general, it's not an ideal tool for scanned PDFs.

I'd try BRISS instead.

soondai · 11-15-2010, 02:49 AM

ha
should have known.
I was trying soPDF because I wanted to rotate it as well.

thanks for the tip

frabjous · 11-15-2010, 12:38 PM

You could use BRISS to crop it, and then run it through SoPDF afterward to rotate it. Good luck.

NatalieLyda · 12-07-2010, 05:51 PM

I don't know if this is exactly what you're looking for, but I often use online OCR conversion software to convert my PDF documents to MS Word or other text-style documents. My favorite converter is by Ricoh Innovations. You can try it, for free, at: http://beta.rii.ricoh.com/betalabs/c...ent-conversion

alfred_doeblin · 10-08-2011, 06:45 AM

Hi,

I'd like to accomplish just the opposite of what soondai asks for: to get rid of the image and retain just the plain text. Is that possible with some tool? And if not, does anybody know the structural details of the pages in scanned PDFs? I think it would be possible to write a small app using itextpdf.

With kind regards

Alfred D.

DSpider · 10-08-2011, 01:24 PM

Should've made a new topic instead of bumping a 2010 thread but whatever, I'll try to answer.

Editing PDFs is never a good idea. Best would be to go back to the original format, make the changes and export as a fresh PDF. Sure, Adobe Acrobat, Foxit Phantom (and similar) can edit PDFs if you wish to get rid of the images. Or you could just copy-paste the text (right click - "Copy Text to Clipboard" or something like that) into a Word/LibreOffice document.

For extracting text from images or protected PDFs you can use ABBYY FineReader 11. It will load the PDF as a bunch of JPG images and OCR it. For best results you'll have to proofread it, since it's not 100% accurate. There's also the issue with fonts... You can either match them with something similar or extract them from the PDF with FontForge or a similar tool.

Regarding the "structural details" of PDFs... There are two types of PDF files: plain PDF and tagged PDF. You'll find that the plain format is used in over 90% of PDFs. This is a real PITA to convert, since the content (text, images) is just a set of floating objects on a blank piece of paper. You can usually spot these right away if you highlight the text and it selects as separate letters/numbers (or groups of them). Tagged PDFs, on the other hand, carry structure tags (headings, paragraphs, reading order), so they usually convert more accurately: the converter can follow the tagged structure instead of guessing it from individual glyphs (or groups of glyphs), each with its own "position" (coordinates) on the page.
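To make the "floating objects" point concrete for the image-removal question, here is a rough sketch under the same kind of assumptions as any hand-rolled approach: the content stream is already decompressed, the scan is painted with a `/Name Do` operator, and the OCR text is drawn invisibly with `3 Tr`. The function names and the sample stream are hypothetical. Note that `Do` can also paint form XObjects, so this blunt version removes those too.

```python
import re

# "/Name Do" paints the XObject (usually the scanned image) named /Name.
PAINT_XOBJECT = re.compile(rb"/[^\s/\[\]()<>{}%]+\s+Do(?![^\s])")
# "3 Tr" selects the invisible text rendering mode; "0 Tr" is normal fill.
INVISIBLE_TEXT = re.compile(rb"(?<![^\s])3\s+Tr(?![^\s])")


def strip_xobject_paints(content: bytes) -> bytes:
    """Remove every '/Name Do' operator so no XObject is painted."""
    return PAINT_XOBJECT.sub(b"", content)


def make_text_visible(content: bytes) -> bytes:
    """Switch invisible text (3 Tr) to ordinary filled text (0 Tr)."""
    return INVISIBLE_TEXT.sub(b"0 Tr", content)


# Hypothetical page: one full-page scan plus one invisible OCR text run.
sample = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (Chapter One) Tj ET\n"
)

text_only = make_text_visible(strip_xobject_paints(sample))
print(text_only)
```

Two caveats: without the `3 Tr` change the page would simply look blank, and removing the paint operator alone does not shrink the file, because the image object is still stored until the PDF is rewritten without unreferenced objects.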

frabjous · 10-08-2011, 01:42 PM

For something free and open source, you could try PDFreflow, which uses the pdftohtml tool from Poppler as a backend. (Poppler also includes a pdftotext tool.)
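If plain text is all that's wanted, a small wrapper around Poppler's `pdftotext` might look like the sketch below. It assumes `pdftotext` is installed and on the PATH; the helper names and file names are placeholders, and `-layout` is the tool's option for keeping the original physical layout as far as it can.

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf: Path, txt: Path, keep_layout: bool = True) -> list[str]:
    """Build the pdftotext command line without running it."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the page's physical layout
    cmd += [str(pdf), str(txt)]
    return cmd


def extract_text(pdf: Path, txt: Path) -> bool:
    """Run pdftotext if it is available; return False when it is not installed."""
    if shutil.which("pdftotext") is None:
        return False
    subprocess.run(pdftotext_command(pdf, txt), check=True)
    return True


# Placeholder file names, shown only to illustrate the resulting command.
print(pdftotext_command(Path("book.pdf"), Path("book.txt")))
```

On a scanned book this only extracts whatever the OCR layer contains, so the output is as good or as bad as the OCR was.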
