Old 11-14-2010, 01:05 AM   #1
soondai
Guru
Posts: 672
Karma: 1109784
Join Date: Aug 2010
Device: Paperwhite
remove OCR from a PDF?

Is there any tool for removing the OCR element from PDFs?

I have a few scanned books with it, and while it's great for reading on the PC, these files tend to be very large and often cannot be cropped to fit an e-reader.
Old 11-14-2010, 10:07 PM   #2
frabjous
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
The question is a little difficult to understand.

Do you mean that it's a "Searchable Image" PDF with both a text layer and an image layer, where the text layer was generated by OCRing the image layer, and you want to remove the text layer?

I'm sure there are ways of doing that (e.g., converting the PDF to some other image format and then converting back), but I really can't see the point. The image layer is almost certainly responsible for nearly all of the file size, and I don't see how the text layer could interfere with cropping either.

Or did you want to remove the image layer instead? That would only be worth it if the OCR was near perfect, or you were planning on cleaning it up manually, which is a huge time commitment.
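
For anyone who does want to drop one layer or the other, here is a minimal sketch of one possible route. It assumes Ghostscript 9.23 or later is installed and callable as `gs` (on Windows the executable is usually named differently), and it relies on Ghostscript's `-dFILTERTEXT` and `-dFILTERIMAGE` switches. The function names and file names are made up for illustration, and the result is worth checking on a copy of the file first.

```python
import shutil
import subprocess

# Ghostscript switches that drop one class of page content (Ghostscript 9.23+).
_FILTERS = {"text": "-dFILTERTEXT", "image": "-dFILTERIMAGE"}


def ghostscript_filter_cmd(src, dst, drop="text"):
    """Build a Ghostscript command that rewrites src to dst without one layer."""
    if drop not in _FILTERS:
        raise ValueError("drop must be 'text' or 'image'")
    # -o sets the output file and implies batch, no-pause operation.
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", _FILTERS[drop], src]


def strip_layer(src, dst, drop="text"):
    """Run the command; raises if Ghostscript is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript ('gs') not found on PATH")
    subprocess.run(ghostscript_filter_cmd(src, dst, drop), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(ghostscript_filter_cmd("scan.pdf", "scan-notext.pdf")))
```

Calling `strip_layer("scan.pdf", "scan-notext.pdf")` would write a copy without the text layer, and passing `drop="image"` would keep only the text. Comparing the two output sizes is a quick way to see which layer actually accounts for the size of the file.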
Old 11-14-2010, 11:23 PM   #3
soondai
Guru
I assume it's the text layer keeping SoPDF from working with it.

I probably need to just hold off reading my PDF books until I have a better machine for it.
Old 11-15-2010, 12:35 AM   #4
frabjous
Wizard
As far as I know, SoPDF cannot crop scanned margins at all, text layer or no text layer. In general, it's not an ideal tool for scanned PDFs.

I'd try BRISS instead.
Old 11-15-2010, 01:49 AM   #5
soondai
Guru
Ha, I should have known.
I was trying SoPDF because I wanted to rotate it as well.

Thanks for the tip.
Old 11-15-2010, 11:38 AM   #6
frabjous
Wizard
You could use BRISS to crop it, and then run it through SoPDF afterward to rotate it. Good luck.
Old 12-07-2010, 04:51 PM   #7
NatalieLyda
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2010
Device: none
I don't know if this is exactly what you're looking for, but I often use online OCR conversion software to convert my PDF documents to MS Word or other text-style documents. My favorite converter is by Ricoh Innovations. You can try it, for free, at: http://beta.rii.ricoh.com/betalabs/c...ent-conversion
Old 10-08-2011, 05:45 AM   #8
alfred_doeblin
Junior Member
Posts: 1
Karma: 10
Join Date: Oct 2011
Device: Booq
Not the OCR but the image

Hi,

I'd like to accomplish just the opposite of what soondai asks for: to get rid of the image and retain just the plain text. Is that possible with some tool? And if not, does anybody know the structural details of the pages in scanned PDFs? I think it would be possible to write a small app using itextpdf.

With kind regards

Alfred D.
Old 10-08-2011, 12:24 PM   #9
DSpider
Evangelist
Posts: 450
Karma: 343115
Join Date: Nov 2009
Location: Romania
Device: PW2 2014
Should've made a new topic instead of bumping a 2010 thread but whatever, I'll try to answer.

Editing PDFs is never a good idea. Best would be to go back to the original format, make the changes and export as a fresh PDF. Sure, Adobe Acrobat, Foxit Phantom (and similar) can edit PDFs if you wish to get rid of the images. Or you could just copy-paste the text (right click - "Copy Text to Clipboard" or something like that) into a Word/LibreOffice document.

For extracting text from images or protected PDFs you can use ABBYY FineReader 11. It will load the PDF as a bunch of JPG images and OCR it. For best results you'll have to proofread it, since it's not 100% accurate. There's also the issue with fonts... You can either match them with something similar or extract them from the PDF with FontForge or a comparable tool.

Regarding the "structural details" of PDFs... There are two types of PDF files: plain PDF and tagged PDF. You'll find that the plain format is used in over 90% of PDFs. This is a real PITA to convert, since the content (text, images) is just floating objects on a blank piece of paper. You can usually spot these right away if you highlight the text and it turns out to be separate letters/numbers (or groups of them). Tagged PDFs, on the other hand, carry structure tags that describe the logical layout (paragraphs, headings, reading order) - meaning they usually convert more accurately than a page of individual glyphs (or groups of glyphs), each with its own "position" (coordinates) on the page.
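
To make the "floating objects" point concrete, here is a rough sketch of what a converter has to do with a plain PDF: take fragments that only carry a position and regroup them into lines. The `(x, y, text)` tuples, the `rebuild_lines` name and the tolerance value are all made up for illustration. Real extractors also have to deal with columns, rotated text and missing spaces.

```python
def rebuild_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text, top to bottom.

    Assumes PDF-style coordinates where y grows upward, so a larger y
    means higher on the page.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    # Visit fragments from the top of the page downward.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            # Close enough vertically: treat as part of the current line.
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order fragments left to right and glue them together.
    return ["".join(t for _, t in sorted(parts)) for _, parts in lines]


if __name__ == "__main__":
    # Hypothetical fragments, deliberately out of reading order.
    page = [
        (110, 700.0, "world"),
        (72, 686.0, "second line"),
        (90, 700.5, "lo "),
        (72, 700.0, "Hel"),
    ]
    for line in rebuild_lines(page):
        print(line)
```

With these sample fragments the sketch prints "Hello world" and then "second line". A tagged PDF spares the converter most of this guesswork because the reading order is recorded in the file.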
Old 10-08-2011, 12:42 PM   #10
frabjous
Wizard
For something free and open source, you could try PDFreflow, which uses Poppler's pdftohtml as a backend. (Poppler also includes a pdftotext tool.)
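
A minimal sketch of the pdftotext route, assuming Poppler's `pdftotext` is installed and on the PATH. The wrapper functions and file names are hypothetical. Note that pdftotext only reads text already present in the PDF (such as an existing OCR layer); it does not run OCR itself.

```python
import subprocess


def pdftotext_cmd(src, dst, keep_layout=True):
    """Build a pdftotext command that writes the text of src to dst."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [src, dst]
    return cmd


def extract_text(src, dst, keep_layout=True):
    """Run pdftotext; raises if the tool is missing or exits with an error."""
    subprocess.run(pdftotext_cmd(src, dst, keep_layout), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(pdftotext_cmd("scan.pdf", "scan.txt")))
```

Dropping `-layout` (`keep_layout=False`) tends to suit reflowed reading better, since the text then comes out in reading order rather than page-shaped.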
All times are GMT -4.