05-12-2009, 01:20 PM
I have a PDF ebook (mostly text) that seems to have been OCRed from a scan, but the OCRed text appears to be superimposed on the actual scan. So there seems to be an image layer on top of the text (and the file is quite big as a consequence).
Is there any way to get rid of the image part and keep just the text?
05-24-2009, 01:07 AM
Short of simply selecting and copying the text into a WinWord or Txt Document, there is another way that MIGHT work:
Convert the PDF to DOC or RTF. RTF can actually embed images, so this only helps if the converter drops them or offers a text-only export, but it might keep the formatting otherwise.
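If copying by hand is too tedious for a whole book, the same "just grab the text" idea can probably be scripted. Below is a rough sketch in Python, assuming the third-party pypdf package is installed and that the PDF really does carry an OCR text layer. The function names `tidy` and `pdf_text_layer` are made up for this example, and the result is plain text only, so the layout is lost.

```python
import re


def tidy(text: str) -> str:
    """Collapse runs of spaces/tabs and trim each line of the extracted text."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines).strip()


def pdf_text_layer(pdf_path: str) -> str:
    """Return the text layer of every page; the page images are simply ignored."""
    # Imported here so that tidy() can be used without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty string for pages without a text layer.
    pages = [tidy(page.extract_text() or "") for page in reader.pages]
    return "\n\n".join(pages)


def save_text_layer(pdf_path: str, txt_path: str) -> None:
    """Write the extracted text to a UTF-8 .txt file."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(pdf_text_layer(pdf_path))
```

Something like `save_text_layer("book.pdf", "book.txt")` (both file names are placeholders) should then give a small text file. How clean it is depends entirely on how good the original OCR was.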
05-31-2009, 04:26 AM
I think the process will be a little complicated.
First you should edit your PDF file. I know a free PDF-to-Word converter (http://www.anypdftools.com/pdf-to-word.html#155) that may help you. Convert your PDF file into a Word document, then edit it and delete the image.
Second, you need to convert the Word file to a TXT file with your OCR tool.
Or you may have to convert the Word document back into a PDF file first. There are several ways to do that:
1. I think you can download the 2007 Microsoft Office Add-in: Microsoft Save as PDF. Once it is installed successfully, you can save the Word file as a PDF directly.
2. You can also use Google Docs and save as PDF.
There are also a lot of free Word-to-PDF creators here (http://www.anypdftools.com/free-word-to-pdf-creator.html#155).
Finally, convert your PDF file to a TXT document with your OCR tool.