10-08-2011, 12:24 PM   #9
DSpider
Should've made a new topic instead of bumping a 2010 thread but whatever, I'll try to answer.

Editing PDFs is never a good idea. The best approach is to go back to the original format, make the changes and export a fresh PDF. Sure, Adobe Acrobat, Foxit Phantom (and similar) can edit PDFs if you want to get rid of the images. Or you could just copy-paste the text (right click - "Copy Text to Clipboard" or something like that) into a Word/LibreOffice document.

For extracting text from images or protected PDFs you can use ABBYY FineReader 11. It will load the PDF as a bunch of JPG images and OCR them. For best results you'll have to proofread the output, since OCR isn't 100% accurate. There's also the issue with fonts... You can either match them with something similar or extract them from the PDF with FontForge or a similar tool.

Regarding the "structural details" of PDFs... There are two types of PDF files: plain PDF and tagged PDF. You'll find that the plain format is used in over 90% of PDFs. This is a real PITA to convert since the content (text, images) is just a set of floating objects on a blank piece of paper. You can usually spot these right away: when you highlight the text, it comes up as separate letters/numbers (or groups of them). Tagged PDFs, on the other hand, use formatting tags - meaning they usually convert more accurately, because the text is stored as whole lines instead of individual glyphs (or groups of glyphs), each with its own "position" (coordinates) on the page.
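
To give a rough idea of what a converter has to do with a plain PDF, here's a minimal Python sketch. It is not how any particular tool works. It assumes the text has already been pulled out as (x, y, text) fragments, with y growing upwards from the bottom of the page as in PDF coordinates. The function name group_into_lines, the y_tolerance value and the sample fragments are all made up for illustration.

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Rebuild lines of text from positioned fragments.

    fragments: iterable of (x, y, text) tuples (hypothetical input format).
    y_tolerance: how far apart two baselines may be and still count as
    the same line (an assumed value, tune it per document).
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    # Highest y first, because PDF's y axis points up the page.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within each line, order fragments left to right and glue them together.
    return [
        "".join(text for _, text in sorted(items, key=lambda item: item[0]))
        for _, items in lines
    ]


# Made-up fragments, roughly what highlighting a plain PDF reveals.
fragments = [
    (90.5, 700.3, "lo "),
    (72.0, 700.0, "Hel"),
    (108.0, 699.8, "world"),
    (110.0, 686.0, "line"),
    (72.0, 686.0, "Second "),
]
print(group_into_lines(fragments))
```

Note that this only glues fragments together in reading order. It doesn't guess missing spaces, columns or paragraphs, which is exactly the information a tagged PDF would already carry.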