MobileRead Forums - View Single Post - iLiad Teasing 2: extract snippets/tag PDFs

daudi · 04-23-2008, 11:38 AM

I've just added an option to extract images of the selected areas (these can be text or images in the document) and also create a simple HTML file to show the images and extracted text.

Code:

 -i                  extract images of selected areas. You need to have imagemagick
                     installed and on your path. If this proves to be useful I'll need
                     to add ways of specifying more parameters for image creation.
                     This also produces a rudimentary HTML file that links the images
                     and snippets (currently in the order they were made).

So, to extract the selected areas both as text and as images from a PDF container directory called test.pdf, you would run:

Code:

snippets -i test.pdf

I have attached an example of some extracted text and the images of the text as well as some extracted images to this post. I have updated the post above with the latest version of the code.
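
For anyone wondering what the rudimentary HTML file amounts to, here is a minimal sketch in Python. It is not the code snippets actually uses; the function name `build_index` and the `(image_file, text)` pair layout are assumptions made up for illustration. It simply lists each image with its extracted text, in the order given.

```python
import html
from urllib.parse import quote


def build_index(snippets):
    """Return a simple HTML page for (image_file, text) pairs, in the given order.

    Hypothetical helper: the real tool's output format may differ.
    """
    rows = []
    for image_file, text in snippets:
        # Percent-encode the file name so spaces survive in the src attribute,
        # then escape it for safe use inside HTML.
        src = html.escape(quote(image_file), quote=True)
        rows.append(
            f'<div><img src="{src}" alt="snippet">'
            f"<p>{html.escape(text)}</p></div>"
        )
    body = "\n".join(rows)
    return f"<html><body>\n{body}\n</body></html>\n"


if __name__ == "__main__":
    page = build_index([("snip1.png", "first snippet"), ("snip2.png", "second snippet")])
    print(page)
```

Escaping the text matters because extracted snippets can easily contain characters such as < or & that would otherwise break the page.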

Note that image extraction requires ImageMagick.
Note also that I still need to handle spaces in file paths, so this will not work with PDFs that have spaces in their file names. This should not be hard to fix; I just need to get around to doing it.
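
As a rough illustration of one way around the spaces problem, here is a sketch in Python. It is an assumption, not how snippets currently calls ImageMagick: the helper `build_crop_command` and its parameters are hypothetical. The idea is to pass the arguments to ImageMagick's `convert` as a list rather than as one shell string, so the shell never gets a chance to split a path at its spaces.

```python
import subprocess


def build_crop_command(pdf_path, out_path, width, height, x, y, page=0):
    """Build an ImageMagick `convert` argument list that crops one region.

    Hypothetical helper: the geometry is WIDTHxHEIGHT+X+Y in pixels, and
    "[page]" selects a single page of the PDF (0-based).
    """
    geometry = f"{width}x{height}+{x}+{y}"
    return ["convert", f"{pdf_path}[{page}]", "-crop", geometry, "+repage", out_path]


def extract_region(pdf_path, out_path, width, height, x, y, page=0):
    """Run the crop. Needs ImageMagick installed and on your path."""
    cmd = build_crop_command(pdf_path, out_path, width, height, x, y, page)
    # A list of arguments bypasses the shell, so spaces in paths need no quoting.
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the argument list; call extract_region() to actually run it.
    print(build_crop_command("my notes.pdf", "snippet 1.png", 200, 50, 10, 20))
```

If the command has to go through a shell instead, quoting each path (for example with `shlex.quote` in Python) should achieve the same thing.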