Old 04-23-2008, 11:38 AM   #10
daudi
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
Still ugly, but does more

I've just added an option to extract images of the selected areas (these can be text or images in the document) and also create a simple HTML file to show the images and extracted text.

Code:
 -i                  extract images of selected areas. You need to have imagemagick
                     installed and on your path. If this proves to be useful I'll need
                     to add ways of specifying more parameters for image creation.
                     This also produces a rudimentary HTML file that links the images
                     and snippets (currently in the order they were made).
So, to extract both the text and images of the selected areas from a PDF container directory called test.pdf, you would run:

Code:
snippets -i test.pdf
I have attached an example to this post: some extracted text, the images of that text, and some extracted images. I have also updated the post above with the latest version of the code.
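
For anyone curious what a rudimentary HTML file linking images and snippets might look like, here is a minimal Python sketch. It is not the code in snippets.zip: the function name `build_index` and the file names in the usage note are hypothetical, and the only thing taken from the post is that entries appear in the order they were made.

```python
import html
from pathlib import Path
from urllib.parse import quote


def build_index(snippets):
    """Return a simple HTML page from (image_path, text) pairs.

    Entries are written in the order given, mirroring the
    "order they were made" behaviour described in the post.
    """
    parts = ["<html>", "<body>"]
    for image_path, text in snippets:
        # Percent-encode the file name so spaces survive in the src attribute.
        src = html.escape(quote(Path(image_path).name), quote=True)
        parts.append(f'<p><img src="{src}" alt="snippet image"></p>')
        # Escape the extracted text so characters like < and & display literally.
        parts.append(f"<p>{html.escape(text)}</p>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)
```

Calling `build_index([("snippet-001.png", "First selected passage")])` returns the page as a string, which can then be written next to the images.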

Note that image extraction requires ImageMagick.
Note also that spaces in file paths are not handled yet, so this will not work with PDFs that have spaces in their file names. This should not be hard to fix; I just need to get around to it.
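
One common way to handle both points is sketched below in Python. It is an illustration only, not how snippets currently calls ImageMagick: the function names, the crop box, and the assumption that the program is named `convert` are all mine (newer ImageMagick releases ship it as `magick`, and on Windows `convert` may be an unrelated system tool). Passing the command as a list of arguments avoids shell quoting, so paths with spaces work unchanged.

```python
import shutil
import subprocess


def find_convert():
    """Return the full path to a program named 'convert' on PATH, or None."""
    return shutil.which("convert")


def crop_command(convert, page_image, box, out_file):
    """Build an argument list that crops one region out of a page image.

    box is (width, height, x, y) in pixels, with non-negative offsets.
    """
    width, height, x, y = box
    geometry = f"{width}x{height}+{x}+{y}"
    # Each argument is its own list item, so spaces in paths need no quoting.
    return [convert, page_image, "-crop", geometry, "+repage", out_file]


def run_crop(page_image, box, out_file):
    """Crop a region with ImageMagick, failing clearly if it is not installed."""
    convert = find_convert()
    if convert is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    subprocess.run(crop_command(convert, page_image, box, out_file), check=True)
```

For example, `run_crop("my book/page 1.png", (400, 120, 50, 300), "my book/snippet 1.png")` would pass both paths through intact, spaces included.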
Attached Files
File Type: zip snippets.zip (89.9 KB, 490 views)

Last edited by daudi; 04-24-2008 at 03:16 AM. Reason: fixed typo