MobileRead Forums - View Single Post - iLiad Teasing 2: extract snippets/tag PDFs

daudi · 04-22-2008, 09:00 AM

I have now moved my code to python and added some additional functionality.

<caveat>
I do not know how to write good Python code, so for those who do know how to write good, pythonic code I would like to inform you that I will not be liable for any rehabilitation or psychotherapy fees that you may incur as a result of reading this code. Really, it is very, very nasty in places (that's "places" as in "most places"). If you have questions about the design you'll have to wait until I have designed it. This is a series of hacks that developed a life of their own.
</caveat>

At the moment I have not bothered too much about making this work on different operating systems. It depends on pdftohtml, which is now part of the poppler project. There is a Darwin port for Mac and it apparently compiles out of the box on Cygwin. Someone has created a Windows version but has wrapped a GUI around it, and I don't know whether it can be used from the command line. If it can, this Python code would still need to be tweaked a little (but not much).

I had to download and compile the latest version of the poppler-utils on my office machine (running ubuntu dapper) but was able to use the repo version at home on ubuntu feisty (or gutsy?). All versions report version 0.36 which is a bit of a pain, because they ain't the same AFAICT.

To use this script, mark up a PDF on the iLiad. L-shapes select text as snippets; inverted L-shapes are intended to select single words as tags. By default the script looks for strokes in a colour other than the default pen colour, so that you can make notes with one colour and select text with another.

Copy the PDF container folder to your PC (or connect via USB or samba) then run snippets on it.

The extracted text is saved in two files: snippets and tags which are stored within the container directory.

Make sure the script is executable with:

Code:

chmod +x snippets

The options at the moment are:

Code:

snippets [-hbk] [-p <path-to-pdftohtml>] [-c <colour>] <directory>

 -h                  print the help message
 
 -c <colour>         the colour (color) that identifies strokes that markup areas
                     to be extracted as snippets or tags. Should be one of:
                     #000000, #555555, or the other two colours. Need to add them
                     to this list. The default (i.e. if you do not specify a colour with this option)
                     is #555555, which is the colour next to black (second from the
                     right when selecting colours on the iliad).

 -k                  keep the full xml output of pdftohtml. Default is to delete it.

 -p <path-to-pdftohtml>   path to pdftohtml (in case you need to specify a custom
                          version)
                     
 -b                  use a brute-force approach to cleaning up XML that is not
                     well-formed. If the XML output from pdftohtml is not well-formed
                     you'll probably get a "mismatched tag" error.

 <directory>         input container directory 


EXAMPLE: snippets -b -c "#000000" test.pdf
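
For anyone wondering how a usage line like the one above maps onto code, here is a minimal sketch using Python's standard getopt module. It is not the code from the attached script: the function name parse_args, the dictionary keys, and the assumption that pdftohtml is found on the PATH unless -p is given are all made up for illustration. It is written for Python 3.

```python
import getopt

DEFAULT_COLOUR = "#555555"  # the default colour documented in the help text


def parse_args(argv):
    """Parse argv (without the program name) into a settings dict.

    Hypothetical helper: the names here are illustrative only.
    """
    # "p:" and "c:" take a value; h, b and k are simple switches.
    opts, rest = getopt.getopt(argv, "hbkp:c:")
    settings = {
        "help": False,
        "brute_force": False,
        "keep_xml": False,
        "pdftohtml": "pdftohtml",  # assumption: on PATH unless -p is given
        "colour": DEFAULT_COLOUR,
        "directory": None,
    }
    for flag, value in opts:
        if flag == "-h":
            settings["help"] = True
        elif flag == "-b":
            settings["brute_force"] = True
        elif flag == "-k":
            settings["keep_xml"] = True
        elif flag == "-p":
            settings["pdftohtml"] = value
        elif flag == "-c":
            settings["colour"] = value
    if rest:
        # The first non-option argument is the container directory.
        settings["directory"] = rest[0]
    return settings


# Same arguments as the EXAMPLE line above.
example = parse_args(["-b", "-c", "#000000", "test.pdf"])
```

Note that the colour has to be quoted on the command line, as in the EXAMPLE, because an unquoted # starts a comment in most shells.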

I have tested it on a few files. Selecting snippets works well, selecting tags is a little hit and miss at the moment so I tend to be generous with my inverted L-shapes to make sure I get the text.

PROBLEMS:

I have had a few problems along the way, and getting the script to this stage took me much longer than I anticipated. I had to learn about a number of things that were new to me (e.g. how to work with XML, the difference between MediaBox and CropBox in PDFs, etc.).

Some of the problems remain unresolved, or are dealt with in a brutal manner. In particular I have had problems with unicode and with characters below ASCII code 32 that appear in the XML output from pdftohtml. The Guardian TopStories.pdf has a few of these (they show up as ^C, which is ETX, ^B, which is STX, etc.). The -b option activates some code that attempts to deal with some of the problems, mainly tags that are in the wrong order, but I have not been able to get those ^C things sorted.
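
One possible way to deal with the low control characters is to strip them before the XML is parsed. XML 1.0 only allows tab, line feed and carriage return below code 32, so anything else in that range makes a strict parser give up. The sketch below is an assumption about how this could be done, not what the -b option currently does; the function name strip_control_chars and the sample element are invented for illustration, and it is written for Python 3.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 permits only tab (0x09), LF (0x0A) and CR (0x0D) below 0x20.
# Everything else in that range is matched here so it can be removed.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(text):
    """Return text with control characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Hypothetical sample, not real pdftohtml output: ETX (^C) and STX (^B)
# embedded in the text of an element.
sample = '<text top="10" left="20">Top\x03 Stories\x02</text>'

cleaned = strip_control_chars(sample)
root = ET.fromstring(cleaned)  # parses once the control characters are gone
```

This throws the offending characters away rather than trying to work out what they were meant to be, so it is only suitable if losing them does not matter for the extracted text.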

The script does run, however, on most of the PDFs I have tried. I am now going to actually start using the blasted thing and see what else needs fixing. My intention is to use this approach to extract text and have a version of the multi-directory search tool to search snippets and tags on the iliad.

I'd be grateful if people could try this out and provide feedback.
[Note: you'll want to remove the .txt extension from the script]

[Edit 2008-04-22] Minor edits to script. Note also that the script does not handle files and containers with spaces in the names. Some quoting is needed in several places.
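
A common way around the quoting problem is to avoid the shell altogether and pass the command to subprocess as a list, so that each path stays a single argument even if it contains spaces. The sketch below assumes that is an acceptable change; the helper names, the "my paper" container and the "extract" output name are hypothetical, and the pdftohtml invocation (the -xml switch followed by the input PDF and an output base name) should be checked against the installed version. It is written for Python 3.

```python
import os
import subprocess


def build_pdftohtml_command(pdftohtml, pdf_path, output_base):
    """Build the argument list for an XML-producing pdftohtml run.

    Hypothetical helper: each path is one list element, so no quoting
    is needed even when the path contains spaces.
    """
    return [pdftohtml, "-xml", pdf_path, output_base]


def run_pdftohtml(pdftohtml, pdf_path, output_base):
    """Run pdftohtml without a shell and return its exit status."""
    cmd = build_pdftohtml_command(pdftohtml, pdf_path, output_base)
    return subprocess.call(cmd)  # list form: no shell, no quoting issues


# Hypothetical container folder and file name, both containing spaces.
container = "my paper"
pdf = os.path.join(container, "my paper.pdf")
cmd = build_pdftohtml_command("pdftohtml", pdf, os.path.join(container, "extract"))
```

If a single command string really is needed somewhere (for example with os.system), shlex.quote can be applied to each path instead.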

[Edit 2008-04-23] Added option to extract images of selected regions (text or embedded images) using imagemagick. Also creates a simple HTML file for displaying the extracts.