iLiad Teasing 2: extract snippets/tag PDFs

daudi · 04-14-2008, 02:26 AM

I figured Rio shouldn't be the only tease around here so here's my contribution. It's a proof of concept that seems to be working, but is a tease because it uses a language that few are likely to use here (it uses R, a scripting language I am more familiar with than any other so can get things done with it).

Several people have expressed a desire for a way to highlight or annotate journal articles etc directly on the iliad. OK, we can annotate and underline using scribbles, but we can't currently do much with the results, except just read them; there's no way to search them for example.

Here's an approach that might get part way there. I can now mark up a PDF on the iliad using scribbles then run my script on the result (on my PC) and it will extract snippets of text that have been marked up. If I mark up the text with an L-shape the result is stored as a snippet; if I use an inverted L-shape the result is stored as a tag. See the attached PDF, which when processed produced the following results:

Code:

SNIPPETS:
Europe has already said it will press the G7 to demand
more disclosure from banks on their investments as the credit
crunch spreads from the financial sector to the household and
corporate sectors.

How bad will
a whole? This n
financial servi

A 
housi

property 

 Square Mile,

GDP 

 IMF

 FS

TAGS: 
chancellor

Some of this looks truncated weirdly, but that's because the selection area is deliberately short. See attached PDF for the marked-up page and it should make sense.
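The thread does not show how the script tells an L-shape from an inverted L-shape, so the following is only a rough sketch of one way it could be done, not the author's method. It assumes a scribble stroke is available as a list of (x, y) points in screen coordinates (y grows downward); the function name `classify_stroke` is hypothetical. The idea is that the horizontal arm of an L sits in the lower half of the stroke's bounding box, and in the upper half for an inverted L.

```python
def classify_stroke(points):
    """Guess whether a stroke is an L (snippet) or an inverted L (tag).

    points: list of (x, y) tuples in screen coordinates, y increasing downward.
    This is an illustrative heuristic, not taken from the original script.
    """
    ys = [y for _, y in points]
    mid_y = (min(ys) + max(ys)) / 2.0

    # x values of the points in the upper and lower halves of the bounding box
    upper_xs = [x for x, y in points if y < mid_y]
    lower_xs = [x for x, y in points if y >= mid_y]

    def spread(values):
        return max(values) - min(values) if values else 0.0

    # The half with the wider horizontal extent contains the horizontal arm.
    return "snippet" if spread(lower_xs) > spread(upper_xs) else "tag"


l_shape = [(10, 10), (10, 20), (10, 30), (20, 30), (30, 30)]
inverted_l = [(10, 30), (10, 20), (10, 10), (20, 10), (30, 10)]
print(classify_stroke(l_shape), classify_stroke(inverted_l))
```

A real implementation would probably also need to reject strokes that are neither shape (ordinary notes, underlines), which this sketch does not attempt.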

I use

Code:

pdftohtml -xml

to produce an xml representation of the PDF. This is easy to get for linux, might be available for Macs, but probably will be hard for people with windows. This is meant to be a proof of concept, and perhaps someone out there who knows java or C++ could use the poppler library or some other library to make a more platform independent solution.

TODO:

Use a different pen colour for mark-up. That way I can easily have notes and mark-up and keep them separate.
See if I can extract an image of the selected area. That way the same approach could be used to extract images or tables (which don't export well to text because they have no structure).
Automatically integrate the output with my bibliography tool.
Store the snippets and tags in the PDF container directory and create a search tool to search snippets and tags directly on the iliad (partially done).
Encourage someone to rewrite this using java or something so it can be easily used by other people, or better still see if someone (not me) could do it by modifying ipdf directly.
If no-one else will port it, then I'll eventually try to get around to moving it to jython.

nekokami · 04-14-2008, 10:15 AM

Wow! Please keep us posted. I'm very interested in this functionality. And I know a Java programmer who might help.

I've used R, but I had no idea you could do stuff like this with it.

pilotbob · 04-14-2008, 11:57 AM

Yes, it would be nice if all the publishers would put their content in DocBook XML (or other open format) which would make conversion to any format simple. I'm not sure why there is pursuit of new "standards" when there are already actually 3 that I can think of, ODF and OpenXML being the other two.

BOb

daudi · 04-14-2008, 12:15 PM

Quote:

Originally Posted by nekokami

Wow! Please keep us posted. I'm very interested in this functionality. And I know a Java programmer who might help.

Will do. If you could ask your java programmer if there is any java-only way of getting an XML representation of a PDF that would help to direct things. At the moment I am using pdftohtml to do that. pdftohtml is based on xpdf code and I think this is the same as or related to the poppler library. AFAICT these are all C (C++?). I did a quick search to see if there is a java implementation of poppler but could not find anything. Perhaps there is some other java library that can do this, but I don't know enough about java to begin to look around efficiently.

The key thing is that the output of pdftohtml -xml is an xml file with the co-ordinates of each line of text. If there is a way this can be done using java alone then it will be easy to implement the rest and have the whole thing platform independent.
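To make the point about coordinates concrete, here is a minimal Python sketch of reading that XML and picking out the text lines that overlap a rectangle. It assumes the usual `pdftohtml -xml` layout, where each `<page>` holds `<text>` elements carrying `top`, `left`, `width` and `height` attributes; the sample document and the function name `lines_in_box` are made up for illustration. It returns whole lines, so it does not reproduce the clipping to the selection width seen in the output above.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written sample in the shape pdftohtml -xml produces.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="300" height="14" font="0">Europe has already said it will press the G7</text>
<text top="120" left="50" width="300" height="14" font="0">more <b>disclosure</b> from banks</text>
<text top="400" left="50" width="300" height="14" font="0">unrelated line</text>
</page>
</pdf2xml>"""


def lines_in_box(xml_text, page, x0, x1, y0, y1):
    """Return the text of every line on `page` whose box overlaps the rectangle."""
    root = ET.fromstring(xml_text)
    hits = []
    for page_el in root.iter("page"):
        if int(page_el.get("number")) != page:
            continue
        for text_el in page_el.iter("text"):
            left = float(text_el.get("left"))
            top = float(text_el.get("top"))
            right = left + float(text_el.get("width"))
            bottom = top + float(text_el.get("height"))
            # Standard rectangle-overlap test
            if left < x1 and right > x0 and top < y1 and bottom > y0:
                # itertext() also collects text inside nested <b>/<i> tags
                hits.append("".join(text_el.itertext()))
    return hits


print(lines_in_box(SAMPLE_XML, page=1, x0=40, x1=200, y0=90, y1=140))
```

The scribble coordinates and the XML coordinates would first have to be brought into the same scale, which is left out here.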

Quote:

Originally Posted by nekokami

I've used R, but I had no idea you could do stuff like this with it.

I use R for my work almost every day (not necessarily very deeply) and so I'm more familiar with it than anything else. The R code to do this uses none of the things that make R special, I'm just using it as a scripting language. I've used R for all sorts of things unrelated to statistics. I suppose one good reason for using R here is that it is so easy to use the plotting facilities of R to plot the scribbles. This helped with debugging.

daudi · 04-22-2008, 09:00 AM

I have now moved my code to python and added some additional functionality.

<caveat>
I do not know how to write good python code, so for those who do know how to write good, pythonic code I would like to inform you that I will not be liable for any rehabilitation or psychotherapy fees that you may incur as a result of reading this code. Really, it is very, very nasty in places (that's "places" as in "most places"). If you have questions about the design you'll have to wait until I have designed it. This is a series of hacks that developed a life of their own.
</caveat>

At the moment I have not bothered too much about making this work on different operating systems. It depends on pdftohtml which is now part of the poppler project. There is a darwin port for mac and it apparently compiles out of the box on cygwin. Someone has created a windows version, but has wrapped a GUI around it, and I don't know if it can be used from the command line. If it can work from the command line this python code would still need to be tweaked a little (but not much).

I had to download and compile the latest version of the poppler-utils on my office machine (running ubuntu dapper) but was able to use the repo version at home on ubuntu feisty (or gutsy?). All versions report version 0.36 which is a bit of a pain, because they ain't the same AFAICT.

To use this script mark-up a PDF on the iliad. L-shapes select text as snippets, inverted L-shapes are intended to select single words as tags. The default is to use a different colour to the default pen colour so that you can make notes with one colour and select text with another.

Copy the PDF container folder to your PC (or connect via USB or samba) then run snippets on it.

The extracted text is saved in two files: snippets and tags which are stored within the container directory.

Make sure the script is executable with

Code:

chmod +x snippets

The options at the moment are:

Code:

snippets [-hbk] [-p <path-to-pdftohtml>] [-c <colour>] <directory>

 -h                  print the help message
 
 -c <colour>         the colour (color) that identifies strokes that markup areas
                     to be extracted as snippets or tags. Should be one of:
                     #000000, #555555, or the other two colours. Need to add them
                     to this list. The default (i.e. if you do not specify a colour with this option)
                     is #555555, which is the colour next to black (second from the
                     right when selecting colours on the iliad).

 -k                  keep the full xml output of pdftohtml. Default is to delete it.

 -p <path-to-pdftohtml>   path to pdftohtml (in case you need to specify a custom
                          version)
                     
 -b                  use a brute-force approach to cleaning up XML that is not
                     well-formed. If the XML output from pdftohtml is not well-formed
                     you'll probably get a "mismatched tag" error.

 <directory>         input container directory 


EXAMPLE: snippets -b -c "#000000" test.pdf

I have tested it on a few files. Selecting snippets works well, selecting tags is a little hit and miss at the moment so I tend to be generous with my inverted L-shapes to make sure I get the text.

PROBLEMS:

I have had a few problems along the way, and getting the script to this stage took me much longer than I anticipated. I had to learn about a number of things that were new to me (e.g. how to work with XML, the difference between MediaBox and CropBox in PDFs, etc.).

Some of the problems remain unresolved, or are dealt with in a brutal manner. In particular I have had problems with unicode and characters that appear in the XML output from pdftohtml that are below ascii code 32. The Guardian TopStories.pdf has a few of these (that appear as ^C which is ETX (???) and ^B etc). The -b option activates some code that attempts to deal with some problems, mainly tags that are in the wrong order, but I have not been able to get those ^C things sorted.
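For the control characters specifically, one possible workaround (a sketch, not what the script currently does) is to strip them before handing the text to the XML parser. XML 1.0 only allows tab, line feed and carriage return below code 32, so characters such as ^C (ETX, 0x03) and ^B (STX, 0x02) make the document not well-formed. The function name `strip_control_chars` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Everything below 0x20 except tab (0x09), LF (0x0a) and CR (0x0d)
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(xml_text):
    """Remove control characters that are illegal in XML 1.0."""
    return _CONTROL_CHARS.sub("", xml_text)


dirty = '<page><text top="1" left="2">Top\x03 Stories\x02</text></page>'
clean = strip_control_chars(dirty)
root = ET.fromstring(clean)  # parsing `dirty` directly would raise ParseError
print(root.find("text").text)
```

This simply drops the offending characters; if they stand in for real glyphs (ligatures, dashes) the extracted text will be missing those, so it is a workaround rather than a fix for the encoding problem.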

The script does run, however, on most of the PDFs I have tried. I am now going to actually start using the blasted thing and see what else needs fixing. My intention is to use this approach to extract text and have a version of the multi-directory search tool to search snippets and tags on the iliad.

I'd be grateful if people could try this out and provide feedback.
[Note: you'll want to remove the .txt extension from the script]

[Edit 2008-04-22] Minor edits to script. Note also that the script does not handle files and containers with spaces in the names. Some quoting is needed in several places.

[Edit 2008-04-23] Added option to extract images of selected regions (text or embedded images) using imagemagick. Also creates a simple HTML file for displaying the extracts.

nekokami · 04-22-2008, 09:48 AM

I probably won't be able to look at it until this weekend, but I'll try to give it a good workout then.

daudi · 04-22-2008, 10:23 AM

OK. I look forward to hearing your comments and ideas. A couple of things that I might start thinking about are:

Extracting tables: the plain text output makes table columns hard to work with and I find myself trying to line things up with regular expressions or in a spreadsheet. I am thinking that I could use another pen colour for tables and have vertical lines define the table columns. Still need to think that one through.
An option that allows the user to specify a portion of selected text to be written into the manifest, either as the title or in the description. That way you could just read an article, mark-up the title of the article and have it processed by the script to appear as the title or description.
An option to write the output as XML instead of plain text. That had better wait until I get a better grasp of XML
Instead of writing new files each time (as the script does at the moment) compare coordinates of snippets and only add/delete those that have changed. This would allow external additions to the tags file, for example.

One other thing to note: this rewrites the snippets and tags files each time. So, if you manually add other tags to the tags file (I was originally thinking about exporting keywords from jabref) they'll get wiped out next time you process the folder.

nekokami · 04-22-2008, 10:32 AM

Quote:

Originally Posted by daudi

OK. I look forward to hearing your comments and ideas. A couple of things that I might start thinking about are:

Extracting tables: the plain text output makes table columns hard to work with and I find myself trying to line things up with regular expressions or in a spreadsheet. I am thinking that I could use another pen colour for tables and have vertical lines define the table columns. Still need to think that one through.
An option that allows the user to specify a portion of selected text to be written into the manifest, either as the title or in the description. That way you could just read an article, mark-up the title of the article and have it processed by the script to appear as the title or description.
An option to write the output as XML instead of plain text. That had better wait until I get a better grasp of XML

One other thing to note: this rewrites the snippets and tags files each time. So, if you manually add other tags to the tags file (I was originally thinking about exporting keywords from jabref) they'll get wiped out next time you process the folder.

Being able to extract title and author would be very helpful. Couldn't the processing append to the file, rather than rewriting it?

daudi · 04-22-2008, 10:46 AM

Quote:

Originally Posted by nekokami

Being able to extract title and author would be very helpful. Couldn't the processing append to the file, rather than rewriting it?

Originally I did have it appending it, but then I ended up with duplications if I read an article again later and made further snippets. This could be handled by looking at the coordinates of the snippets already taken, but that started to look like too much hard work at the time (that was when I was still battling with bad XML and unicode issues). I'll add it to the list above and have a think about it.
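As a rough illustration of the coordinate-based approach, here is a small Python sketch that merges newly extracted snippets into an existing list and skips any whose page and bounding box are already present. It assumes each snippet is held as a tuple `(page, x0, x1, y0, y1, text)`; that layout and the names `snippet_key` and `merge_snippets` are invented for the example and are not taken from the script.

```python
def snippet_key(page, x0, x1, y0, y1):
    """Identify a snippet by its page and bounding box.

    Coordinates are rounded to one decimal place so that tiny differences
    in how the floats were written out do not create false 'new' snippets.
    """
    return (page, round(x0, 1), round(x1, 1), round(y0, 1), round(y1, 1))


def merge_snippets(existing, new):
    """Append to `existing` only those snippets in `new` not already present."""
    seen = {snippet_key(*s[:5]) for s in existing}
    merged = list(existing)
    for s in new:
        key = snippet_key(*s[:5])
        if key not in seen:
            seen.add(key)
            merged.append(s)
    return merged


old = [(2, 159.204963032, 346.022710981, 566.062090781, 644.585947887, "Conclusion ...")]
fresh = [
    (2, 159.204963032, 346.022710981, 566.062090781, 644.585947887, "Conclusion ..."),
    (3, 10.0, 20.0, 30.0, 40.0, "a snippet marked on a later reading"),
]
print(len(merge_snippets(old, fresh)))
```

Because existing entries are never rewritten, anything added to them by hand would survive reprocessing. Snippets whose scribble was erased on the device would need separate handling, which this sketch ignores.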

daudi · 04-23-2008, 11:38 AM

I've just added an option to extract images of the selected areas (these can be text or images in the document) and also create a simple HTML file to show the images and extracted text.

Code:

 -i                  extract images of selected areas. You need to have imagemagick
                     installed and on your path. If this proves to be useful I'll need
                     to add ways of specifying more parameters for image creation.
                     This also produces a rudimentary HTML file that links the images
                     and snippets (currently in the order they were made).

So, to extract text both as text and images of text from a pdf container directory called test.pdf you would do this:

Code:

snippets -i test.pdf

I have attached an example of some extracted text and the images of the text as well as some extracted images to this post. I have updated the post above with the latest version of the code.

Note that image extraction requires imagemagick.
Note also that I need to deal with spaces in file paths so this will not work with PDFs with spaces in the file names. This should not be hard to do, I just need to get around to doing it.
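One way to sidestep the quoting problem in Python, sketched here under the assumption that the script currently builds a shell command string, is to pass the arguments to `subprocess` as a list so that no shell is involved and spaces in paths need no escaping. The function names and the example paths are hypothetical; `pdftohtml -xml <PDF-file> <output-base>` is the standard invocation.

```python
import subprocess


def pdftohtml_command(pdf_path, out_base, pdftohtml="pdftohtml"):
    """Build the argument list for pdftohtml; each path is one list element."""
    return [pdftohtml, "-xml", pdf_path, out_base]


def run_pdftohtml(pdf_path, out_base, pdftohtml="pdftohtml"):
    """Run pdftohtml without going through a shell, so spaces are safe."""
    subprocess.run(pdftohtml_command(pdf_path, out_base, pdftohtml), check=True)


# The path with a space stays a single argument; nothing needs quoting.
cmd = pdftohtml_command("/media/iliad/My Papers/test.pdf/test.pdf", "/tmp/test out")
print(cmd)
```

The same list form should work for the imagemagick calls, since the problem is the shell splitting on spaces rather than anything specific to pdftohtml.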

daudi · 04-25-2008, 04:42 AM

Quote:

Originally Posted by nekokami

I probably won't be able to look at it until this weekend, but I'll try to give it a good workout then.

Here's something else to think about with this. It is possible to extend the multi-directory search script to search the snippets and create result entries that are more than just symlinks to the matching PDF. The result could contain the title from the original PDF and the description can contain the context from the snippet that matches, including the page number of the match. It is possible to have multiple matches per PDF. Moreover the manifest can be crafted to open the PDF at the page number of the match. I have tested each step necessary to do this, but have not got the whole lot together. All of this uses shell scripts and the stock awk and sed that come with the iliad so there is no software to install, just the search script.

The snippet extraction plus search tool means the following work flow is possible:

search for and download refs as PDFs (probably on a PC)
copy refs to the iliad (CF, USB, internal memory, wherever)
go find a nice comfy leather chair or bean bag, or sit in the park [important step]
read and mark-up relevant parts of the PDFs
go back to your PC and run the snippet script on the marked-up PDFs
review the HTML version of the snippets quickly on the PC while writing; and/or
search (possibly at a later date) the snippets on the iliad

Here's a screenshot of a match for the word "urban" in my test case. Both results are from the same original PDF, but relate to snippets from different pages. The page number of the match is in square brackets.

I am also starting to think about extending the logic of the searches and having a "combine" entry. It should be possible to make several searches and keep each set of results and then combine them (like the ovid command line syntax). This would mean you could:

search for urban [result set 1]
search for health [result set 2]
search for transport [result set 3]
search for migration [result set 4]
then combine them by changing the description of the (yet to be created) combine script to:
- (1 and 2 not 3)
- (1 OR 4) and 2
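The combine logic maps directly onto set operations. On the iliad itself this would have to be done with the stock shell tools, so the Python below is only an illustration of the intended semantics; the file names are invented, and reading "1 and 2 not 3" as "(1 and 2) and not 3" is an assumption.

```python
# Each result set holds the files that matched one search (made-up data).
results = {
    1: {"a.pdf", "b.pdf", "c.pdf"},   # urban
    2: {"b.pdf", "c.pdf", "d.pdf"},   # health
    3: {"c.pdf"},                     # transport
    4: {"d.pdf", "e.pdf"},            # migration
}

# (1 and 2 not 3): in both 1 and 2, but not in 3
first = (results[1] & results[2]) - results[3]

# (1 OR 4) and 2: in either 1 or 4, and also in 2
second = (results[1] | results[4]) & results[2]

print(sorted(first))
print(sorted(second))
```

Intersection (`&`) plays the role of "and", union (`|`) of "or", and set difference (`-`) of "not".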

I also need to give some thought to integration of the snippets with my main PC-based bibliography tool.

nekokami · 04-26-2008, 09:21 PM

Quote:

Originally Posted by daudi

The snippet extraction plus search tool means the following work flow is possible:

search for and download refs as PDFs (probably on a PC)
copy refs to the iliad (CF, USB, internal memory, wherever)
go find a nice comfy leather chair or bean bag, or sit in the park [important step]
read and mark-up relevant parts of the PDFs
go back to your PC and run the snippet script on the marked-up PDFs
review the HTML version of the snippets quickly on the PC while writing; and/or
search (possibly at a later date) the snippets on the iliad

Excellent!

Can we categorize snippets when we capture them? Or afterward, when reviewing them in the HTML version?

daudi · 04-27-2008, 03:00 AM

Quote:

Originally Posted by nekokami

Excellent!
Can we categorize snippets when we capture them? Or afterward, when reviewing them in the HTML version?

I can't think of a way to categorize them as we capture them. One thing that I see as being handy with this approach is that marking-up text for extraction is so simple and quick that it does not interfere with the flow of reading. I guess one possibility might be to link with Rio's tease (on Mac). It should not be hard to use the same approach to mark-up scribbles for extraction. They could then be run through the character recognition software. But that's something that I could not help out with (no mac).

The snippets could, however, easily be edited at a later stage as they have a very simple structure. Here's one:

Quote:

Page: 2 x: 159.204963032--346.022710981 y: 566.062090781--644.585947887
Conclusion?These data con?rm an inverse association between socioeconomic
status and the prevalence of type 2 diabetes in the middle years of life. This

These could easily be edited. It would make sense to have a simple structure to them, e.g. agree that categories should be on the line after the page number. Keeping it simple like this means that it is easy to write awk scripts to process them on the iliad. It would probably be relatively simple to create a small application to handle them on the PC (using python or java). I could imagine something that is able to display the hierarchy of documents on the iliad (mounted on USB or samba or from the CF card from the iliad), and when you open one up you see an entry for each page that has extracted text, and a preview of the text or image and a way to enter categories.

[BTW, in the extract above notice that the hyphen and 'fi' ligature have been converted to '?' because I need help to understand encoding schemes]
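To show how little is needed to read this format on the PC side, here is a small Python sketch that splits a snippets file into records. It assumes every record starts with a header line of the form shown above and that everything up to the next header is snippet text; how records are actually separated in the file is not stated in the thread, and the name `parse_snippets` is hypothetical.

```python
import re

# Matches e.g. "Page: 2 x: 159.2--346.0 y: 566.0--644.5"
HEADER = re.compile(
    r"^Page:\s*(\d+)\s+x:\s*([\d.]+)--([\d.]+)\s+y:\s*([\d.]+)--([\d.]+)\s*$"
)


def parse_snippets(text):
    """Split the contents of a snippets file into a list of record dicts."""
    records = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            page = int(match.group(1))
            x0, x1, y0, y1 = (float(g) for g in match.groups()[1:])
            records.append({"page": page, "x": (x0, x1), "y": (y0, y1), "text": []})
        elif records and line.strip():
            # Any non-blank line after a header belongs to that snippet
            records[-1]["text"].append(line)
    return records


sample = """Page: 2 x: 159.204963032--346.022710981 y: 566.062090781--644.585947887
Conclusion?These data con?rm an inverse association between socioeconomic
status and the prevalence of type 2 diabetes in the middle years of life. This
"""
for record in parse_snippets(sample):
    print(record["page"], record["x"], len(record["text"]))
```

If a categories line were added after the header, as suggested above, it could be recognised by a fixed prefix and stored in its own field instead of being appended to the text.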

daudi · 04-27-2008, 04:12 AM

It might not even need an application to handle snippets. jabref is very flexible and it is easy to add a file link within a bibliography reference that links to the snippet. That way you can easily open the snippet from the bibliography entry. You can't search the snippet from within jabref (yet) but I'll have a think about how that might be possible. As much as possible I would like to keep things integrated with my full bibliography manager.

In fact it might not be too hard at all. I think we could add a custom field for snippets to jabref and have a way to import/update that field from the snippets file in the PDF container folder. It should be possible to either create a custom import/export filter to keep the two in sync or create a script that does this. (The new version of jabref is going to have a plugin system that would make it easier for someone to create a java plugin that could integrate this more elegantly.)

Doing this would mean that it would be possible to use jabref to manage references and to do quite powerful searches (including searches of the snippets) and still have the ability to do searches on snippets directly on the iliad.

But, again, we need to be careful about what happens if the PDF is processed again, e.g. if it is read again at a later date, perhaps for a different purpose, and more text is marked-up. We'd need to keep track of existing snippets and not wipe out the categories that had been added externally. This can be done, but means more work.

daudi · 05-01-2008, 01:04 PM

I've added a snippet search tool to the multi-directory search tool. This allows you to search within snippets and produces results that show the title of the original file plus a couple of lines of the context of the match (in the description). The description starts with the page number in the original PDF where the matching snippet comes from. I was not able to make this open on the matching page (at least not nicely), but this allows you to see which page to jump to once you open the PDF (via the result set).
This tool uses the same config tool to decide where to search so you could set up directories of PDFs and then choose to only search within some. Here's an example of the result of a search.
