Old 04-14-2008, 02:26 AM   #1
daudi
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
Teasing 2: extract snippets/tag PDFs

I figured Rio shouldn't be the only tease around here, so here's my contribution. It's a proof of concept that seems to be working, but it's a tease because it's written in R, a scripting language that few people here are likely to use. I'm more familiar with R than with anything else, so I can get things done with it.

Several people have expressed a desire for a way to highlight or annotate journal articles etc. directly on the iLiad. OK, we can annotate and underline using scribbles, but we can't currently do much with the results except read them; there's no way to search them, for example.

Here's an approach that might get part way there. I can now mark up a PDF on the iLiad using scribbles, then run my script on the result (on my PC), and it will extract the snippets of text that have been marked up. If I mark up the text with an L-shape, the result is stored as a snippet; if I use an inverted L-shape, it is stored as a tag. See the attached PDF, which when processed produced the following results:
Code:
SNIPPETS:
Europe has already said it will press the G7 to demand
more disclosure from banks on their investments as the credit
crunch spreads from the financial sector to the household and
corporate sectors.

How bad will
a whole? This n
financial servi

A 
housi

property 

 Square Mile,

GDP 

 IMF

 FS

TAGS: 
chancellor
Some of this looks weirdly truncated, but that's because the selection areas are deliberately short. See the attached PDF for the marked-up page and it should make sense.
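
To make the L / inverted-L idea a bit more concrete, here is a minimal Python sketch (not the R script described in this post). It assumes a scribble stroke is available as a list of (x, y) points with y increasing down the page, and that the stroke's bounding box is the selection area. The real iLiad scribble format isn't described here, and the names `bounding_box` and `classify_stroke` are made up for illustration.

```python
def bounding_box(points):
    """Return (left, top, right, bottom) of a stroke given as (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


def classify_stroke(points, edge_fraction=0.2):
    """Guess whether a stroke is an L ("snippet") or an inverted L ("tag").

    Assumption: y grows downwards, so the foot of an L has a larger y
    than the middle of the stroke. Returns None for a straight line.
    """
    left, top, right, bottom = bounding_box(points)
    width, height = right - left, bottom - top
    if width == 0 or height == 0:
        return None  # a straight line, not an L shape

    # y values of the points close to the left and right edges of the box.
    margin = edge_fraction * width
    near_left = [y for x, y in points if x - left <= margin]
    near_right = [y for x, y in points if right - x <= margin]

    def spread(values):
        return max(values) - min(values)

    # The vertical bar is on the side with the larger vertical spread;
    # the tip of the horizontal arm is on the opposite side.
    tip = near_right if spread(near_left) >= spread(near_right) else near_left
    tip_y = sum(tip) / len(tip)
    mid_y = top + height / 2

    # Arm in the lower half: L shape. Arm in the upper half: inverted L.
    return "snippet" if tip_y > mid_y else "tag"


# Hypothetical strokes: an L drawn down then right, and an inverted L.
l_shape = [(10, 10), (10, 30), (10, 50), (30, 50), (60, 50)]
inverted_l = [(10, 50), (10, 30), (10, 10), (30, 10), (60, 10)]
print(classify_stroke(l_shape), classify_stroke(inverted_l))
```

A real stroke will be wobbly, so the `edge_fraction` tolerance would need tuning against actual scribbles.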

I use
Code:
pdftohtml -xml
to produce an XML representation of the PDF. pdftohtml is easy to get for Linux, might be available for Macs, but will probably be hard to get for people on Windows. This is meant to be a proof of concept; perhaps someone out there who knows Java or C++ could use the poppler library, or some other library, to make a more platform-independent solution.
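
For anyone who wants to play with the XML without R, here is a rough Python sketch of the lookup step. It relies on the `<page>` and `<text>` elements that `pdftohtml -xml` writes, with their `top`, `left`, `width` and `height` attributes. The sample coordinates, the selection boxes and the function name `text_in_box` are invented for illustration, and it assumes the scribble coordinates have already been converted to the same units as the XML.

```python
import xml.etree.ElementTree as ET

# Cut-down sample in the shape of pdftohtml -xml output.
# The coordinates are made up; only the element layout matters here.
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="400" height="15" font="0">Europe has already said it will press the G7 to demand</text>
<text top="118" left="50" width="400" height="15" font="0">more disclosure from banks on their investments</text>
<text top="300" left="50" width="200" height="15" font="0">The <b>chancellor</b> spoke</text>
</page>
</pdf2xml>
"""


def text_in_box(xml_string, page_number, box):
    """Return the text runs on a page that overlap box = (left, top, right, bottom)."""
    left, top, right, bottom = box
    root = ET.fromstring(xml_string)
    hits = []
    for page in root.iter("page"):
        if int(page.get("number")) != page_number:
            continue
        for node in page.iter("text"):
            x = float(node.get("left"))
            y = float(node.get("top"))
            w = float(node.get("width"))
            h = float(node.get("height"))
            # Standard rectangle overlap test between the text run and the box.
            if x < right and x + w > left and y < bottom and y + h > top:
                # itertext() also picks up text inside <b> or <i> children.
                hits.append("".join(node.itertext()))
    return hits


snippet_lines = text_in_box(SAMPLE_XML, 1, (40, 95, 500, 140))
tag_lines = text_in_box(SAMPLE_XML, 1, (40, 290, 300, 320))
print(snippet_lines)
print(tag_lines)
```

Note that this returns whole text runs. It does not cut a line off at the edge of the selection, so it won't reproduce the truncated snippets shown in the results above.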

TODO:
  1. Use a different pen colour for mark-up. That way I can easily have notes and mark-up and keep them separate.
  2. See if I can extract an image of the selected area. That way the same approach could be used to extract images or tables (which don't export well to text because they have no structure).
  3. Automatically integrate the output with my bibliography tool.
  4. Store the snippets and tags in the PDF container directory and create a search tool to search snippets and tags directly on the iLiad (partially done).
  5. Encourage someone to rewrite this in Java or something similar so it can be easily used by other people, or better still, see if someone (not me) could do it by modifying ipdf directly.
  6. If no one else will port it, then I'll eventually try to get around to moving it to Jython.
Attached Files
File Type: pdf Business-merged-extract.pdf (48.5 KB, 504 views)

Last edited by daudi; 04-14-2008 at 05:32 AM. Reason: Clarified point about annotation, etc.