04-14-2008, 12:15 PM   #4
daudi
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
Quote:
Originally Posted by nekokami
Wow! Please keep us posted. I'm very interested in this functionality. And I know a Java programmer who might help.
Will do. If you could ask your Java programmer whether there is any Java-only way of getting an XML representation of a PDF, that would help to direct things. At the moment I am using pdftohtml to do that. pdftohtml is based on xpdf code, and I think this is the same as, or related to, the poppler library. AFAICT these are all C (C++?). I did a quick search for a Java implementation of poppler but could not find anything. Perhaps some other Java library can do this, but I don't know enough about Java to begin looking around efficiently.

The key thing is that the output of pdftohtml -xml is an XML file with the co-ordinates of each line of text. If this can be done using Java alone, then it will be easy to implement the rest and have the whole thing platform independent.
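
To give a rough idea of what "the rest" has to work with, here is a minimal sketch of reading per-line co-ordinates out of that kind of XML. It is in Python purely for illustration (it is not the R code I am using), and the sample document, the attribute names (top, left, width, height) and the function name text_lines are assumptions based on typical pdftohtml -xml output, so check them against a real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of `pdftohtml -xml` output. The element
# and attribute names here are assumed, not taken from a real file.
SAMPLE = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="108" left="135" width="300" height="17" font="0">First line of text</text>
    <text top="130" left="135" width="280" height="17" font="0">Second <b>line</b> of text</text>
  </page>
</pdf2xml>"""


def text_lines(xml_string):
    """Return one dict per <text> element with its page, box and text."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for t in page.iter("text"):
            lines.append({
                "page": page_no,
                "left": int(t.get("left")),
                "top": int(t.get("top")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                # itertext() also collects text inside nested tags like <b>
                "text": "".join(t.itertext()),
            })
    return lines


for line in text_lines(SAMPLE):
    print(line)
```

With each line's bounding box in hand, matching a scribble to the text it covers comes down to comparing co-ordinates, which should be possible in any language that has an XML parser.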

Quote:
Originally Posted by nekokami
I've used R, but I had no idea you could do stuff like this with it.
I use R for my work almost every day (not necessarily very deeply), so I'm more familiar with it than anything else. The R code to do this uses none of the things that make R special; I'm just using it as a scripting language. I've used R for all sorts of things unrelated to statistics. I suppose one good reason for using R here is that it is so easy to use its plotting facilities to plot the scribbles. This helped with debugging.