ebioman, thank you for your work with the pdfloc tags. I won't have much time until August to do further work, but I do want to get this working. A few quotations from this thread might help out.

Originally Posted by cedricp
I put a few lines of code to extract the information about the "text annotation", even though I don't know what to do with it. The format seems to be something like:
y: the vertical position in (lines x 8 + 1) in the page, ignoring blank lines
x1 and x2: both seem related to the position in chars, or the displacement
col: could be related to the columns or just a flag indicating the meaning of x1 and x2.
0 and 1 are always there in my tests.
Originally Posted by cedricp
If someone is so inclined, I think the JPedal library could help to recover the annotated text:

The coordinate system should even be consistent with the #pdfloc tag....
Originally Posted by computermacgyver
I would really love to get highlights to extract. I've played with the pdfloc tags, and it seems to be in some format similar to:
doc_id: document_id? same for all annotations in a document
page: page number, starting from 0
para: paragraph or line? starting from 0, restarting on new page
word: word on line, starting at 0
char: character, starting from 0, numbering restarts on each new word
0: always 0?
g: not sure, perhaps unimportant or related to deleted annotations?
1: always one?
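
To make that guessed layout easier to experiment with, here is a minimal Python sketch that splits the eight fields into named values. It rests on assumptions not confirmed anywhere in this thread: that the fields arrive as one comma-separated string in exactly the order listed above, and that page, para, word and char are plain integers. The names PdfLoc, parse_pdfloc, flag_a, g and flag_b are made up for illustration, and the sample value in the usage note is invented, not taken from a real annotation file.

```python
from typing import NamedTuple


class PdfLoc(NamedTuple):
    """One pdfloc position, using the field meanings guessed in this thread."""
    doc_id: str   # apparently the same for every annotation in a document
    page: int     # 0-based page number
    para: int     # paragraph (or line?), 0-based, restarts on each page
    word: int     # 0-based word index
    char: int     # 0-based character index within the word
    flag_a: str   # "always 0?" field, kept as raw text
    g: str        # meaning unknown, kept as raw text
    flag_b: str   # "always 1?" field, kept as raw text


def parse_pdfloc(fields: str) -> PdfLoc:
    """Split a comma-separated field list (an assumed syntax) into a PdfLoc."""
    parts = [p.strip() for p in fields.split(",")]
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {fields!r}")
    doc_id, page, para, word, char, flag_a, g, flag_b = parts
    return PdfLoc(doc_id, int(page), int(para), int(word), int(char),
                  flag_a, g, flag_b)
```

For example, parse_pdfloc("abcd,2,5,3,0,0,1,1") would give page 2 (the third page, since counting starts at 0), para 5, word 3, char 0. The three unexplained fields are left as strings so nothing is lost if they turn out not to be numbers.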
