I did some testing with this earlier: created the same document as ePub and as PDF and selected the same text in both.
For PDF it uses the same pdfloc style to indicate the start and end points. I couldn't find a way to reliably trace a pdfloc back to a specific point in the text, though.
For the ePub it used a different format, indicating the line and the position on that line. Again: a start and an end point. This makes me think we'd need a separate decoder per format, though with ePub, PDF and LRF handled we'd have the most important ones covered.
If you want I could upload my sample files including annotations. From what I've seen so far the process would be:
- Find the annotation for a specific document
- Decode the start and end positions
- Extract the text between those positions from the original document
- Add any annotations entered by the user
- Output it all in some kind of readable format.
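
The steps above could be sketched roughly like this in Python. This is only an illustration for the ePub-style case: it assumes a position has already been decoded into a 0-based (line, offset) pair and that the document text is available as a list of lines. The names Annotation, extract_text and format_annotation are made up for the sketch, and the real annotation format (especially pdfloc) would need its own decoder.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a decoded position is (line index, character offset), both 0-based.
Position = Tuple[int, int]


@dataclass
class Annotation:
    """Hypothetical container for one decoded annotation."""
    start: Position
    end: Position
    note: str = ""  # optional comment entered by the user


def extract_text(lines: List[str], start: Position, end: Position) -> str:
    """Return the text between start (inclusive) and end (exclusive)."""
    (start_line, start_off), (end_line, end_off) = start, end
    if (start_line, start_off) > (end_line, end_off):
        raise ValueError("start position lies after end position")
    if start_line == end_line:
        return lines[start_line][start_off:end_off]
    # Tail of the first line, all whole lines in between, head of the last line.
    parts = [lines[start_line][start_off:]]
    parts.extend(lines[start_line + 1:end_line])
    parts.append(lines[end_line][:end_off])
    return "\n".join(parts)


def format_annotation(lines: List[str], ann: Annotation) -> str:
    """Render the highlighted text plus the user's note, if any."""
    out = '"' + extract_text(lines, ann.start, ann.end) + '"'
    if ann.note:
        out += "\n  Note: " + ann.note
    return out
```

With a per-format decoder in front of it (one each for ePub, PDF and LRF), everything after the decoding step could stay the same for all formats.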