Ok, ekaser... so here's my grand plan's reformulation:
The text is basically parsed into a pTome wrapping class...
Each pTome contains an arbitrary number of pPar objects, which are assumed to be either paragraphs or lines with necessary line-breaks (like poems or quotations).
Each pPar object has 1) a classification [e.g.: paragraph, quotation, {chapter/section} title, et cetera] and 2) a pString object.
Each pString object has 1) a text string and 2) a formatting string.
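As a rough Python sketch of that nesting (the language and the field names are assumptions; only pTome, pPar, and pString come from the plan above, and the formatting string is taken to hold one marker per character with "0" meaning plain):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class pString:
    text: str
    formatting: str = ""  # assumed: one marker per character, "0" = plain

    def __post_init__(self):
        # Default to an all-plain formatting string of matching length.
        if not self.formatting:
            self.formatting = "0" * len(self.text)
        if len(self.formatting) != len(self.text):
            raise ValueError("formatting must be as long as text")


@dataclass
class pPar:
    classification: str  # e.g. "paragraph", "quotation", "title"
    string: pString


@dataclass
class pTome:
    pars: List[pPar] = field(default_factory=list)


tome = pTome([pPar("title", pString("The Beginning"))])
```

The length check is only there to catch a text string and a formatting string drifting out of step, which is the main hazard of keeping them in parallel.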
The pTome class would have accessor methods to facilitate high-level "posing" of the sort of questions I identified in my earlier post, but instead of words being preparsed, they would be parsed only on the fly whenever an accessor method needed them. I do not foresee a need to perform word-level operations, only to make word-level or higher-level queries.
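A minimal sketch of what on-the-fly word parsing could look like, assuming Python. The method names, the regular expression that defines a "word", and the simplification of each paragraph to a (classification, text) pair are all hypothetical, not part of the plan itself:

```python
import re
from typing import Iterator, List, Tuple

# Hypothetical definition of a "word": letters, with optional inner apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


class pTome:
    def __init__(self, pars: List[Tuple[str, str]]):
        # Each entry is (classification, text); formatting is left out here.
        self.pars = pars

    def words(self, par_index: int) -> Iterator[str]:
        # Words are split lazily, only when a query asks for them.
        for match in WORD_RE.finditer(self.pars[par_index][1]):
            yield match.group(0).lower()

    def pars_containing(self, word: str) -> List[int]:
        # Indices of every paragraph in which the word occurs.
        word = word.lower()
        return [i for i in range(len(self.pars)) if word in self.words(i)]

    def word_count(self, classification: str = "paragraph") -> int:
        # Total number of words in paragraphs of the given classification.
        total = 0
        for i, (kind, _) in enumerate(self.pars):
            if kind == classification:
                total += sum(1 for _ in self.words(i))
        return total
```

Nothing is stored per word, so the text stays the single source of truth and a query pays for tokenizing only the paragraphs it touches.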
Some outstanding decisions on my mind...
1) Color... I should probably include it in the formatting string. So I think I'll make the formatting "string" work not on the basis of bitfields but on something a bit more complex, so that if in the future I discover a reason to make the conversion from RTF or HTML more fine-grained, I can do so without much internal rewriting.
2) Links, footnotes, annotations... I am thinking these might have to be their own parallel "strings" (containing not Unicode characters but arbitrarily long sub-pStrings, or destinations in the case of links). After all, a given character could both be part of a link and sit right in front of a footnote mark. I'm not sure how annotations work in RTF, but they might also coexist with the previous two in certain complex cases.
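One way "something a bit more complex than bitfields" might look, sketched in Python under assumptions: formatting is kept as a list of spans, each carrying an open-ended attribute dictionary. The names FormatSpan, Formatting, add, and at are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FormatSpan:
    start: int             # index of the first character covered
    end: int               # one past the last character covered
    attrs: Dict[str, Any]  # open-ended: {"bold": True}, {"color": "#cc0000"}, ...


@dataclass
class Formatting:
    spans: List[FormatSpan] = field(default_factory=list)

    def add(self, start: int, end: int, **attrs: Any) -> None:
        self.spans.append(FormatSpan(start, end, attrs))

    def at(self, index: int) -> Dict[str, Any]:
        # Merge the attributes of every span covering this character.
        merged: Dict[str, Any] = {}
        for span in self.spans:
            if span.start <= index < span.end:
                merged.update(span.attrs)
        return merged


text = "Or, rather, it did not deal kindly with her."
fmt = Formatting()
fmt.add(12, 14, bold=True)         # the word "it"
fmt.add(0, 14, color="#cc0000")    # a made-up color run, for illustration only
```

A new attribute (color, font size, whatever a finer RTF or HTML conversion turns up) is then just another dictionary key, with no change to the storage layout.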
Can you think of a better way that doesn't introduce too much complexity?
With regard to the links, I think the link parallel string would only be a destination for location information "deposited" from a higher level... almost certainly by the owning pTome.
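A possible shape for those parallel layers, sketched in Python. Everything beyond the names pTome and pString is an assumption here: footnotes are keyed by the index of the character they follow, links are stored as spans with a named target, and a hypothetical resolve_links method lets the owning pTome deposit the resolved location:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (start, end, target name, resolved paragraph index or None if unresolved)
Link = Tuple[int, int, str, Optional[int]]


@dataclass
class pString:
    text: str
    # character index -> footnote body, itself a pString
    footnotes: Dict[int, "pString"] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)


@dataclass
class pTome:
    pars: List[pString] = field(default_factory=list)
    anchors: Dict[str, int] = field(default_factory=dict)  # target name -> paragraph index

    def resolve_links(self) -> None:
        # The tome "deposits" locations into each pString's link layer.
        for s in self.pars:
            s.links = [(a, b, name, self.anchors.get(name)) for a, b, name, _ in s.links]
```

Because the layers are separate, the same character can sit inside a link span and also carry a footnote without either layer knowing about the other.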
Basically... the following:
Code:
<h1>The Beginning</h1>
<p>
It was rather a new sort of experience<footnote>though
admittedly she's been on the run from the law before, but that
was a <i>long</i> time ago</footnote>, and she did not deal
well with it. Or, rather, <b>it</b> did not deal kindly with her.
</p>
would turn into (and I simplify, for the sake of readability):
Code:
pTome
|
|
|--- pPar[0]
| |
| |___ pClassification = "title"
| |___ pString
| |
| |__ "The Beginning" # text
| |__ "0000000000000" # formatting
| |__ "0000000000000" # links
| |__ "0000000000000" # annotations
| |__ "0000000000000" # footnotes
|
|--- pPar[1]
|
|___ pClassification = "paragraph"
|___ pString
|
|__ "It was rather a new sort of experience, and she did not deal well with it. Or, rather, it did not deal kindly with her."
|__ "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000bb000000000000000000000000000000" # formatting
|__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # links
|__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # annotations
|__ "0000000000000000000000000000000000000*0000000000000000000000000000000000000000000000000000000000000000000000000000000000" # footnotes
|
|_ pString
|
|__ "though admittedly she's been on the run from the law before, but that was a long time ago"
|__ "0000000000000000000000000000000000000000000000000000000000000000000000000000iiii000000000"
Does this seem a reasonable way to go about it all?
- Ahi