MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ahi · 09-02-2009, 04:29 PM

Ok, ekaser... so here's my grand plan's reformulation:

The text is basically parsed into a pTome wrapping class...

Each pTome contains an arbitrary number of pPar objects, which are assumed to be either paragraphs or lines with necessary line-breaks (like poems or quotations).

Each pPar object has 1) a classification [e.g.: paragraph, quotation, {chapter/section} title, et cetera] and 2) a pString object.

Each pString object has 1) a text string and 2) a formatting string.
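
A rough Python sketch of that nesting, just to make the shape concrete. The class names come from the plan above; the field names (`classification`, `content`, `pars`) and the "0 means no formatting" default are placeholder assumptions, nothing settled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class pString:
    text: str
    formatting: str = ""  # one formatting code per character of text

    def __post_init__(self):
        # Assumption: "0" stands for "no formatting", as in the diagram.
        if not self.formatting:
            self.formatting = "0" * len(self.text)
        # The two strings only make sense if they stay aligned.
        if len(self.formatting) != len(self.text):
            raise ValueError("formatting must be as long as text")


@dataclass
class pPar:
    classification: str  # e.g. "paragraph", "quotation", "title"
    content: pString


@dataclass
class pTome:
    pars: List[pPar] = field(default_factory=list)


tome = pTome([pPar("title", pString("The Beginning"))])
```

The length check in `__post_init__` is the one invariant everything else would lean on: index i in the formatting string always describes character i of the text.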

The pTome class would have accessor methods to facilitate high-level "posing" of the sort of questions I identified in my earlier post, but instead of words being preparsed, they would be parsed only on the fly whenever an accessor method needed them. I do not foresee a need to perform word-level operations, only to make word-level or higher-level queries.
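
To illustrate the "parse words only on demand" idea, here is a minimal sketch. It uses a stripped-down pTome that holds plain strings instead of pPar objects, and the accessor names `iter_words` and `count_word` as well as the word regex are made up for the example.

```python
import re
from typing import Iterator, List, Tuple

# Assumption: a "word" is a run of letters, optionally with inner apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


class pTome:
    def __init__(self, paragraphs: List[str]):
        self.paragraphs = paragraphs  # nothing is tokenized up front

    def iter_words(self) -> Iterator[Tuple[int, int, str]]:
        """Yield (paragraph index, character offset, word) lazily."""
        for par_index, text in enumerate(self.paragraphs):
            for match in WORD_RE.finditer(text):
                yield par_index, match.start(), match.group()

    def count_word(self, word: str) -> int:
        """A word-level query answered without storing any word list."""
        target = word.lower()
        return sum(1 for _, _, w in self.iter_words() if w.lower() == target)


tome = pTome([
    "The Beginning",
    "It was rather a new sort of experience, and she did not deal "
    "well with it.  Or, rather, it did not deal kindly with her.",
])
```

Because `iter_words` is a generator, a query that only needs the first match stops scanning early, and the stored text is never modified.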

Some outstanding decisions on my mind...

1) color... I should probably include it in the formatting string... so I think I'll make the formatting "string" work not on the basis of bitfields but on something a bit more complex, so that if in the future I discover a reason to make the conversion from RTF or HTML more fine-grained, I can do so without much internal rewriting.
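
One possible reading of "something a bit more complex than bitfields" is a small record per character, which can grow new attributes later without touching the code that walks the string. This is only a sketch under that assumption; `CharFormat` and `format_runs` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CharFormat:
    bold: bool = False
    italic: bool = False
    color: Optional[str] = None  # e.g. "#ff0000"; None means default color


PLAIN = CharFormat()


def format_runs(formats: List[CharFormat]) -> List[Tuple[int, int, CharFormat]]:
    """Collapse per-character formats into (start, end, format) runs."""
    runs: List[Tuple[int, int, CharFormat]] = []
    for i, fmt in enumerate(formats):
        if runs and runs[-1][2] == fmt:
            start, _, _ = runs[-1]
            runs[-1] = (start, i + 1, fmt)  # extend the current run
        else:
            runs.append((i, i + 1, fmt))    # start a new run
    return runs


text = "it did not"
red = CharFormat(color="#ff0000")
# "it" bold, " did " plain, "not" red: 2 + 5 + 3 = 10 characters
formats = [CharFormat(bold=True)] * 2 + [PLAIN] * 5 + [red] * 3
runs = format_runs(formats)
```

Adding, say, an underline flag later would mean one new field on `CharFormat`, with `format_runs` left untouched.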

2) links, footnotes, annotations... I am thinking these might have to be their own parallel "strings" (containing not Unicode characters but arbitrarily long sub-pStrings, or destinations in the case of links). After all, a given character could both be part of a link and sit right in front of a footnote mark. I'm not sure how annotations work in RTFs, but they might also coexist with the previous two in certain complex cases.

Can you think of a better way that doesn't introduce too much complexity?

With regard to the links, I think the link parallel string would only be a destination for location information "deposited" from a higher level... almost certainly by the owning pTome.
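
A small sketch of the parallel "strings" idea, assuming each track is simply a list with one slot per character: `None` for nothing, a destination string for links, and a sub-pString for footnotes or annotations. The method names `set_link` and `track` are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class pString:
    text: str
    formatting: str = ""
    links: List[Optional[str]] = field(default_factory=list)
    footnotes: List[Optional["pString"]] = field(default_factory=list)
    annotations: List[Optional["pString"]] = field(default_factory=list)

    def __post_init__(self):
        n = len(self.text)
        # Every track is kept exactly as long as the text.
        self.formatting = self.formatting or "0" * n
        self.links = self.links or [None] * n
        self.footnotes = self.footnotes or [None] * n
        self.annotations = self.annotations or [None] * n

    def set_link(self, start: int, end: int, destination: str) -> None:
        """Called from a higher level (e.g. the owning pTome) to deposit a target."""
        for i in range(start, end):
            self.links[i] = destination

    def track(self, slots) -> str:
        """Render a track the way the diagram does: '0' empty, '*' occupied."""
        return "".join("0" if slot is None else "*" for slot in slots)


body = pString("experience, and")
# The footnote hangs off the character right in front of the footnote mark.
body.footnotes[9] = pString("a long time ago", "00iiii000000000")
body.set_link(0, 10, "chapter-2")  # "chapter-2" is a made-up destination
```

Here the final "e" of "experience" is both inside a link and the anchor of a footnote, which is the overlap case that motivates separate tracks.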

Basically... the following:

Code:

<h1>The Beginning</h1>

<p>
It was rather a new sort of experience<footnote>though
admittedly she's been on the run from the law before, but that
was a <i>long</i> time ago</footnote>, and she did not deal
well with it.  Or, rather, <b>it</b> did not deal kindly with her.
</p>

would turn into (and I simplify, for the sake of readability):

Code:


pTome
|
|
|--- pPar[0]
|    |
|    |___ pClassification = "title"
|    |___ pString
|         |
|         |__ "The Beginning" # text
|         |__ "0000000000000" # formatting
|         |__ "0000000000000" # links
|         |__ "0000000000000" # annotations
|         |__ "0000000000000" # footnotes
|
|--- pPar[1]
     |
     |___ pClassification = "paragraph"
     |___ pString
          |
          |__ "It was rather a new sort of experience, and she did not deal well with it.  Or, rather, it did not deal kindly with her."
          |__ "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000bb000000000000000000000000000000" # formatting
          |__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # links
          |__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # annotations
          |__ "0000000000000000000000000000000000000*0000000000000000000000000000000000000000000000000000000000000000000000000000000000" # footnotes
                                                    |
                                                    |_ pString
                                                       |
                                                       |__ "though admittedly she's been on the run from the law before, but that was a long time ago"
                                                       |__ "0000000000000000000000000000000000000000000000000000000000000000000000000000iiii000000000"
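
For what it's worth, the transformation above can be reproduced with a few dozen lines on top of the standard library's `html.parser`. This is only one possible sketch, not how pacify.py does or will do it: `Track` and `SketchParser` are hypothetical names, footnotes are stored in a dict keyed by the index of the preceding character, and the input is assumed to close its paragraph with `</p>`.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"h1": "title", "p": "paragraph"}
INLINE_CODES = {"b": "b", "i": "i"}


class Track:
    """Text plus a same-length formatting string, built piece by piece."""
    def __init__(self):
        self.text = ""
        self.formatting = ""
        self.footnotes = {}  # index of preceding character -> Track

    def add(self, data, code):
        self.text += data
        self.formatting += code * len(data)

    def rstrip(self):
        self.text = self.text.rstrip()
        self.formatting = self.formatting[:len(self.text)]


class SketchParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pars = []    # list of (classification, Track)
        self.stack = []   # Tracks being filled; the last one is current
        self.code = "0"   # formatting code for incoming characters

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack = [Track()]
            self.pars.append((BLOCK_TAGS[tag], self.stack[0]))
        elif tag == "footnote" and self.stack:
            host, note = self.stack[-1], Track()
            host.footnotes[len(host.text) - 1] = note
            self.stack.append(note)
        elif tag in INLINE_CODES:
            self.code = INLINE_CODES[tag]

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.stack:
            self.stack[0].rstrip()
            self.stack = []
        elif tag == "footnote" and len(self.stack) > 1:
            self.stack.pop().rstrip()
        elif tag in INLINE_CODES:
            self.code = "0"

    def handle_data(self, data):
        if not self.stack:
            return  # text outside any block is ignored in this sketch
        current = self.stack[-1]
        data = data.replace("\n", " ")  # source line breaks become spaces
        if not current.text:
            data = data.lstrip()
        current.add(data, self.code)


SOURCE = """<h1>The Beginning</h1>

<p>
It was rather a new sort of experience<footnote>though
admittedly she's been on the run from the law before, but that
was a <i>long</i> time ago</footnote>, and she did not deal
well with it.  Or, rather, <b>it</b> did not deal kindly with her.
</p>
"""

parser = SketchParser()
parser.feed(SOURCE)
parser.close()
```

After this runs, `parser.pars[1][1].formatting` has "bb" at offsets 88-89 and the footnote hangs off offset 37, matching the diagram.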

Does this seem a reasonable way to go about it all?

- Ahi

09-02-2009, 04:29 PM	#39