MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ahi · 09-03-2009, 12:02 AM

Quote:

Originally Posted by ekaser

While on the subject, remember that 'annotations' can take the form of footnotes at the bottom of the page, endnotes at the end of the chapter, and annotations at the end of the book. Three unique places you have to keep track of and place them.

He he... and that is the primary reason I don't know what they are supposed to be in the RTF specification's paradigm. But it'll be easy enough to figure out.

Quote:

Originally Posted by ekaser

Maybe, maybe not (assuming I'm understanding what you mean by "word level operations"). If you mean that you assume that each word is 'correct' and won't need any modification, I'd say that's false, unless you consider contractions (as an example) as two or three words. For example, I've seen where spaces get inserted before or after the apostrophe (which would turn one word into two), and you very well might want, at some point, to remove hyphens as you wrap text, or re-hyphenate. Also, who knows? At some point WAY down the road, you (or someone else) might want to add a spell-checker or grammar-checker? Anyway, other than that possible thought, I'm with you so far.

I think part of what I was saying was contaminated by my knowledge (or preferred way) of how this would be processed. Even the sort of examples you give, I would probably try to process by looping through the entire text, character by character, and deciding on a per-character basis what to include in the output. The decisions would be informed by word-level and higher-level queries, but I don't think I necessarily need to build the function's output/return string (pString/pTome?) with any code that operates on words per se.

Take a function trying to detect a single quote embedded between two words, with no space on either side (like "with an 'increased'chance of precipitation"). While making queries to decide what to make of the apostrophe/single-quote character, it could come to the conclusion that these are likely two words, and as a result write "' " ( '&space; ) into the output, instead of just the apostrophe, before continuing to the next character in the text.
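A rough sketch of that kind of per-character pass, in Python. The function name repair_quotes and the "is a quote currently open?" rule are assumptions made up for illustration, not anything pacify.py actually does:

```python
def repair_quotes(text):
    """Walk the text one character at a time and decide what to emit.

    Hypothetical heuristic: a single quote with letters on both sides is
    taken as a closing quote missing its space only if an opening single
    quote was seen earlier; otherwise it is kept as an apostrophe.
    """
    out = []
    quote_open = False  # are we inside a 'single-quoted' span?
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
            continue
        # Word-level "queries": look at the neighbouring characters.
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        if not prev_ch.isalnum() and next_ch.isalnum():
            quote_open = True            # quote at the start of a word
            out.append(ch)
        elif prev_ch.isalnum() and next_ch.isalnum() and quote_open:
            quote_open = False           # closing quote glued to next word
            out.append("' ")             # write the quote plus a space
        else:
            if prev_ch.isalnum() and not next_ch.isalnum():
                quote_open = False       # ordinary closing quote
            out.append(ch)
    return "".join(out)


fixed = repair_quotes("with an 'increased'chance of precipitation")
```

The rule is deliberately crude: a contraction inside a quoted span (such as 'don't') would fool it, so a real version would need smarter queries. The point is only that the output is built character by character, with no word-level editing.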

Admittedly, I think it's not impossible that I might come to find myself wrong on this point of "not needing word-level accessors that change the data"... but it should be easy enough to amend the list of methods with a few more accessors.

Quote:

Originally Posted by ekaser

I agree. Think forward. It's only a matter of time (1 year? 5 years? Certainly not more than 10 years!) before monochrome eReaders go the way of monochrome TVs and monochrome computer displays. FAR easier to design in the structure NOW than to go back and retrofit later!

Yeah. Perhaps I should design the formatting portion with full granularity to match or even exceed the level of detail preserved in RTF... and, for now, just have the intake functions happily ignore the stuff I perceive to be of no interest. (precise font size, and similar things)

Quote:

Originally Posted by ekaser

Hmmm... footnotes... they have to be tied to a specific point in the body of the text (their anchor)... Perhaps... (and perhaps you've already considered this and just didn't include the detail) a 'line' in a pPar does not necessarily ALWAYS end in a line break? Then, the footnote becomes just one more pPar, with the beginning of the 'line' being one pPar (withOUT a line break), the footnote being a second pPar (withOUT a line break), and the rest of the line a third pPar (with or without a line break). The pPar for the footnote would obviously have to be of TYPE 'footnote', and there would have to be a separate flag for whether it ended in a line-break or not (ie, if there's a line-break after the "[1]" in the text). A footnote can occur in the middle of a paragraph/line OR at the end of it, so you need THAT flag, and then the string (text) of the footnote may or may not have multiple line breaks within it. ACTUALLY... I've seen some pretty hoary footnotes... you MAY want a pPar TYPE that is basically a pointer to a whole new pTome, so that the body of the footnote has its own 'environment'. I suspect that might simplify the handling of footnotes, in the long run.

Hmmm... you are absolutely right, and it troubles me a little.

pPar, for the sake of simplicity, needs to "end with a line break"... so to speak. In reality, I think I will end up stripping out line breaks, so the assumed line breaks at the end of pPar's will be crucial to the production of correct formatting in the output.

In other words, if I keep strictly with this idea, you are right in noting that a pPar might not be sufficient for every footnote... despite the vast majority of them almost certainly being just a few words or a line/paragraph.

Probably each footnote having its own pTome is the right idea... though the idea is unsettling at first thought, as I initially imagined pTome as containing the whole of a contiguous piece of text. I suspect though this is wholly subjective, with little foundation... as I cannot think of any obvious reason why this approach could/would cause problems.

Perhaps I need to think of pTome as a contiguous "grouping" of text. Whether that grouping is 1 book, 1 part, 1 chapter, or a mere 1 footnote/annotation.
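To make that concrete, here is a minimal sketch in Python. Only the names pTome and pPar come from the discussion; the fields, the render method and the offset-keyed footnotes dictionary are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PPar:
    """One paragraph; a line break at its end is implied, never stored."""
    text: str = ""
    # Hypothetical: footnotes keyed by the character offset of their anchor.
    footnotes: Dict[int, "PTome"] = field(default_factory=dict)


@dataclass
class PTome:
    """A contiguous grouping of text: a book, a chapter, or one footnote."""
    pars: List[PPar] = field(default_factory=list)

    def render(self) -> str:
        # Re-create the assumed line break after every pPar.
        return "".join(par.text + "\n" for par in self.pars)


# A multi-paragraph footnote gets its own small pTome...
note = PTome([PPar("First paragraph of a long footnote."),
              PPar("Second paragraph.")])

# ...and the body pPar only records where the footnote is anchored
# (offset 10, right after "Some claim").
body = PTome([PPar("Some claim that needs a note.")])
body.pars[0].footnotes[10] = note
```

With this shape a one-line footnote and a three-page footnote are handled by the same code path, since both are just a pTome holding one or more pPars.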

Quote:

Originally Posted by ekaser

Ummm... I'm confused. Your diagram is showing (to me) strings of bytes/flags/etc lined up with each character, one for each of formatting, links, annotations, and footnotes. Is that really what you mean? If so, how would that be used to "link up" link, annotation and footnote information? I'd think you'd need a POINTER to a data structure of some sort for each of those (a pTome or pPar, whatever) that holds all the pertinent info???

It seems to me a bit of a waste of storage, as most of those fields (HOWEVER they work) are going to be empty most of the time. What jumps to my mind is something like this: stick with a single "attribute" string that is completely parallel with the text string (whether that's an array of BYTE, WORD, DWORD, or class/structure, doesn't matter). One of the flags in the attribute for a letter is "footnote here", another is "link start", another "link end" (maybe, I wave magic wand), another is "annotation here". Then, the pPar structure (class, if you will) contains a pointer to a linked list of objects which contain the data (footnote, link target, etc) for each of those. The linked list is in the same sequential order as the items appear on the line of the pPar, so as you flow along the pPar line, you have another pointer that "flows along" the linked list of sub-pPars that are footnotes, link targets, annotations, etc. Or maybe not. I'm just not seeing quite what you're visualizing for those extra 'fields'.

Yeah... I'm terrible at doing little ASCII drawings of programming structures.

For links, footnotes, and annotations, my original thought was that (in python terms) I would use a dictionary object for each of those. In essence, what my diagram shows as "0" in those strings would in reality be numerical keys to which the dictionary was never assigned any data.

Code:

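linkLayer = {}  # sparse: only linked character positions ever get a key
linkLayer[28] = 'chapter 1, paragraph 5'
# the other four characters of the linked word reuse the same target
linkLayer[29] = linkLayer[28]
linkLayer[30] = linkLayer[28]
linkLayer[31] = linkLayer[28]
linkLayer[32] = linkLayer[28]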

The above would have the link layer be entirely empty, except for the five characters at keys 28 through 32 (counting from one, characters 29, 30, 31, 32, and 33 of the string). Presumably a 5-letter word. Read nothing into the crude link format I used ('chapter 1, paragraph 5'), as I have given no thought at all yet to how I'd handle intra-document links... although now that I'm thinking of it, they'd almost certainly be handled (both in RTF and HTML) by means of anchors, which should be fairly simple.

While the above still may waste space needlessly, it is nowhere near as bad as my original diagram suggested... though I'd definitely look into "assigning by reference" with python when I get to this part of the code. At least one target of my conversion activity is basically The Great Hungarian Defining Dictionary (not an accurate translation of its name, but it describes fairly well what it is). Basically 100 MB of RTF, with only a few small pictures here and there... and the way I want to convert it could easily produce over 100,000 intra-document links.
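On the "assigning by reference" point: Python assignment binds a reference to an existing object and does not copy it, so every key can point at one shared target string. A small sketch (the link_layer and target names are just for illustration):

```python
target = 'chapter 1, paragraph 5'

link_layer = {}
for pos in range(28, 33):      # keys 28, 29, 30, 31, 32
    link_layer[pos] = target   # stores a reference, not a copy

# Every entry is the very same string object.
shared = all(link_layer[pos] is target for pos in range(28, 33))

# Positions without a link simply have no key at all.
missing = link_layer.get(5)
```

The dictionary still pays for one entry per linked character, so with 100,000+ links the key overhead is what would be worth measuring, not the target strings.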

Also, keep in mind that not treating links with "link open here" and "link end here" sort of solutions lets links fall under the same automatic refactoring umbrella as formatting: a link becomes a per-character property, just like bold or italic. That has no relevance to annotations and footnotes though, on account of those being "one dimensionally referenced" (insert here, as opposed to span this length).
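When it comes time to write output, the per-character link keys can be collapsed back into spans. A minimal sketch, assuming a dictionary keyed by character position as in my example above (the function name link_spans is made up):

```python
def link_spans(link_layer):
    """Collapse a per-character link layer into (start, end, target) spans.

    'end' is exclusive. Neighbouring positions are merged only when they
    are adjacent and point at the same target.
    """
    spans = []
    for pos in sorted(link_layer):
        target = link_layer[pos]
        if spans and spans[-1][1] == pos and spans[-1][2] == target:
            # Extend the span that ends exactly at this position.
            spans[-1] = (spans[-1][0], pos + 1, target)
        else:
            spans.append((pos, pos + 1, target))
    return spans


link_layer = {pos: 'chapter 1, paragraph 5' for pos in range(28, 33)}
link_layer[40] = 'chapter 2, paragraph 1'
spans = link_spans(link_layer)
```

Each resulting span is where an output writer would open and close an anchor, so the "link start"/"link end" information is derived at the end rather than stored.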

I think the main reason for my wanting to have these three in additional parallel streams, in a way, is to be able to arbitrarily alter the main text without having to make my code do contortions to ignore the fact that (for a lot of practical considerations) a footnote does not really come between two characters of the main body text.
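One way the bookkeeping might look when the main text is edited while footnotes live in their own layer. This is only a sketch under my own assumptions (a dictionary of anchor offsets, and a made-up delete_range helper):

```python
def delete_range(text, footnote_layer, start, end):
    """Remove text[start:end] and shift footnote anchors to match.

    footnote_layer maps a character offset to a footnote. Simplification:
    anchors inside the deleted range are clamped to 'start', and two
    anchors landing on the same offset would overwrite one another.
    """
    removed = end - start
    new_text = text[:start] + text[end:]
    new_layer = {}
    for pos, note in footnote_layer.items():
        if pos <= start:
            new_layer[pos] = note            # before the edit: unchanged
        elif pos >= end:
            new_layer[pos - removed] = note  # after the edit: shift left
        else:
            new_layer[start] = note          # inside the edit: clamp
    return new_text, new_layer


text = "It rained  heavily."       # note the doubled space
footnotes = {18: 'see note 1'}     # anchored just before the full stop
new_text, new_footnotes = delete_range(text, footnotes, 9, 10)
```

The text cleanup never has to step over footnote content, because the footnote was never in the string to begin with; only its anchor offset moves.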

Quote:

Originally Posted by ekaser

Other than that, I think you're definitely on the right track!

Thanks. It's good to have constructive feedback!

I'm just sorry I won't really have time to get any meaningful work done on this until next week. But as I manage, I'll definitely update you on whatever developments/discoveries/realizations/thoughts come along the way.

- Ahi