09-01-2009, 01:58 PM | #31 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Ultimately, I think what would be achievable (and what is my short term goal): 1. Input plaintext from TXT or extract text with limited formatting information from RTF or HTML. 2. Fix-up character mish-mash... ("..." replace with "…", "--" with "–" or "—", et cetera) 3. Try to detect and, with user approval, correct erroneous paragraph breaks. 4. Smarten quotation marks. 5. Try to detect poems, letters, quotations and mark them somehow. (A third layer, with a single setting per line, as opposed to per unicode character?) 6. Try to detect part, chapter, section headers... possibly interactively with help from user to make more accurate. 7. Output with formatting intact into the chosen format. In the case of simple novels, with no multi-level headers (i.e.: chapters, sections, et cetera) I think such a process should be able to create a nearly perfect file even without user interaction. In the case of more complex novels, a fair bit of work would remain, but the existing "tagging" by the script ought to result in mostly false positives that would be fairly quick to correct. - Ahi |
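Steps 2 and 4 of the list above (character fix-ups and quote smartening) could be sketched in Python roughly like this. The function names and the exact replacement rules are illustrative guesses, not settled pacify behavior:

```python
import re

def fix_punctuation(text):
    """Minimal character fix-up pass: ellipses and dashes."""
    text = text.replace("...", "\u2026")                # ... -> …
    text = re.sub(r"(?<=\d)--(?=\d)", "\u2013", text)   # digit ranges -> en dash
    text = text.replace("--", "\u2014")                 # remaining -- -> em dash
    return text

def smarten_quotes(text):
    """Naive quote smartening: straight quotes to curly ones."""
    # An opening quote follows start-of-text, whitespace, or an open bracket.
    text = re.sub(r'(^|[\s(\[])"', "\\1\u201c", text)
    text = text.replace('"', "\u201d")   # the rest are closers
    text = re.sub(r"(^|[\s(\[])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")   # closers and apostrophes
    return text
```

Real input will hit the usual ambiguous cases (nested quotes, quotes opening after a dash), which is where the interactive user-approval step would come in.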
|
09-01-2009, 02:03 PM | #32 | ||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Another thought regarding justification formatting: it's not character by character formatting, but rather paragraph by paragraph, as it were. Justification (left, center, right, full) could be handled by two (or four, your choice) bits in the format 'byte' for the end-of-paragraph/block character (CR, LF, NUL, whatever you choose to use). That character doesn't need to worry about the normal character formatting issues, so it can be used to hold 'block' formatting information. For easiest sequential processing, you might want said block formatting info to apply to the FOLLOWING paragraph rather than the preceding one, in which case you'd want to start the document with a special end-of-block character that never gets emitted to the output, which is there simply to hold the block formatting info for the first block. Just stream of consciousness stuff, take it for what it's worth... |
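A minimal sketch of that "two bits in the format byte of the break character" idea; the constant names and bit layout are made up purely for illustration:

```python
# Hypothetical 2-bit justification field in the low bits of the
# end-of-paragraph character's format byte.
JUSTIFY_LEFT, JUSTIFY_CENTER, JUSTIFY_RIGHT, JUSTIFY_FULL = 0, 1, 2, 3
JUSTIFY_MASK = 0b0000_0011

def set_justification(fmt_byte, justify):
    """Store the block's justification without disturbing the other bits."""
    return (fmt_byte & ~JUSTIFY_MASK) | justify

def get_justification(fmt_byte):
    return fmt_byte & JUSTIFY_MASK
```

Whether the field applies to the preceding or the following block is a separate decision, as noted above.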
||
09-01-2009, 02:13 PM | #33 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Everett |
|
09-01-2009, 02:26 PM | #34 | |||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
My original long-ago grand plan for pacify was to parse the input text in such a way that I can correlate any given character of the text to the word that character is part of, the sentence it is a part of, and the paragraph it is a part of. Doing so, I still think, would make for some very nice high-level processing possibilities. Quote:
Quote:
The reason I've abandoned this idea is because:

1) The interconnections between words, paragraphs/lines, and characters seem to me to be a bit more complex than I can readily visualize... and hence I'm not too sure how to go about it. Not to mention that some characters are not really words (commas, periods, apostrophes, dashes) but need to appear at the word level, and some characters are not part of a paragraph but should appear at the paragraph level (a single newline character between paragraphs, or three newlines at the end of a chapter, before the next chapter's title)... not sure how to deal with these.

2) Even if I could create a class that goes from plaintext into this almost "database" sort of format (which does not even yet account for formatting or anything else), I haven't yet wrapped my mind around how I would update all levels while doing text processing... or, rather, how to do so in a simple enough way as not to fully counteract the simplicity of querying with the complexity of altering.

Still, I'd be curious to hear what you think. - Ahi |
|||
09-01-2009, 02:37 PM | #35 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Buffered input
More thoughts on "buffered input" (useful if the file is too large for Python to handle in memory all at once, and probably a good idea to do anyway, as you never know what kind of system someone might want to run it on, or what the limits of their resources will be)...
IF you're planning on generating a table of contents, or any other set of section listings at the front of the file, then what might be best is something like this:

1) Maintain, entirely in memory, the information about the overall structure of the file (sections, headings, etc) along with "section numbers" or spool file offsets for each section. Normally, I wouldn't expect there to be more than a few dozen entries in this structure, maybe up to a few hundred, but that might change depending upon what all info you choose to put there.

2) Create the input buffer and initialize the overall structure (which would also involve starting the first section, its first block, etc).

3) Create a temporary spool file to hold 'blocks' of processed data. Each time you finish a block in a fashion that you know you won't need to refer to it again until the "dump phase", write it out to the temp spool. As you detect new sections/chapters/etc, create a new entry on the "overall structure" heap (see #1 above) with the needed information, including the current point in the spool file (length).

4) Process, process, process...

5) When finished with the input, you now start the actual output file using the information in the overall structure, and then dump the rest of the file by re-reading the temp spool file and formatting it to the output in conjunction with the info in the overall structure heap.

Something along that line would allow you to process pretty much ANY size file on almost any machine, and would allow you to do Tables of Contents, etc. pretty easily. Another thing you could do is include, at the front of the document, notes to the user (perhaps as a command line option, or perhaps written to a separate output file, or just to standard out) about potential problem areas, so the user would know where to look. ...But I think you were already doing that, so never mind. |
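A rough Python sketch of that spool-file scheme, assuming a text-mode temp file and an in-memory section list; the class and method names are invented for illustration, and the toy dump() just emits a crude contents listing before replaying the spool:

```python
import tempfile

class SpoolWriter:
    """Sketch of the 'overall structure + temp spool file' idea: finished
    blocks go to a temp spool; section entries remember their spool offset."""
    def __init__(self):
        self.spool = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        self.sections = []   # (title, spool offset) pairs, kept in memory

    def start_section(self, title):
        self.sections.append((title, self.spool.tell()))

    def write_block(self, text):
        self.spool.write(text + "\n")

    def dump(self, out):
        # Emit a crude table of contents, then replay the whole spool.
        for title, _offset in self.sections:
            out.write("CONTENTS: %s\n" % title)
        self.spool.seek(0)
        out.write(self.spool.read())
```

A real dump phase would use the stored offsets to resolve links and page the sections out selectively; here they are only recorded.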
09-01-2009, 03:19 PM | #36 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
It seems to me that you really do need to read the whole file in (spooling it out to a temp file, re-reading the spool file to do more processing, re-spooling to another temp file, lather, rinse, repeat...), first breaking it up into 'apparent' sections, chapters, blocks, paragraphs, whatever. THEN, you can start processing what you've detected from the input, to see what you want to change. Maybe, roughly, something like this:

1) Read input file.
a) start block
b) add items (image, paragraph, links, etc) to spooled block
i) images are automatically a block unto themselves
ii) paragraphs of text are anything terminated by CR or LF or CR/LF (you have to keep track of what end-of-line sequence the source file uses and use the same thing again on the output).
iii) a link is a block, just like an image, but with associated text, and is terminated by the end of the link, not by EOL characters.
c) repeat a) and b) until the entire file has been parsed in.

Step 1) is all about reading in and parsing, NO modification (other than possible removal of duplicated tags, like bolds and italics, that will happen naturally as part of the conversion to the internal format).

2) Start processing the file.
a) Look for structural changes first.
i) Are there chapters/sections? If so, are any of them missing? Are any of them duplicated? Try to straighten these out first.
ii) Once the chapters/sections are located/fixed, try to find indents. These are indented quotes, poetry, etc. Just identify them.
iii) Try to identify intentional extra line breaks (between what looks like the end of one paragraph and the start of another, "scene breaks")
iv) Remove all other extraneous line breaks that aren't scene breaks and which don't go before or after indented quotes/poetry, i.e. extraneous duplicate sequential line breaks.
b) Look for grammatical problems (this is the toughest part!), the kinds of things you've already called out.
Most of these will remain WITHIN a paragraph-block, but it will also sometimes involve joining two paragraphs together that have been erroneously split. Step 2) will involve writing several copies of the temp spool file, as each successive output file becomes the input file to the next phase of things to look for. Once you have run out of things to look for and process, then:

3) Read the final temp spool file and output the final output file.

See, simple as that. Nothing to it. Truly, that's the way I'd approach it. It's a long hard slog, because you're not getting to the really cool, slick stuff for quite a while, but I'd first get all of the input/parsing/spooling/output stuff working nice and clean. Once that is done, then you can almost forget about it and focus on the 'good' stuff, and can take each part of it a step at a time, not worrying about how many times you read and write temporary spool files.

Another thing that occurs to me: there's probably going to be a bunch of changes that you make that you won't want to output to the user right away, as you'll want to refer them to specific line #'s of the final output file. So you may want to insert "change blocks" into the stream as you're processing. You can think of those "change blocks" as "debug information"; they will be completely omitted from the final output file and are simply skipped as you're working on the file, but their purpose is to generate the console/debug output information to the user AS you're writing that final output file... "Removed 3 blank lines at line 23", "Joined broken paragraphs at line 79", etc.

All intended merely as "food for thought"...

EDIT: Well, hell, I screwed the pooch on THAT formatting! |
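Step 2)a)iv and the repeated spool-pass idea could look something like this minimal sketch, where in-memory line lists stand in for the temp spool files and each pass consumes the previous pass's output:

```python
def collapse_blank_runs(lines):
    """One pass: collapse any run of blank lines down to a single blank
    (the surviving blank acting as the scene/paragraph separator)."""
    out, blanks = [], 0
    for line in lines:
        if line.strip():
            if blanks:
                out.append("")   # keep exactly one blank as the break
            out.append(line)
            blanks = 0
        else:
            blanks += 1
    return out

def run_passes(lines, passes):
    """Chain passes: each one reads the previous pass's output, just as
    each spool file becomes the input to the next phase."""
    for p in passes:
        lines = p(lines)
    return lines
```

A real implementation would keep the indented-quote/poetry regions identified in step 2)a)ii exempt from the collapsing.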
|
09-01-2009, 06:59 PM | #37 | |||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Another formatting error to watch for
I'm currently trying to "pretty up" a Bujold book (one I bought from Baen, that unfortunately CAME this way). Wherever they have italicized text, they've screwed up the spacing. Several examples:
Quote:
Notice there is no space between the first italicized sentence and the second sentence. Quote:
No space between 'sort' and italicized 'our'. Quote:
In general, they seem to delete the leading space when turning italics ON and sometimes insert an extraneous space when turning them OFF (usually when followed by punctuation). Anyway, I thought you might like to add these examples to your file of possible things to look for. |
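Assuming the italics survive as simple <i>...</i> tags by the time pacify sees them, a regex sketch of these three repairs might look like this; the patterns are guesses at the common cases, not a complete fix:

```python
import re

def fix_italic_spacing(html):
    """Repair converter-damaged spacing around <i>...</i> spans."""
    # 'sort<i>our</i>' -> 'sort <i>our</i>': leading space lost at italics-ON
    html = re.sub(r"(\w)<i>(?=\w)", r"\1 <i>", html)
    # '</i> ,' -> '</i>,': extraneous space inserted at italics-OFF
    html = re.sub(r"</i>\s+(?=[,.;:!?])", "</i>", html)
    # 'sentence.</i>Next' -> 'sentence.</i> Next': space lost after italics-OFF
    html = re.sub(r"</i>(?=\w)", "</i> ", html)
    return html
```

The ordering matters: the punctuation rule runs before the missing-space rule so the space it removes is never one the last rule just added.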
|||
09-01-2009, 08:47 PM | #38 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
I'll share with you tomorrow a potential solution to my "grand plan". - Ahi |
|
09-02-2009, 03:29 PM | #39 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Grand Plan!
Ok, ekaser... so here's my grand plan's reformulation:
The text is basically parsed into a pTome wrapping class... Each pTome contains an arbitrary number of pPar objects, which are assumed to be either paragraphs or lines with necessary line-breaks (like poems or quotations). Each pPar object has: 1) a classification [e.g.: paragraph, quotation, {chapter/section} title, et cetera], 2) a pString object. Each pString object has: 1) a text string, 2) a formatting string.

The pTome class would have accessor methods to facilitate high-level "posing" of the sort of questions I identified in my earlier post, but instead of words being preparsed, they would be parsed only on the fly whenever an accessor method needed them. I do not foresee a need to perform word-level operations, only to make word or higher level queries.

Some outstanding decisions on my mind...

1) Color... I should probably include it in the formatting string... so I think I'll probably make the formatting "string" not work on the basis of bitfields but something a bit more complex, so that if in the future I discover a reason to make the conversion from RTF or HTML more fine-grained, I can do so without much internal rewriting.

2) Links, footnotes, annotations... I am thinking these might have to be their own parallel "strings" (not containing unicode bytes, but rather arbitrarily long sub-pStrings, or destinations in the case of links). After all, a given character could be both part of a link and be (right in front of) a footnote (mark). I'm not sure how annotations work in RTFs, but those might also coexist with the previous two in certain complex cases. Can you think of a better way that doesn't introduce too much complexity?

With regard to the links, I think the link parallel string would only be a destination for location information "deposited" from a higher level... almost certainly by the owning pTome.

Basically... the following: Code:
<h1>The Beginning</h1>
<p>
It was rather a new sort of experience<footnote>though admittedly she's been on the run from the law before, but that was a <i>long</i> time ago</footnote>, and she did not deal well with it. Or, rather, <b>it</b> did not deal kindly with her.
<p>

Code:
pTome
|
|
|--- pPar[0]
| |
| |___ pClassification = "title"
| |___ pString
| |
| |__ "The Beginning" # text
| |__ "0000000000000" # formatting
| |__ "0000000000000" # links
| |__ "0000000000000" # annotations
| |__ "0000000000000" # footnotes
|
|--- pPar[1]
|
|___ pClassification = "paragraph"
|___ pString
|
|__ "It was rather a new sort of experience, and she did not deal well with it. Or, rather, it did not deal kindly with her."
|__ "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000bb000000000000000000000000000000" # formatting
|__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # links
|__ "000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" # annotations
|__ "0000000000000000000000000000000000000*0000000000000000000000000000000000000000000000000000000000000000000000000000000000" # footnotes
|
|_ pString
|
|__ "though admittedly she's been on the run from the law before, but that was a long time ago"
|__ "0000000000000000000000000000000000000000000000000000000000000000000000000000iiii000000000"
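The diagram above might translate into Python classes roughly like this; sparse dicts stand in for the link/footnote parallel strings, and everything beyond the pTome/pPar/pString names themselves is an illustrative guess, not settled design:

```python
class PString:
    """Text plus a parallel formatting string: one format char per text char."""
    def __init__(self, text):
        self.text = text
        self.formatting = "0" * len(text)  # e.g. 'b' = bold, 'i' = italic
        self.footnotes = {}                # char position -> footnote PTome
        self.links = {}                    # char position -> destination string

class PPar:
    """One paragraph (or line, for poems/quotations) with a classification."""
    def __init__(self, classification, text):
        self.classification = classification  # 'paragraph', 'title', 'quotation'...
        self.string = PString(text)

class PTome:
    """A contiguous grouping of text: a whole book, a chapter, or a footnote."""
    def __init__(self):
        self.pars = []

    def add_par(self, classification, text):
        par = PPar(classification, text)
        self.pars.append(par)
        return par
```

The word/sentence accessors discussed earlier would then live on PTome and parse on the fly from the pars list.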
- Ahi |
09-02-2009, 09:56 PM | #40 | |||||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Quote:
End of rambling thoughts... [EDIT: after typing all that, I've now read further and seen there's more on footnotes in your note... ] Quote:
Quote:
It seems to me a bit of a waste of storage, as most of those fields (HOWEVER they work) are going to be empty most of the time. What jumps to my mind is something like this: stick with a single "attribute" string that is completely parallel with the text string (whether that's an array of BYTE, WORD, DWORD, or class/structure, doesn't matter). One of the flags in the attribute for a letter is "footnote here", another is "link start", another "link end" (maybe, I wave magic wand), another is "annotation here". Then, the pPar structure (class, if you will) contains a pointer to a linked list of objects which contain the data (footnote, link target, etc) for each of those. The linked list is in the same sequential order as the items appear on the line of the pPar, so as you flow along the pPar line, you have another pointer that "flows along" the linked list of sub-pPars that are footnotes, link targets, annotations, etc.

Or maybe not. I'm just not seeing quite what you're visualizing for those extra 'fields'. Other than that, I think you're definitely on the right track! |
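One way to sketch that "flags plus ordered payload list" idea in Python, with a plain list playing the role of the linked list; the flag constants and names are invented for illustration:

```python
# Hypothetical per-character attribute flags.
FOOTNOTE_HERE = 0x01
LINK_START    = 0x02
LINK_END      = 0x04

def attach_payloads(attrs, payloads):
    """Walk the per-character attribute array and pair each flagged position
    with the next payload object, consuming the payload list in order."""
    it = iter(payloads)
    for pos, flags in enumerate(attrs):
        if flags & (FOOTNOTE_HERE | LINK_START):
            yield pos, next(it)
```

Because the payloads are stored in the same sequential order as the flags appear, a single pointer (here, the iterator) "flows along" with the text scan.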
|||||
09-02-2009, 10:09 PM | #41 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Footnotes/Annotations
While on the subject, remember that 'annotations' can take the form of footnotes at the bottom of the page, endnotes at the end of the chapter, and annotations at the end of the book. Three unique places you have to keep track of and place them.
|
09-03-2009, 12:02 AM | #42 | |||||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Quote:
A function trying to detect an instance of a single quote being embedded within two words, with no space on either side (like "with an 'increased'chance of precipitation") could, while making queries related to trying to decide what to make of the apostrophe/single-quote character, come to the conclusion that these are likely two words, and as a result write "' " ( '&space; ) into the output, instead of just the apostrophe, before continuing to the next character in the text. Admittedly, I think it's not impossible that I might come to find myself wrong on this point of "not needing word-level accessors that change the data"... but it should be easy enough to amend the list of methods with a few more accessors. Quote:
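A detection-only sketch of that single-quote heuristic, which flags candidates for user approval rather than rewriting them; the unmatched-opener rule is a guess at one workable criterion, not a proven one:

```python
import re

def find_suspect_quotes(text):
    """Flag single quotes glued between two letters where an unmatched
    opening quote precedes them, as in "an 'increased'chance". Plain
    contractions like "don't" have no pending opener, so they pass."""
    suspects = []
    open_pos = None
    for m in re.finditer("'", text):
        i = m.start()
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if open_pos is None and not prev.isalpha() and nxt.isalpha():
            open_pos = i          # looks like an opening quote: 'word
        elif open_pos is not None and prev.isalpha() and nxt.isalpha():
            suspects.append(i)    # closing quote glued to the next word?
            open_pos = None
    return suspects
```

A contraction inside an open quotation would still fool this, which is exactly why the post proposes user approval rather than automatic rewriting.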
Quote:
pPar, for the sake of simplicity, needs to "end with a line break"... so to speak. In reality, I think I will end up stripping out line breaks, so the assumed line breaks at the end of pPar's will be crucial to the production of correct formatting in the output. In other words, if I keep strictly with this idea, you are right in noting that a pPar might not be sufficient for every footnote... despite the vast majority of them almost certainly being just a few words or a line/paragraph. Probably each footnote having its own pTome is the right idea... though the idea is unsettling at first thought, as I initially imagined pTome as containing the whole of a contiguous piece of text. I suspect though this is wholly subjective, with little foundation... as I cannot think of any obvious reason why this approach could/would cause problems. Perhaps I need to think of pTome as a contiguous "grouping" of text. Whether that grouping is 1 book, 1 part, 1 chapter, or a mere 1 footnote/annotation. Quote:
For links, footnotes, and annotations, my original thought was that (in python terms) I would use a dictionary object for each of those. In essence what my diagram shows as "0" for those strings, in reality it would be numerical keys for which the dictionary was never assigned any data. Code:
linkLayer = {}
linkLayer[28] = 'chapter 1, paragraph 5'
linkLayer[29] = linkLayer[28]
linkLayer[30] = linkLayer[28]
linkLayer[31] = linkLayer[28]
linkLayer[32] = linkLayer[28]

While the above still may waste space needlessly, it is nowhere near as bad as my original diagram suggested... though I'd definitely look into "assigning by reference" with Python when I get to this part of the code. At least one target of my conversion activity is basically The Great Hungarian Defining Dictionary (not an accurate translation of its name, but it describes fairly well what it is). Basically 100 MB of RTF, with only a few small pictures here and there... and potentially my way of wanting to convert this could easily have over 100,000 intra-document links. Also, keep in mind, not treating links with "link open here" and "link end here" sort of solutions works to include links under the automatic formatting refactoring umbrella. Something though that has no relevance to annotations and footnotes, on account of those being "one dimensionally referenced" (insert here, as opposed to span this length). I think the main reason for my wanting to have these three in additional parallel streams, in a way, is to be able to arbitrarily alter the main text without having to make my code do contortions to ignore the fact that (for a lot of practical considerations) a footnote does not really come between two characters of the main body text. Thanks. It's good to have constructive feedback! I'm just sorry I won't really have time to get any meaningful work done on this until next week. But as I manage, I'll definitely update you of whatever developments/discoveries/realizations/thoughts along the way. - Ahi |
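If the per-character dict still feels wasteful, one hypothetical alternative is to store each link once as a span over character positions; nothing here is settled pacify API, just a sketch of the trade-off:

```python
# Store each link once as (start, end, destination), with 'end' exclusive,
# instead of one dict entry per covered character.
linkSpans = [(28, 33, 'chapter 1, paragraph 5')]

def link_at(spans, pos):
    """Return the destination of the link covering character position pos,
    or None if no link covers it."""
    for start, end, dest in spans:
        if start <= pos < end:
            return dest
    return None
```

Spans trade cheap storage for slightly more bookkeeping when the main text is edited, since every span boundary after the edit point must be shifted.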
|||||
09-03-2009, 09:58 AM | #43 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Next consideration...
... how to architect this for internationalization. My own personal need already involves two languages, English and Hungarian, and I would prefer adding additional ones to be a reasonably straightforward process. Part of me wonders whether the easiest thing might be to separate out the processing functions into language-specific Python modules to be included by the main pacify.py script as per need. But I'm all but certain that is at least an inelegant and probably a very substandard approach. What if, after all, a given document has text in two languages? (And both RTF and HTML have language tagging capacity... so applying the appropriate rules is at least conceivable.) I'm also aware that the best case scenario would be to somehow store the rules in an externally stored and loadable data/config file of some sort. But it might ultimately end up being overly complicated... How do you store as a data file a rule like: Quote:
Maybe have the main program contain a "skeleton" of all processing functions, which would then (based on command line options and/or imported metadata) in turn call language-specific versions of the processing questions on the fly at runtime? FixParagraphs(text) would load FixParagraphs.hu.py or FixParagraphs.en.py depending on language and do an eval("FixParagraphs_"+curlang+"(text)") I'll continue thinking on it further... - Ahi |
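A sketch of that skeleton-plus-language-modules dispatch using importlib instead of eval(), which avoids building code strings at runtime; the rules_xx module naming scheme is made up for illustration:

```python
import importlib

def get_rule(language, name):
    """Look up a language-specific processing function at runtime, e.g.
    FixParagraphs from a hypothetical rules_hu.py or rules_en.py module."""
    module = importlib.import_module("rules_%s" % language)
    return getattr(module, name)
```

Usage would be something like `fix_paragraphs = get_rule("hu", "FixParagraphs")` followed by `fix_paragraphs(text)`; a missing language module raises ImportError up front rather than failing inside an eval'd string.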
|
09-03-2009, 12:11 PM | #44 | |||||||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Quote:
An image
A sentence
A fragment of a sentence (down to possibly just a single word)
A break

A 'break' can be any one of:
Line break (such as at the end of each line of poetry)
Paragraph break (may trigger auto-first-line indent, etc)
Scene break (those extra blank lines in novels that separate scene changes)
Page break
Sub-section break
Section break
Chapter break
Part break
Book break

A pBlock then, is a COLLECTION (or LIST) of pBlocks AND pItems, representing anything including:
The entire 'book'
"Meta information" (not sure of right word, but title, author, copyright, etc)
A table of contents
A section or 'Part'
A chapter
A sub-chapter (potentially multi-level, as in text books, etc)
A header or footer (are you thinking of supporting those?)
A link
An anchor (target of a link)
A footnote, endnote, annotation
An index (a whole different can of worms)
A pItem

In other words, the Book is a pBlock containing a list of pBlocks comprising the Meta Info, TOC, a list of Parts OR a list of Chapters, maybe Annotations, maybe Index, and a Book-break. A PART pBlock would consist mostly of Chapter pBlocks (with maybe leading title (1,2,3 or I,II,II, etc) and/or quote/title). A CHAPTER pBlock would consist mostly of a collection of sentences and paragraph-breaks. And so on... Only at the sentence and sentence-fragment level would you have the formatting strings.

However, I'm not sure how all of that would fit into your ideas for "ease of testing and reformatting" the text. That's just how my brain tends to approach things: break them down, structure them into as few types of nesting and repeating items as possible. Whatever you do, you have to encode in your 'database' the STRUCTURE of the book, the FORMATTING of the text, and the TEXT itself. I'm just trying to throw grist into your mill... More random grist: bottom-of-page footnotes are problems.
Usually, they will get processed as soon as the "footnote link" is encountered, and their height gets removed from the height of that current page (or whatever), but sometimes the footnote reference will be too close to the bottom of the page for the entire footnote to fit in the remaining space on that page, so it has to get split and run over onto the bottom of the next page. Just grist. Another thought: it MAY be possible to use your 'standard' link/anchor mechanism for footnotes (after all, that's really what they are, the original hyperlinks). The only difference is where/how the target text is automatically placed. Quote:
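The pBlock/pItem nesting described above could be sketched like this; the class names follow the post, but the methods and kind strings are illustrative guesses:

```python
class PItem:
    """A leaf: an image, a sentence (or fragment of one), or a break."""
    def __init__(self, kind, payload=None):
        self.kind = kind        # 'image', 'sentence', 'break:paragraph', ...
        self.payload = payload  # text, image path, break subtype data, ...

class PBlock:
    """A node: an ordered list of PBlocks and PItems, representing anything
    from the whole book down to a chapter, TOC, or footnote."""
    def __init__(self, kind):
        self.kind = kind
        self.children = []

    def sentences(self):
        """Recursively yield every sentence-level leaf, depth-first."""
        for child in self.children:
            if isinstance(child, PBlock):
                yield from child.sentences()
            elif child.kind == 'sentence':
                yield child
```

Only the sentence/fragment leaves would carry the formatting strings, exactly as the post suggests; everything above them encodes structure.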
Quote:
Quote:
Quote:
|
|||||||
09-03-2009, 12:18 PM | #45 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Formatting of the text is DEFINITELY something that can vary greatly from language to language. For example, in English, a question always has a question mark at the END of the sentence. In Spanish, there's a mark at the START and end of the sentence. I'm afraid I'm not multi-lingual aware/talented enough to help much with this one. MY thought would be that you'd almost need completely separate "reformatting modules" for each language, as trying to come up with any kind of scripting or rule-based structure seems incredibly complex and difficult. You almost have to custom-create it for each language. But then, there are other folks MUCH smarter and more educated in that subject than me! |
|
|