MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ekaser · 09-03-2009, 12:11 PM

Quote:

Originally Posted by ahi

Perhaps I should design the formatting portion with full granularity to match or even exceed the level of detail preserved in RTF... and, for now, just have the intake functions happily ignore the stuff I perceive to be of no interest. (precise font size, and similar things)

It certainly seems to me that you should DESIGN for the possibility of using those things in the future, design to be as 'inclusive' as possible, while just stubbing-out and/or ignoring most of that stuff for now. But definitely at least put thought into how to support a lot of that stuff even if you may never use it. Extra thought now prevents MAJOR pain later... :-)

Quote:

Hmmm... you are absolutely right, and it troubles me a little.

I'm confronted with that reaction a lot, but most folks refuse to admit it...

Quote:

pPar, for the sake of simplicity, needs to "end with a line break"... so to speak. In reality, I think I will end up stripping out line breaks, so the assumed line breaks at the end of pPar's will be crucial to the production of correct formatting in the output.

I'm not sure I understand. In my mind, there's a high likelihood that you'll be adding and removing "end of line breaks" as you format, so they're really fairly fluid. It seems to me that really what you want, instead of pTome and pPar, is a pBlock and pItem, where a pBlock is a 'collection' of pBlocks AND pItems, and a pItem is a discrete 'piece' of text. (Call them what you will, I'm just using different terms than you to facilitate easier differentiation from what thoughts you have attached to pTome and pPar.) A pItem is always a 'chunk' of text, plain, simple, no nested or sub-sections of text, no attached footnotes, etc:

An image
A sentence
A fragment of a sentence (down to possibly just a single word)
A break

A paragraph is just a pBlock of sentences with a paragraph-break at the end. Potentially, a 'sentence' could be defined as "all the text in the paragraph or any sub-portion thereof", OR as "a standard sentence that ends with a period" or some such. I'd tend towards the former.

A 'break' can be any one of:

Line break (such as at the end of each line of poetry)
Paragraph break (may trigger auto-first-line indent, etc)
Scene break (those extra blank lines in novels that separate scene changes)
Page break
Sub-section break
Section break
Chapter break
Part break
Book break

"Chapter-break" is different from "page-break": page-break is just "start a new page", where "chapter-break" COULD be either "start a new page" OR "start a new EVEN numbered page" OR "start a new ODD numbered page". For novels, you're just looking at Line, Paragraph, Scene, Page and/or Chapter, and possibly Part breaks. For textbooks and some other non-fiction books, Section and multi-level Sub-section breaks (i.e., numbered sub-sections like 2.4.1, etc.) come into play.

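A rough sketch of how the break kinds and the chapter-break page rule might be encoded. The names Break, PageStart and next_page are invented for illustration and are not part of pacify.py:

```python
from enum import Enum, auto


class Break(Enum):
    """The kinds of break listed above (hypothetical names)."""
    LINE = auto()
    PARAGRAPH = auto()
    SCENE = auto()
    PAGE = auto()
    SUBSECTION = auto()
    SECTION = auto()
    CHAPTER = auto()
    PART = auto()
    BOOK = auto()


class PageStart(Enum):
    """Where a chapter-break is allowed to start its new page."""
    ANY = auto()
    EVEN = auto()
    ODD = auto()


def next_page(current_page, rule):
    """Return the page number on which text resumes after the break."""
    page = current_page + 1
    if rule is PageStart.EVEN and page % 2 == 1:
        page += 1  # skip the odd page, leaving it blank
    elif rule is PageStart.ODD and page % 2 == 0:
        page += 1  # skip the even page, leaving it blank
    return page
```

A plain page-break is then just next_page(n, PageStart.ANY), while a chapter-break carries one of the three rules.
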
A pBlock then, is a COLLECTION (or LIST) of pBlocks AND pItems, representing anything including:

The entire 'book'
"Meta information" (not sure of right word, but title, author, copyright, etc)
A table of contents
A section or 'Part'
A chapter
A sub-chapter (potentially multi-level, as in text books, etc)
A header or footer (are you thinking of supporting those?)
A link
An anchor (target of a link)
A footnote, endnote, annotation
An index (a whole different can of worms)
A pItem

"Meta information" can apply to both a Book and also to Parts and Chapters, as they have titles, too.

In other words, the Book is a pBlock containing a list of pBlocks comprising the Meta Info, TOC, a list of Parts OR a list of Chapters, maybe Annotations, maybe Index, and a Book-break.

A PART pBlock would consist mostly of Chapter pBlocks (with maybe a leading title (1, 2, 3 or I, II, III, etc.) and/or quote/title).

A CHAPTER pBlock would consist mostly of a collection of sentences and paragraph-breaks.

And so on...

Only at the sentence and sentence-fragment level would you have the formatting strings.
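
To make the nesting concrete, here is a minimal sketch of the pBlock/pItem idea, assuming Python dataclasses. The class names, the "role" and "kind" strings, and the formatting tuple are all placeholders of mine, not anything from pacify.py:

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class PItem:
    """A discrete chunk: text, an image, or a break. Never nested."""
    kind: str                 # "text", "image" or "break" (assumed labels)
    text: str = ""            # the words, or the break type for a break
    formatting: tuple = ()    # formatting lives only at this level


@dataclass
class PBlock:
    """A collection of PBlocks and PItems: book, part, chapter, footnote..."""
    role: str
    children: List[Union["PBlock", PItem]] = field(default_factory=list)

    def items(self):
        """Yield every PItem in document order, however deeply nested."""
        for child in self.children:
            if isinstance(child, PBlock):
                yield from child.items()
            else:
                yield child


book = PBlock("book", [
    PBlock("meta", [PItem("text", "A Sample Title")]),
    PBlock("chapter", [
        PItem("text", "It was a dark night. ", formatting=("italic",)),
        PItem("text", "Nothing stirred."),
        PItem("break", "paragraph"),
    ]),
    PItem("break", "book"),
])
```

With only two node types, any reformatting pass can walk book.items() without caring whether it is inside a part, a chapter, or a footnote.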

However, I'm not sure how all of that would fit into your ideas for "ease of testing and reformatting" the text. That's just how my brain tends to approach things: break them down, structure them into as few types of nesting and repeating items as possible. Whatever you do, you have to encode in your 'database' the STRUCTURE of the book, the FORMATTING of the text, and the TEXT itself.

I'm just trying to throw grist into your mill...

More random grist: bottom-of-page footnotes are problems. Usually, they will get processed as soon as the "footnote link" is encountered, and their height gets removed from the height of that current page (or whatever), but sometimes the footnote reference will be too close to the bottom of the page for the entire footnote to fit in the remaining space on that page, so it has to get split and run over onto the bottom of the next page. Just grist. Another thought: it MAY be possible to use your 'standard' link/anchor mechanism for footnotes (after all, that's really what they are, the original hyperlinks). The only difference is where/how the target text is automatically placed.
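
As a toy illustration of the splitting problem, assuming the footnote has already been broken into lines and that page space is counted in lines (both simplifications, and place_footnote is a made-up name):

```python
def place_footnote(note_lines, lines_left):
    """Split a footnote between the current page and the next one.

    Returns (lines that fit on this page, lines carried over).
    """
    room = max(lines_left, 0)
    fits = note_lines[:room]
    overflow = note_lines[room:]
    return fits, overflow


note = ["1. First line of the note,", "which runs on a bit,", "and then ends."]
this_page, next_page_lines = place_footnote(note, lines_left=2)
```

A real layout pass would also have to decide whether to push the referencing line itself to the next page instead of splitting the note.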

Quote:

Perhaps I need to think of pTome as a contiguous "grouping" of text. Whether that grouping is 1 book, 1 part, 1 chapter, or a mere 1 footnote/annotation.

Yes, I think so, ie, similar to my pBlock discussion above.

Quote:

At least one target of my conversion activity is basically The Great Hungarian Defining Dictionary (not an accurate translation of its name, but it describes fairly well what it is). Basically 100 MB of RTF, with only a few small pictures here and there... and potentially my way of wanting to convert this could easily have over 100,000 intra-document links.

Yikes! That could DEFINITELY be a challenge to hold all of that in memory at once, after you've converted everything into 'database' form. Building in file-spooling up front is, I think, a must.
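
One possible way to get spooling cheaply, sketched with the standard library's tempfile.SpooledTemporaryFile, which keeps data in memory until it passes a size limit and then moves it to disk. The ItemSpool class and its JSON-lines record format are my own assumptions, not a description of how pacify.py stores anything:

```python
import json
import tempfile


class ItemSpool:
    """Append-only store of records that spills to disk when it grows."""

    def __init__(self, max_in_memory=1_000_000):
        # Binary mode so that tell()/seek() work on plain byte offsets.
        self._file = tempfile.SpooledTemporaryFile(
            max_size=max_in_memory, mode="w+b")
        self._offsets = []  # byte offset of each record, by index

    def append(self, record):
        """Store a JSON-serialisable record and return its index."""
        self._file.seek(0, 2)  # always write at the end of the file
        self._offsets.append(self._file.tell())
        self._file.write(json.dumps(record).encode("utf-8") + b"\n")
        return len(self._offsets) - 1

    def get(self, index):
        """Read one record back without loading the others."""
        self._file.seek(self._offsets[index])
        return json.loads(self._file.readline())

    def close(self):
        self._file.close()


spool = ItemSpool(max_in_memory=16)  # tiny limit, to force spilling to disk
first = spool.append({"kind": "text", "text": "alma"})
second = spool.append({"kind": "link", "target": "entry-42"})
```

Only the offset list stays in memory, so 100,000 links cost one integer each rather than one full record each.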

Quote:

Also, keep in mind, not treating links with "link open here" and "link end here" sort of solutions works to include links under the automatic formatting refactoring umbrella. Something though that has no relevance to annotations and footnotes, on account of those being "one dimensionally referenced" (insert here, as opposed to span this length).

I'm not sure I entirely parsed those sentences the way you meant, but really there's no reason a footnote 'link' needs to be handled any differently than a 'hyperlink'. Yes, the footnote link is generally a 1-character link, but sometimes it's ** or *** or [1], [2], etc., and once you have to handle more than 1 character, there's really no difference between a footnote link and a hyperlink of arbitrary length. The only difference is that a footnote has to have its target text placed automatically, where the hyperlink target just "happens in the flow of the document".
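
Here is a small sketch of what I mean by one link type for both cases. The Link class, the auto_place flag and the render function are hypothetical; the only point is that the footnote and the hyperlink go through the same code path and differ in one flag:

```python
from dataclasses import dataclass


@dataclass
class Link:
    marker: str               # visible link text: "*", "[1]", or a phrase
    target: str               # id of the anchor the link points to
    auto_place: bool = False  # True for footnotes: target text gets placed


def render(items, targets):
    """Return (body text, footnotes to place) for a flat run of items."""
    body = []
    notes = []
    for item in items:
        if isinstance(item, Link):
            body.append(item.marker)  # same handling for every link
            if item.auto_place:
                notes.append(f"{item.marker} {targets[item.target]}")
        else:
            body.append(item)
    return "".join(body), notes


items = ["See the manual", Link("[1]", "fn1", auto_place=True),
         " or the ", Link("index", "idx"), "."]
targets = {"fn1": "Printed in 1962."}
body, notes = render(items, targets)
```

Here the hyperlink's target is left alone because it already sits somewhere in the document flow, while the footnote's target text is collected for automatic placement.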

Quote:

I think the main reason for my wanting to have these three in additional parallel streams, in a way, is to be able to arbitrarily alter the main text without having to make my code do contortions to ignore the fact that (for a lot of practical considerations) a footnote does not really come between two characters of the main body text.

The 'footnote' link or the 'footnote' text (target)? The link DOES come between two characters of the text, just like a hyperlink does, and flows along with it, just like any other 'railcar' in the 'train'. Or maybe I'm misunderstanding again?