MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ahi · 09-01-2009, 02:26 PM

Quote:

Originally Posted by ekaser

Another thought regarding justification formatting: it's not character by character formatting, but rather paragraph by paragraph, as it were. Justification (left, center, right, full) could be handled by two (or four, your choice) bits in the format 'byte' for the end-of-paragraph/block character (CR, LF, NUL, whatever you choose to use). That character doesn't need to worry about the normal character formatting issues, so it can be used to hold 'block' formatting information. For easiest sequential processing, you might want said block formatting info to apply to the FOLLOWING paragraph rather than the preceding one, in which case you'd want to start the document with a special end-of-block character that never gets emitted to the output, which is there simply to hold the block formatting info for the first block.

Just stream of consciousness stuff, take it for what it's worth...

I appreciate it more than you know! I've been hoping for just your sort of comments.
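For anyone following along, the block-formatting suggestion quoted above could be sketched roughly like this. It is only an illustration, not pacify's actual code: the two-bit layout, the `(char, format_byte)` pairs and every name here (`Justify`, `encode_blocks`, and so on) are assumptions made up for the example.

```python
from enum import IntEnum

class Justify(IntEnum):
    LEFT = 0
    CENTER = 1
    RIGHT = 2
    FULL = 3

JUSTIFY_MASK = 0b11   # assumed: low two bits of the format byte
BLOCK_END = "\n"      # assumed end-of-block character

def encode_blocks(blocks):
    """Turn [(Justify, text), ...] into a list of (char, format_byte) pairs.

    Each block marker comes BEFORE the paragraph it formats, so the
    stream always starts with a marker for the first block.
    """
    stream = []
    for justify, text in blocks:
        stream.append((BLOCK_END, int(justify) & JUSTIFY_MASK))
        stream.extend((ch, 0) for ch in text)  # 0 = no character formatting
    return stream

def decode_blocks(stream):
    """Recover [(Justify, text), ...]; assumes the stream starts with a marker."""
    blocks = []
    for ch, fmt in stream:
        if ch == BLOCK_END:
            blocks.append([Justify(fmt & JUSTIFY_MASK), []])
        else:
            blocks[-1][1].append(ch)
    return [(justify, "".join(chars)) for justify, chars in blocks]

def render(stream):
    """Emit the plain text; the leading marker is never written out."""
    return "".join(ch for i, (ch, _) in enumerate(stream)
                   if not (i == 0 and ch == BLOCK_END))

stream = encode_blocks([(Justify.CENTER, "Title"), (Justify.FULL, "Body text.")])
print(decode_blocks(stream))
print(repr(render(stream)))
```

Because each marker describes the paragraph that follows it, a sequential reader knows the justification before it sees the first character of the block.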

My original long-ago grand plan for pacify was to parse the input text in such a way that I could correlate any given character of the text to the word that said character is part of, the sentence that it is a part of, and the paragraph that it is a part of.

Doing so, I still think, would make for some very nice high-level processing possibilities.
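A minimal sketch of what such a character-to-word/sentence/paragraph lookup might look like. Everything here is hypothetical (the `TextIndex` class, its regexes, the `locate` method), and the sentence and word patterns are deliberately crude; real text would need something smarter.

```python
import re
from bisect import bisect_right

def spans(pattern, text):
    """Return the (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

class TextIndex:
    """Maps a character offset to the word, sentence and paragraph holding it."""

    def __init__(self, text):
        self.text = text
        # Crude assumptions: a word is a run of word characters (with inner
        # apostrophes), a sentence runs up to ., ! or ?, and a paragraph is
        # one line of text.
        self.words = spans(r"\w+(?:'\w+)*", text)
        self.sentences = spans(r"[^\s.!?][^.!?\n]*[.!?]*", text)
        self.paragraphs = spans(r"[^\n]+", text)

    @staticmethod
    def _find(span_list, pos):
        # Binary search for the last span starting at or before pos.
        i = bisect_right(span_list, (pos, float("inf"))) - 1
        if i >= 0 and span_list[i][0] <= pos < span_list[i][1]:
            return i
        return None  # pos falls between spans (space, newline, ...)

    def locate(self, pos):
        """Return (word_no, sentence_no, paragraph_no); None where not inside one."""
        return (self._find(self.words, pos),
                self._find(self.sentences, pos),
                self._find(self.paragraphs, pos))

index = TextIndex("Hello there. It's fine.\nNew para here.")
print(index.locate(14))  # the "t" of "It's"
print(index.locate(11))  # the first period: in a sentence, but in no word
```

Note that a lookup can come back with `None` at one level and a number at another, which is exactly the "some characters are not really words" wrinkle.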

Quote:

Is the current character a single quote?
Is there a single-quoted portion already open?
Is there a space to the left or the right of the quote?
Does the word immediately before it end with an "s"?

or

Quote:

Are both the previous and the next paragraphs approximately as long as the current one?
Are all three shorter than the average line-length in most of the document?
Does the first short paragraph that started this series of short paragraphs start with a comma?
Does the last short paragraph at the end of this series of short paragraphs end with a comma?

These are already the sort of questions my program asks... but it has no high-level function calls to accommodate such queries, so it's a bit harder to see/understand what it is doing in many places.
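To give an idea of what such high-level calls could look like, here is a rough sketch of the single-quote questions as small functions. None of these names exist in pacify; they are made up for illustration, and the open-quote test is a simple heuristic that a possessive like "the dogs' bones" inside a quotation would fool.

```python
import re

SINGLE_QUOTES = ("'", "\u2018", "\u2019")  # straight and curly single quotes

def is_single_quote(text, i):
    """Is the current character a single quote?"""
    return text[i] in SINGLE_QUOTES

def has_space_beside(text, i):
    """Is there whitespace to the left or the right of position i?"""
    left = i > 0 and text[i - 1].isspace()
    right = i + 1 < len(text) and text[i + 1].isspace()
    return left or right

def previous_word_ends_with_s(text, i):
    """Does the word touching position i on the left end with an "s"?"""
    match = re.search(r"\w+$", text[:i])
    return bool(match) and match.group().lower().endswith("s")

def single_quote_open(text, i):
    """Heuristic: is a single-quoted portion still open just before i?"""
    is_open = False
    for j in range(i):
        if text[j] not in SINGLE_QUOTES:
            continue
        before = text[j - 1] if j > 0 else " "
        after = text[j + 1] if j + 1 < len(text) else " "
        if not is_open and not before.isalnum() and after.isalnum():
            is_open = True    # looks like an opening quote
        elif is_open and not after.isalnum():
            is_open = False   # looks like a closing quote
    return is_open

text = "the boys' toys and 'quoted' text"
i = text.index("'")  # the apostrophe after "boys"
print(is_single_quote(text, i), single_quote_open(text, i),
      has_space_beside(text, i), previous_word_ends_with_s(text, i))
```

With calls like these, the decision logic would read almost like the list of questions itself.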

I've abandoned this idea for two reasons:

1) The interconnections between words, paragraphs/lines, and characters seem to me to be a bit more complex than I can readily visualize... and hence I'm not too sure how to go about it. Not to mention that some characters are not really words (commas, periods, apostrophes, dashes) but need to appear at the word level, and some characters are not part of a paragraph but should appear at the paragraph level (a single newline character between paragraphs, or three newlines at the end of a chapter, before the next chapter's title)... I'm not sure how to deal with these.

2) Even if I could create a class that goes from plaintext into this almost "database" sort of format... (which does not even yet account for formatting or anything else) I haven't yet wrapped my mind around how I would update all levels while doing text processing... or, rather, how to do so in a simple enough way not to fully counteract the simplicity of querying by the complexity of altering.
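One possible way around both problems, offered only as a sketch and not as something pacify does: keep a single flat list of typed tokens, let punctuation and newline runs be tokens in their own right, and derive the paragraph-level view on demand instead of storing it. Edits then touch just the one list. The token kinds and function names below are assumptions invented for the example.

```python
import re
from itertools import groupby

# Every character lands in exactly one token: a word, a run of newlines
# (separator), a run of other whitespace, or one punctuation character.
TOKEN_RE = re.compile(
    r"(?P<word>\w+(?:'\w+)*)|(?P<sep>\n+)|(?P<space>[^\S\n]+)|(?P<punct>[^\w\s])"
)

def tokenize(text):
    """Split text into (kind, value) pairs without losing any character."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def detokenize(tokens):
    """Rebuild the text from the token list."""
    return "".join(value for _, value in tokens)

def paragraphs(tokens):
    """Derived view: alternate ("paragraph", tokens) and ("separator", tokens)."""
    view = []
    for is_sep, group in groupby(tokens, key=lambda tok: tok[0] == "sep"):
        view.append(("separator" if is_sep else "paragraph", list(group)))
    return view

tokens = tokenize("Chapter One\n\n\nIt began, slowly.")
tokens[0] = ("word", "CHAPTER")   # alter one level only...
print(repr(detokenize(tokens)))
for kind, group in paragraphs(tokens):  # ...and re-derive the others
    print(kind, group)
```

Re-deriving the views after every change costs some speed, but it keeps altering about as simple as querying, since there is never a second structure to keep in sync.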

Still, I'd be curious to hear what you think.

- Ahi