Old 09-01-2009, 03:19 PM   #36
ekaser
Quote:
Originally Posted by ahi View Post
My original long-ago grand plan for pacify was to parse the input text in such a way that I can correlate any given character of the text to the word that said text is part of, the sentence that it is a part of, and the paragraph that it is a part of.

1) The interconnections between words, paragraphs/lines, and characters seem to me to be a bit more complex than I can readily visualize... and hence I'm not too sure how to go about it. Not to mention that some characters are not really words (commas, periods, apostrophes, dashes) but need to appear at the word level, and some characters are not part of a paragraph but should appear at paragraph level (a single newline character between paragraphs, or three newlines at the end of a chapter, before the next chapter's title)... not sure how to deal with these.

2) Even if I could create a class that goes from plaintext into this almost "database" sort of format... (which does not even yet account for formatting or anything else) I haven't yet wrapped my mind around how I would update all levels while doing text processing... or, rather, how to do so in a simple enough way not to fully counteract the simplicity of querying with the complexity of altering.
I agree, it's a complex issue. That makes it all the more important that you understand EXACTLY what it is you're trying (or want) to accomplish, and structure it right from the beginning. From reading your descriptions, it sounds like you want to make 'grammatical' changes as well as simple formatting ones.

It seems to me that you really do need to read the whole file in (spooling it out to a temp file, re-reading the spool file to do more processing, re-spooling to another temp file, lather, rinse, repeat...), first breaking it up into 'apparent' sections, chapters, blocks, paragraphs, whatever. THEN, you can start processing what you've detected from the input, to see what you want to change. Maybe, roughly, something like this:

1) Read input file.
a) start block
b) add items (image, paragraph, links, etc) to spooled block
i) an image is automatically a block unto itself
ii) paragraphs of text are anything terminated by CR or LF or CR/LF
(you have to keep track of what end-of-line sequence the source
file uses and use the same thing again on the output).
iii) a link is a block, just like an image, but with associated text, and
is terminated by the end of link, not by EOL characters.
c) repeat a) and b) until entire file has been parsed in.
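To make step 1 concrete, here is a minimal Python sketch (the thread doesn't say what language pacify is written in, so Python, the `("paragraph", ...)` / `("break", ...)` tuple shapes, and the function names `detect_eol` and `parse_blocks` are all illustrative assumptions, not pacify's actual internals). It detects the source file's end-of-line sequence, as 1.b.ii requires, and splits the text into blocks, keeping blank lines as explicit "break" records for the later passes:

```python
def detect_eol(text):
    # CR/LF must be checked first, since it contains both CR and LF
    if "\r\n" in text:
        return "\r\n"
    if "\r" in text:
        return "\r"
    return "\n"

def parse_blocks(text):
    """Step 1 sketch: split raw text into 'paragraph' and 'break' blocks,
    remembering the source EOL so the output pass can re-emit it."""
    eol = detect_eol(text)
    blocks = []
    for line in text.replace(eol, "\n").split("\n"):
        # anything terminated by the EOL sequence is a paragraph (1.b.ii);
        # an empty line becomes a 'break' record for step 2 to analyze
        blocks.append(("paragraph", line) if line.strip() else ("break", ""))
    return eol, blocks

eol, blocks = parse_blocks("One.\r\n\r\nTwo.\r\n")
# eol is "\r\n"; blocks alternate paragraph / break records
```

Images and links (1.b.i and 1.b.iii) would just be two more block kinds in the same stream; the point of the tuple representation is that every later pass sees one flat list of typed blocks.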

Step 1) is all about reading in and parsing, with NO modification (other than possible removal of duplicated tags, like bolds and italics, that will happen naturally as a part of the conversion to the internal format).

2) Start processing the file.
a) Look for structural changes first.
i) Are there chapters/sections?
If so, are any of them missing?
Are any of them duplicated?
Try to straighten these out first.
ii) Once the chapters/sections are located/fixed, try to find indents.
These are indented quotes, poetry, etc. Just identify them.
iii) Try to identify intentional extra line breaks (between what looks
like the end of one paragraph and start of another, "scene breaks")
iv) Remove all other extraneous line breaks that aren't scene breaks
and which don't go before or after indented quotes/poetry, i.e.
extraneous duplicate sequential line breaks.
b) Look for grammatical problems (this is the toughest part!), the kinds
of things you've already called out. Most of these will remain WITHIN
a paragraph-block, but it will also sometimes involve joining two
paragraphs together that have been erroneously split.
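Steps 2.a.iii and 2.a.iv, taken together, can be sketched in a few lines of Python (again an illustrative assumption, not pacify's code; the block tuples and `scene_break_min` threshold are hypothetical). Since each paragraph is already its own block, a single blank line between paragraphs carries no information and can be dropped, while a long enough run of blanks is kept as one explicit scene-break marker:

```python
def normalize_breaks(blocks, scene_break_min=2):
    """Steps 2.a.iii/iv sketch: paragraphs are already separate blocks, so
    lone 'break' records are redundant; a run of scene_break_min or more
    blanks is preserved as a single explicit 'scene_break' marker."""
    out, run = [], 0
    for kind, body in blocks:
        if kind == "break":
            run += 1          # count consecutive blank lines
            continue
        if run >= scene_break_min:
            out.append(("scene_break", ""))
        run = 0
        out.append((kind, body))
    if run >= scene_break_min:  # trailing blanks at end of file
        out.append(("scene_break", ""))
    return out
```

The indented-quote/poetry exception from 2.a.iv would be an extra check on the neighboring blocks before dropping a run; it's omitted here to keep the sketch short.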

Step 2) will involve writing several copies of the temp spool file, as each successive output file becomes the input file to the next phase of things to look for. Once you have run out of things to look for and process, then:

3) Read the final temp spool file and output the final output file.
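The spool-file chaining in steps 2 and 3 is mechanical enough to factor out once and forget, which is part of the appeal of this approach. A hedged Python sketch (`run_pipeline` and the text-to-text `passes` signature are my invention for illustration; a real tool might stream line by line instead of reading whole files):

```python
import os
import tempfile

def run_pipeline(input_path, output_path, passes):
    """Chain processing passes: each pass reads the previous spool file and
    writes the next one; the last spool file becomes the final output.
    `passes` is a non-empty list of text -> text functions."""
    current = input_path
    for transform in passes:
        fd, next_path = tempfile.mkstemp(suffix=".spool")
        with os.fdopen(fd, "w", encoding="utf-8") as dst, \
             open(current, "r", encoding="utf-8") as src:
            dst.write(transform(src.read()))
        if current != input_path:
            os.remove(current)  # previous spool file is no longer needed
        current = next_path
    os.replace(current, output_path)
```

Adding a new "thing to look for" is then just appending one more function to `passes`, which matches the "run out of things to look for" stopping rule above.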


See, simple as that. Nothing to it.

Truly, that's the way I'd approach it. It's a long hard slog, because you're not getting to the really cool, slick stuff for quite a while, but I'd first get all of the input/parsing/spooling/output stuff working nice and clean. Once that is done, then you can almost forget about it and focus on the 'good' stuff, and can take each part of it a step at a time, not worrying about how many times you read and write temporary spool files.

Another thing that occurs to me: there will probably be a bunch of changes you make whose messages you won't want to output to the user right away, because you'll want to refer the user to specific line #'s of the final output file. So you may want to insert "change blocks" into the stream as you're processing. Think of those "change blocks" as "debug information": they are completely omitted from the final output file and are simply skipped as you're working on the file, but their purpose is to generate the console/debug output to the user AS you're writing that final output file... "Removed 3 blank lines at line 23", "Joined broken paragraphs at line 79", etc.
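The change-block idea falls out naturally if the final writer skips one extra block kind. A minimal Python sketch, continuing the same assumed tuple representation (`write_final` and the `("change", ...)` kind are illustrative names, not anything pacify actually defines):

```python
def write_final(blocks, eol, path):
    """Final pass sketch: 'change' records are debug info. They are never
    written to the output file, but each is reported together with the
    output line number it lands at."""
    messages = []
    line_no = 1
    with open(path, "w", encoding="utf-8", newline="") as out:
        for kind, body in blocks:
            if kind == "change":
                # skipped in the file; reported keyed to the NEXT output line
                messages.append(f"{body} at line {line_no}")
                continue
            out.write(body + eol)
            line_no += 1
    return messages
```

Because the line number is computed while writing the final file, the messages automatically refer to lines the user can actually find in the output, no matter how many intermediate spool passes shuffled things around.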

All intended merely as "food for thought"...

EDIT: Well, hell, I screwed the pooch on THAT formatting!