MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ahi · 09-01-2009, 02:58 PM

Quote:

Originally Posted by ekaser

EDIT: Note, I'm not suggesting you include a 'general' HTML or LaTeX parser, any more than you're going to have a general RTF parser, just handle the "basic stuff" that you want to keep and throw everything else away. Sure, some files it would make a mess of, but those files probably wouldn't be appropriate for this style of conversion either. I'm assuming this is aimed at "simple novel" types of books that don't have a lot of fancy formatting to start with. One thought: since you're taking RTF as input files, some of those will have images (covers, maps, etc.), so I'm hoping that those image tags would be maintained along with the bold, italic, etc., right? That would imply the need to be able to include a "numbered mark" in the formatting string. Perhaps if the most significant bit of the formatting 'character' was set, then the lower bits are the 'number' of the image (on the "image stack") that should be inserted at that point. Of course, that then also brings up the question of image positioning: left, center, right.

I assumed that's what you had meant. And yeah, this would be mainly for reasonably simply formatted stuff... though I am also doing work with a number of large dictionaries and lexicons... so if I figure out a sane way of autodetecting "the right way" to handle them, I'd include a command-line switch for that too.
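To make the "numbered mark" idea from the quote a bit more concrete, here is a rough sketch of how a per-character formatting value could carry either style bits or an image number. Everything specific here is an assumption for illustration only: the 8-bit width, the bit assignments for bold and italic, and the names `image_marker` and `describe` are made up, not something pacify.py does.

```python
# Assumed layout of one 8-bit formatting value (illustrative only):
#   high bit clear -> the low bits are style flags
#   high bit set   -> the low 7 bits are an index into the "image stack"
BOLD = 0x01
ITALIC = 0x02
IMAGE_FLAG = 0x80
INDEX_MASK = 0x7F


def image_marker(index: int) -> int:
    """Build a formatting value that means 'insert image number index here'."""
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("image index does not fit in the low 7 bits")
    return IMAGE_FLAG | index


def describe(fmt: int) -> str:
    """Decode a formatting value into a readable label."""
    if fmt & IMAGE_FLAG:
        return f"image #{fmt & INDEX_MASK}"
    names = [name for bit, name in ((BOLD, "bold"), (ITALIC, "italic")) if fmt & bit]
    return "+".join(names) or "plain"


# One formatting value per character of text; the "*" is a placeholder
# character that stands in for the image.
text = "Hi*"
formats = [BOLD, BOLD | ITALIC, image_marker(3)]
labels = [describe(f) for f in formats]
```

With this layout a single value cannot be both styled and an image marker, and image positioning (left, center, right) would still need bits of its own somewhere.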

Ultimately, here is what I think would be achievable (and what my short-term goal is):

1. Input plaintext from TXT or extract text with limited formatting information from RTF or HTML.
2. Fix up character mish-mash (replace "..." with "…", "--" with "–" or "—", et cetera).
3. Try to detect and, with user approval, correct erroneous paragraph breaks.
4. Smarten quotation marks.
5. Try to detect poems, letters, quotations and mark them somehow. (A third layer, with a single setting per line, as opposed to per Unicode character?)
6. Try to detect part, chapter, section headers... possibly interactively, with help from the user to make it more accurate.
7. Output with formatting intact into the chosen format.
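
Steps 2 and 4 are the most mechanical ones, so a minimal sketch of them might look like the following. The function names and the heuristics are assumptions, not the actual script: "--" between two digits is taken to be a range and gets an en dash, any other "--" gets an em dash, and a quote counts as "opening" when it sits at the start of the text or right after whitespace or an opening bracket.

```python
import re


def fix_characters(text: str) -> str:
    """Step 2 (sketch): replace ASCII stand-ins with typographic characters."""
    text = text.replace("...", "\u2026")                 # three dots -> ellipsis
    text = re.sub(r"(?<=\d)--(?=\d)", "\u2013", text)    # 1914--1918 -> en dash
    text = text.replace("--", "\u2014")                  # remaining "--" -> em dash
    return text


def smarten_quotes(text: str) -> str:
    """Step 4 (sketch): turn straight quotes into curly ones."""
    # A double quote at the start or after whitespace/an opening bracket opens.
    text = re.sub(r'(?:^|(?<=[\s(\[{]))"', "\u201c", text)
    # Every double quote still left is treated as closing.
    text = text.replace('"', "\u201d")
    # Same rule for single quotes; an opening double quote may also precede one.
    text = re.sub(r"(?:^|(?<=[\s(\[{\u201c]))'", "\u2018", text)
    # Whatever remains is an apostrophe or a closing single quote.
    text = text.replace("'", "\u2019")
    return text


def fix_up(text: str) -> str:
    """Run both steps in list order."""
    return smarten_quotes(fix_characters(text))
```

A heuristic this simple gets leading apostrophes wrong ('Tis, '90s) and cannot tell what a quote right after a dash is doing, which is one reason the other steps talk about user approval.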

In the case of simple novels, with no multi-level headers (e.g. chapters, sections, et cetera), I think such a process should be able to create a nearly perfect file even without user interaction.

In the case of more complex novels, a fair bit of work would remain, but the script's "tagging" ought to err mostly on the side of false positives, which would be fairly quick to correct.
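
As an illustration of why false positives are the likely failure mode, here is a small sketch of header tagging for step 6. The regular expression, the 60-character cutoff, and the name `tag_headers` are all assumptions picked for the example, not the script's real rules.

```python
import re

# Assumed pattern: "Chapter 12", "PART II", "Section 3: Title", and so on.
HEADER_RE = re.compile(
    r"^\s*(part|chapter|section)\s+(\d+|[ivxlcdm]+)\b[.:]?\s*(.*)$",
    re.IGNORECASE,
)


def tag_headers(lines):
    """Return (line_number, level, text) for each line that looks like a header."""
    candidates = []
    for number, line in enumerate(lines, start=1):
        match = HEADER_RE.match(line)
        # Only short lines are tagged; long ones are more likely body text.
        if match and len(line.strip()) <= 60:
            candidates.append((number, match.group(1).lower(), line.strip()))
    return candidates


sample = [
    "CHAPTER I",
    "",
    "It was a dark night.",
    "Part 2 of the plan was simple.",
    "Chapter 2: The Storm",
]
tagged = tag_headers(sample)
```

Here `tagged` contains lines 1 and 5, which are real headers, and also line 4, which is ordinary prose that happens to start with "Part 2". Candidates like that are easy to reject when reviewing a short list, whereas a header the script never tagged would have to be found by reading the whole text.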

- Ahi