MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ahi · 09-28-2009, 01:46 PM

Quote:

Originally Posted by frabjous

Can you flag a character as "beginning a new cell" and/or "beginning a new row"? (Inserting a tab before the former, and a linefeed before the latter may be sufficient for plain text output.)

I think the basic idea of the script is consistent with simply stripping things like the lines and border styles around the tables and between the cells. Losing column alignment is a bit more of a cost, but maybe that can be preserved somehow?

Alignment in general is a bit of an issue...

Bold and italic text is the sort of thing that one can reasonably assume the source document uses "correctly" (for a reasonably broad definition of "correct"). Alignment tomfoolery, however, is used for different things that *correctly* ought to be handled in different ways.

Just in the eBooks I've been playing around with thus far...

Centred text can mean a chapter, a subtitle, a chapter summary, book metadata, et cetera.

Right-aligned text can mean an epigraph, a signature, a date, et cetera.

When outputting HTML, arguably the limitations of the output format mean that simply centering or right-aligning the text as it was in the source is good enough. But for LaTeX output, it would be much preferable to handle each of those different things correctly in terms of LaTeX's memoir class.
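To make the HTML/LaTeX difference concrete, here is a rough Python sketch, not anything taken from pacify itself. The function names, the label names ("chapter", "epigraph", "signature") and the template table are all made up for illustration; the only LaTeX used is \chapter and \epigraph (both available in memoir) plus the standard flushright environment.

```python
import html

def to_html(text, align):
    """HTML output: simply reproduce the source alignment."""
    return '<p style="text-align: {}">{}</p>'.format(align, html.escape(text))

# Hypothetical classification labels mapped to LaTeX templates.
# Doubled braces are literal braces for str.format().
LATEX_TEMPLATES = {
    "chapter": "\\chapter{{{text}}}",
    "epigraph": "\\epigraph{{{text}}}{{}}",
    "signature": "\\begin{{flushright}}\n{text}\n\\end{{flushright}}",
}

def to_latex(text, label):
    """LaTeX output: pick the markup from what the text *is*,
    not from how it happened to be aligned in the source.
    Note: this sketch does not escape LaTeX special characters."""
    template = LATEX_TEMPLATES.get(label, "{text}")
    return template.format(text=text)

print(to_html("Chapter One", "center"))
print(to_latex("Chapter One", "chapter"))
```

The point of the sketch is that to_html() only needs the alignment, while to_latex() cannot do its job until something has decided whether a centred line is a chapter, a subtitle, or metadata.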

Admittedly perhaps cell alignment in a table is on par with bold/italic formatting in a paragraph... one can trust that it is correct as is, and needs no context-dependent special handling.

I think I need to rethink how the formatting/classification is handled. (Fortunately it won't be too much work to fix/update.)

I think I need to separate formatting from classification (and from footnotes/annotations/et cetera) like I originally intended. Formatting needs to be handled and mangled on its own, unfettered by miscellaneous non-formatting stuff.

I am actually starting to think that the power of pacify will ultimately derive from the simplicity of its approach of dealing with (mostly) one thing at a time: either the text, the formatting, or the content classification.

---

And, to answer your question: yes, marking table structure and table cells in the classification layer/stream is probably the right approach. That takes pacify toward its natural conclusion: using the text and formatting layers to generate the classification layer, but using only the text and classification layers (i.e., not the formatting layer) to generate its output. For the simplest stuff (bold/italics), the formatting and classification layers will encode more or less the same information, but the classification layer should ultimately be able to tell chapters, poems, et cetera apart from regular text.
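A minimal sketch of that layering, in Python, under loudly stated assumptions: the list-of-runs layout, the dictionary keys ("align", "cell", "row_start"), the labels, and the "short centred line is a chapter" rule are all invented here for illustration and are not how pacify actually stores anything. The plain-text renderer follows the suggestion in the quoted question: a linefeed before a new row, a tab before a new cell.

```python
def classify(text, formatting):
    """Build the classification layer from the text and formatting layers."""
    labels = []
    for run, fmt in zip(text, formatting):
        if fmt.get("row_start"):
            labels.append("row")       # first cell of a new table row
        elif fmt.get("cell"):
            labels.append("cell")      # any later cell in the same row
        elif fmt.get("align") == "center" and len(run) < 60:
            labels.append("chapter")   # crude guess, for illustration only
        else:
            labels.append("body")
    return labels

def render_plain(text, labels):
    """Generate plain text from the text and classification layers only."""
    out = []
    for run, label in zip(text, labels):
        if label == "cell":
            out.append("\t" + run)                  # tab before a new cell
        elif label == "chapter":
            out.append("\n" + run.upper() + "\n")
        else:
            out.append("\n" + run)                  # linefeed before rows and body text
    return "".join(out).lstrip("\n")

# One entry per run of text; the layers are kept parallel.
text = ["Chapter One", "It was a dark night.", "Name", "Age", "Ann", "31"]
formatting = [
    {"align": "center"}, {},
    {"row_start": True}, {"cell": True},
    {"row_start": True}, {"cell": True},
]

labels = classify(text, formatting)
print(render_plain(text, labels))
```

Note that render_plain() never looks at the formatting layer; once the labels exist, an HTML or LaTeX renderer could be swapped in without touching classify().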

- Ahi