MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ekaser · 09-01-2009, 11:12 AM

Quote:

Originally Posted by ahi

A lot of what I am trying to do with pacify.py is going to be text processing... but at the same time, I do want to be able to handle some light formatting--bold, italics, maybe a bit more.

Unfortunately any obvious/straightforward way of handling formatting interferes with the straightforwardness of any text processing.

I have a vague idea in my head about creating a class in python that would facilitate both formatting and text processing concerns, by keeping content in the following manner:

For any string of length X, it would store two strings of length X. The first stored string would be the plaintext, the second stored string would be byte-long bitfields that provide formatting information.

And then any operation done on the plaintext (via the class's methods) would perform the equivalent operation on the formatting string. This way content and formatting could be dealt with separately without having to painstakingly escape formatting instructions for any text-processing operation.

Are you aiming this strictly at English? If not, or if you think you or someone else might want to adapt it to other languages at some point, you might want to use WORD (16-bit) arrays from the start rather than BYTE arrays, so that Unicode or other character sets could be adopted more easily later. That would also give you a few more "formatting options", with 16 flags instead of just 8.
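As a rough sketch of the wider-flags idea in Python (Python 3 strings are already Unicode, so only the formatting side needs the extra width): the flag names and bit positions below are made up for illustration and are not taken from pacify.py.

```python
from array import array
from enum import IntFlag


class Fmt(IntFlag):
    """Hypothetical formatting flags; names and bit positions are assumptions."""
    BOLD = 1 << 0
    ITALIC = 1 << 1
    UNDERLINE = 1 << 2
    SUPERSCRIPT = 1 << 8  # bit 8 would not fit in a byte-wide slot


text = "naïve café"

# 'H' is an unsigned short (at least 16 bits), so each character gets 16 flag bits.
flags = array("H", [0] * len(text))

# Mark the first word as bold superscript.
for i in range(5):
    flags[i] |= int(Fmt.BOLD | Fmt.SUPERSCRIPT)

# Decode one slot back into named flags.
first = Fmt(flags[0])
```

Here `Fmt.SUPERSCRIPT in first` is true, which a byte-per-character layout could not represent without dropping one of the lower flags.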

Every data storage method has its advantages and disadvantages. For this type of data-stream/formatting combination, you've pretty much got:
1) in-stream (data and formatting mixed in same stream of bytes)
2) parallel streams (what you're considering)
3) in-stream flags (a combo of 1) and 2), with wider 'bytes' (WORDs or DWORDs) that carry the flags in the upper bits)
4) packets (blocks of text with common formatting)
5) stream and heap-of-format-pointers

and probably several other convoluted methods. Which works best depends a great deal upon what your 'application' needs to accomplish. An application that primarily has to DISPLAY the data might work better with 4) or 5), whereas an application that does NOT need to display the data will probably work better with one of the others, and which one of those will depend upon the nature of the processing that's being done. 4) and 5) are more memory efficient, but they make the code more complex.
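To make the difference between 2) and 4) concrete, here is a small sketch that converts a parallel flag stream into "packets" and back. The function names `to_runs` and `from_runs` are hypothetical, and a real packet format would likely carry more than one integer of formatting per block.

```python
from itertools import groupby


def to_runs(text, flags):
    """Collapse per-character flags into (flag, chunk) packets of common formatting."""
    runs = []
    pos = 0
    for flag, group in groupby(flags):
        n = len(list(group))  # number of consecutive characters sharing this flag
        runs.append((flag, text[pos:pos + n]))
        pos += n
    return runs


def from_runs(runs):
    """Expand packets back into the parallel text/flags representation."""
    text = "".join(chunk for _, chunk in runs)
    flags = [flag for flag, chunk in runs for _ in chunk]
    return text, flags


# 1 stands for "bold" in this toy example.
runs = to_runs("a bold word", [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0])
```

The packet form stores one flag value per block rather than one per character, which is where the memory saving comes from; the price is that every text operation has to split and merge blocks.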

For what I THINK you're trying to accomplish (primarily file format shifting of fairly simple text files), what you suggest should work quite well, since memory usage is generally no longer such an issue. When memory was less ... abundant, code simplicity was often the sacrificial lamb to memory usage.
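A minimal sketch of the parallel-streams class might look like the following. The class name, its methods, and the rule that replacement text inherits the flags of the first character it replaces are all assumptions made for illustration, not pacify.py's actual design.

```python
class FormattedText:
    """Plaintext plus one integer of formatting flags per character."""

    BOLD = 1    # hypothetical flag values
    ITALIC = 2

    def __init__(self, text, flags=None):
        self.text = text
        self.flags = list(flags) if flags is not None else [0] * len(text)
        if len(self.flags) != len(self.text):
            raise ValueError("text and flags must have the same length")

    def __len__(self):
        return len(self.text)

    def __getitem__(self, key):
        if not isinstance(key, slice):
            raise TypeError("only slices are supported in this sketch")
        # Slicing the text slices the flags identically.
        return FormattedText(self.text[key], self.flags[key])

    def __add__(self, other):
        return FormattedText(self.text + other.text, self.flags + other.flags)

    def mark(self, start, end, flag):
        """Set a flag on characters start..end-1."""
        for i in range(start, end):
            self.flags[i] |= flag

    def replace(self, old, new):
        """Like str.replace, keeping the flags aligned with the text."""
        if not old:
            raise ValueError("old must be non-empty")
        out_text, out_flags = [], []
        pos = 0
        while True:
            hit = self.text.find(old, pos)
            if hit == -1:
                break
            out_text.append(self.text[pos:hit])
            out_flags.extend(self.flags[pos:hit])
            out_text.append(new)
            # Assumption: new text takes the flags of the first replaced character.
            out_flags.extend([self.flags[hit]] * len(new))
            pos = hit + len(old)
        out_text.append(self.text[pos:])
        out_flags.extend(self.flags[pos:])
        return FormattedText("".join(out_text), out_flags)


doc = FormattedText("a  bold  word")
doc.mark(3, 7, FormattedText.BOLD)
clean = doc.replace("  ", " ")  # collapse double spaces without touching formatting
```

After the replace, `clean.text` is "a bold word" and the bold flag still sits on exactly the four characters of "bold", with no escaping of formatting codes needed during the text operation.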