Old 09-01-2009, 11:31 AM   #21
ahi
Quote:
Originally Posted by ekaser View Post
Are you aiming this completely at English? If not, if you think you or someone else might want to adapt it to other languages at some point, you might want to use WORD arrays from the start rather than BYTE arrays, so that UNICODE or other character sets could be adopted at some point more easily. That would also give you a few more "formatting options" with 16 flags instead of just 8.

Every data storage method has its advantages and disadvantages. For this type of data-stream/formatting combination, you've pretty much got:
1) in-stream (data and formatting mixed in same stream of bytes)
2) parallel streams (what you're considering)
3) in-stream flags (a combo of 1) and 2) with wider 'bytes' (WORDS or DWORDS) with flags in the upper bits.
4) packets (blocks of text with common formatting)
5) stream and heap-of-format-pointers

and probably several other convoluted methods. Which works best depends a great deal on what your 'application' needs to accomplish. An application that primarily has to DISPLAY the data might work better with 4) or 5), whereas an application that does NOT need to display the data will probably work better with one of the others; which one depends on the nature of the processing being done. 4) and 5) are more memory-efficient, but more complex in code.

For what I THINK you're trying to accomplish (primarily file-format shifting of fairly simple text files), what you suggest should work quite well, since memory usage is generally no longer such an issue. When memory was less ... abundant, code complexity was often the sacrificial lamb offered up to memory usage.
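A minimal Python sketch of the "parallel streams" layout (option 2) — the flag bits, byte values, and the run-splitting helper here are illustrative assumptions, not anything specified in the thread:

```python
# Parallel streams: one bytearray of text, and a same-length
# bytearray of per-byte format flags kept alongside it.
BOLD = 0x01
ITALIC = 0x02

text = bytearray(b"Hello, world")
flags = bytearray(len(text))      # one flag byte per text byte, all 0

# Mark "world" (bytes 7..11) as bold by OR-ing the flag bit in.
for i in range(7, 12):
    flags[i] |= BOLD

def runs(text, flags):
    """Collapse the two streams into (substring, flagbyte) runs
    of uniform formatting -- handy for emitting RTF/HTML spans."""
    out = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or flags[i] != flags[start]:
            out.append((text[start:i].decode("ascii"), flags[start]))
            start = i
    return out

print(runs(text, flags))  # [('Hello, ', 0), ('world', 1)]
```

The appeal of this layout is that the text stream stays clean (searchable, hashable) while formatting edits never shift byte offsets.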
I oversimplified without saying so. I am in fact aiming for this to be reasonably international... or at least to have the potential to be so.

I am using UTF-8 presently... but might ultimately need to switch to using lists of code points carte blanche instead of strings, as some of my processing needs involve CJK Extension B Chinese characters (Plane 2, beyond the BMP) that take 4 bytes in UTF-8 to represent a single display character... and Python, on a narrow Unicode build, treats each of them as two separate characters (a surrogate pair). If I go this route, doubtless I will be choosing reliability over speed in a big way.
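The behaviour in question can be seen directly (shown here with modern Python 3 syntax; U+20000 is an arbitrary CJK Extension B code point used as an example):

```python
# A CJK Extension B character: Plane 2, code point U+20000.
ch = "\U00020000"

# In UTF-8, any code point at or above U+10000 encodes as 4 bytes:
print(len(ch.encode("utf-8")))       # 4

# A wide build (and all of Python 3) sees one character...
print(len(ch))                       # 1

# ...but narrow builds stored it as a UTF-16 surrogate pair,
# which is where the "two separate characters" behaviour came from:
print(len(ch.encode("utf-16-le")) // 2)   # 2 UTF-16 code units

# An explicit list of code points sidesteps the ambiguity entirely:
print([ord(c) for c in "ab" + ch])   # [97, 98, 131072]
```

Working on a list of integer code points trades memory and speed for the guarantee that one list element is always one display character.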

And actually, while you're here: though memory shouldn't be an issue (I have 3 GB of RAM and 32 GB of swap space under my Linux setup), I keep getting Python MemoryErrors when trying to process RTF files between 400 MB and 1 GB in size.

Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it... less than a tenth of the swap is in use before Python dies... so I'd really like to force Python to process these huge behemoths, even if it slows my system to a crawl for minutes or even hours.
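For what it's worth, a 32-bit Python process exhausts its address space somewhere around 2-3 GB no matter how much swap is available, so the usual workaround (not something from this thread) is to stream the file in bounded chunks instead of reading it whole. A hypothetical sketch, with `handle` standing in for whatever per-chunk processing is needed:

```python
import io

def process_in_chunks(f, handle, chunk_size=1 << 20):
    """Read file object f in chunk_size pieces, passing each
    chunk to handle(); peak memory stays ~chunk_size regardless
    of total file size. Returns total bytes processed."""
    total = 0
    for chunk in iter(lambda: f.read(chunk_size), b""):
        handle(chunk)
        total += len(chunk)
    return total

# Usage, with an in-memory stand-in for a huge RTF file:
fake = io.BytesIO(b"x" * (3 * 1024 * 1024 + 5))
sizes = []
n = process_in_chunks(fake, lambda c: sizes.append(len(c)),
                      chunk_size=1024 * 1024)
print(n, sizes)   # 3145733 [1048576, 1048576, 1048576, 5]
```

The caveat is that any state spanning a chunk boundary (an RTF control word split across two reads, say) has to be carried over explicitly, which is where the code complexity ekaser mentions comes back in.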

Any tips for me, ekaser?

- Ahi