MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ekaser · 09-01-2009, 01:03 PM

Quote:

Originally Posted by ahi

Though memory shouldn't be an issue... I have 3 GB RAM and 32 GB swap space under my Linux setup, I keep getting Python memory errors when trying to process RTF files between 400 MB - 1 GB in size.

Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it... as less than a tenth of the swap is used before Python dies... so I'd really like to force Python to process these huge behemoths... even if it slows my system to a crawl for minutes or even hours.

Sorry, I'm not a Python expert (just starting on it, really; I'm a long-time C guy), so I can't help you with Python itself. But the old, tried-and-true method is to not read the whole thing in at once. Read it in chunks: process a chunk, and when you get "close to the end" of it, move the unprocessed remainder up to the front, refill the queue with the next chunk from the file, and keep going. Of course, that works better with some 'things' than others, but I would think it would work reasonably well with .rtf text files, which are pretty linear beasts.

You might have to keep around a 'stack' of "open blocks" whose opening text has long since been flushed from the processing queue, so that you know what's pending when you reach the end of such a block, but probably not. If you make the processing queue sufficiently large (4M? 8M? 16M? 32M? Any of those would probably be plenty big and would still avoid the "memory issues"), then you could update/refill the queue at opportune moments.

In managing the queue, you can either move the unused portion up and then refill from there to the end of the queue, or just keep pointers to the start and end of the unprocessed portion. In the second case, refilling the queue involves two reads: the first fills the free space at the tail end of the queue, and the second fills the free space at the front. If/when speed of processing is not an issue (which I don't think it is in your case), move-and-fill is preferred, because it makes the rest of the code MUCH simpler. With "rotating pointers", you're constantly checking for reaching the end of the queue, whether the end pointer is greater or less than the start pointer, and such. PITA.
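
For what it's worth, here is a rough Python sketch of the move-and-fill idea described above. It is not taken from pacify.py; the names `move_and_fill`, `consume`, and `count_marker` are made up for illustration, and counting a byte marker such as `\par` is only a stand-in for real RTF processing. The assumption is that the file is opened in binary mode and that the processing step can say how many leading bytes it has finished with.

```python
import io


def move_and_fill(stream, consume, buf_size=8 * 1024 * 1024):
    """Feed a binary stream to consume() through one bounded buffer.

    consume(buf, at_eof) is a hypothetical callback: it returns how many
    leading bytes it is done with. It must not keep a reference to buf.
    """
    buf = bytearray()
    while True:
        chunk = stream.read(buf_size - len(buf))  # "fill": top up free space
        at_eof = not chunk
        buf += chunk
        if not buf:
            return  # nothing left to read and nothing left to process
        used = consume(buf, at_eof)
        stalled = used == 0 and (at_eof or len(buf) >= buf_size)
        if not 0 <= used <= len(buf) or stalled:
            raise ValueError("consume() stalled or overran the buffer")
        del buf[:used]  # "move": unprocessed tail slides to the front


def count_marker(stream, marker, buf_size=8 * 1024 * 1024):
    """Count non-overlapping occurrences of marker, even across chunk edges."""
    count = 0

    def consume(buf, at_eof):
        nonlocal count
        pos = 0
        while True:
            hit = buf.find(marker, pos)
            if hit < 0:
                break
            count += 1
            pos = hit + len(marker)
        # Hold back a tail that could be the start of a split marker.
        keep = 0 if at_eof else len(marker) - 1
        return max(pos, len(buf) - keep)

    move_and_fill(stream, consume, buf_size)
    return count


if __name__ == "__main__":
    sample = io.BytesIO(b"one\\par two\\par three")
    print(count_marker(sample, b"\\par", buf_size=8))
```

With a real file you would pass something like `open(path, "rb")` instead of the `io.BytesIO` sample. Memory use stays near `buf_size` no matter how large the file is, which is the whole point. A "rotating pointers" version would avoid the `del buf[:used]` copy, at the cost of the wrap-around bookkeeping mentioned above.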