MobileRead Forums - View Single Post - pacify.py (Text reformatter / RTF extractor)

ahi · 09-20-2009, 10:41 PM

If you are around, ekaser, I am mostly done with the rearchitecting.

I have a(n admittedly very simple) plugin architecture in place, where basically all functionality (excepting the core classes used by the processing) comes from plugins, each classified as (1) an input plugin, (2) a language plugin, (3) a processing plugin, or (4) an output plugin.
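
A minimal sketch of how a four-kind plugin registry like this might look. None of the names here (`register`, `get_plugin`, `PlainTextInput`, `PlainTextOutput`) come from pacify.py; they are hypothetical and only illustrate the idea of filing every plugin under one of the four kinds.

```python
# Hypothetical sketch: these names are illustrative, not pacify.py's own.
from collections import defaultdict

PLUGIN_KINDS = ("input", "language", "processing", "output")
_registry = defaultdict(dict)  # kind -> {plugin name: plugin class}


def register(kind, name):
    """Class decorator that files a plugin under one of the four kinds."""
    if kind not in PLUGIN_KINDS:
        raise ValueError(f"unknown plugin kind: {kind!r}")

    def decorator(cls):
        _registry[kind][name] = cls
        return cls

    return decorator


def get_plugin(kind, name):
    """Look up a registered plugin class by kind and name."""
    try:
        return _registry[kind][name]
    except KeyError:
        raise LookupError(f"no {kind} plugin named {name!r}") from None


@register("input", "plaintext")
class PlainTextInput:
    def read(self, raw):
        # Turn raw text into a list of lines for later plugins to work on.
        return raw.splitlines()


@register("output", "plaintext")
class PlainTextOutput:
    def write(self, lines):
        # Join the processed lines back into a single newline-terminated string.
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    reader = get_plugin("input", "plaintext")()
    writer = get_plugin("output", "plaintext")()
    print(writer.write(reader.read("one\ntwo")), end="")
```

With this shape, the core only ever asks for "the input plugin named X" or "the output plugin named Y", which is what keeps the core classes free of format-specific code.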

It makes for very pleasantly clean development... albeit somewhat torturous command line handling, as the plugins (and the [command line based] choice of which specific plugins to use at runtime) determine which command line arguments are valid and which are erroneous.
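
One common way to handle this chicken-and-egg problem is a two-pass parse with the standard `argparse` module: a first lenient pass only finds out which plugin was selected, then the selected plugin adds its own options before a second, strict pass. The sketch below assumes a hypothetical `--output` switch and two made-up output plugins with `add_arguments` hooks; it is not how pacify.py actually does it.

```python
import argparse


# Hypothetical plugins: each one declares the options only it understands.
class HtmlOutput:
    @staticmethod
    def add_arguments(parser):
        parser.add_argument("--css", default=None, help="stylesheet to link")


class TextOutput:
    @staticmethod
    def add_arguments(parser):
        parser.add_argument("--width", type=int, default=72, help="wrap column")


OUTPUT_PLUGINS = {"html": HtmlOutput, "text": TextOutput}


def parse_command_line(argv):
    # Pass 1: only learn which plugin was chosen; unknown options are
    # collected by parse_known_args instead of causing an error.
    base = argparse.ArgumentParser(add_help=False)
    base.add_argument("--output", choices=sorted(OUTPUT_PLUGINS), default="text")
    chosen, _unparsed = base.parse_known_args(argv)

    # Pass 2: the chosen plugin declares its options, then we parse strictly,
    # so options that belong to an unselected plugin are rejected.
    full = argparse.ArgumentParser(parents=[base])
    full.add_argument("infile")
    OUTPUT_PLUGINS[chosen.output].add_arguments(full)
    return full.parse_args(argv)


if __name__ == "__main__":
    print(parse_command_line(["--output", "html", "--css", "book.css", "book.rtf"]))
```

The same pattern extends to input, language, and processing plugins: each selected plugin gets a chance to extend the parser before the strict pass.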

I have a plaintext, an RTF, and an HTML input plugin working already fairly well, and a plaintext output and HTML output plugin likewise functional, if a bit immature as yet. The language plugins are handled in such a way that they (1) can somewhat customize the running instance of the pacify class to potentially alter other plugins' behavior, but mostly (2) preprocess the text right after it's read in from the input file in whatever language-specific way, and (3) post-process the text after all other plugins are done but before it is written to the output file.
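
The three language-plugin roles described above can be sketched as three hooks on a base class. Everything here is an assumption made for illustration: the hook names (`configure`, `preprocess`, `postprocess`), the `App` class standing in for the pacify class, and the French example rules.

```python
import re
import unicodedata


class LanguagePlugin:
    """Hypothetical base class; the hook names are illustrative only."""

    def configure(self, app):
        """(1) Optionally adjust the running instance before processing."""

    def preprocess(self, text):
        """(2) Runs right after the input plugin has read the text."""
        return text

    def postprocess(self, text):
        """(3) Runs after all other plugins, just before output."""
        return text


class FrenchPlugin(LanguagePlugin):
    def configure(self, app):
        # Other plugins could consult this setting when emitting quotes.
        app.settings["quotes"] = ("\u00ab", "\u00bb")

    def preprocess(self, text):
        # Compose accented characters so later plugins see one code point each.
        return unicodedata.normalize("NFC", text)

    def postprocess(self, text):
        # Turn an ordinary space before ; : ! ? into a non-breaking space.
        return re.sub(r" ([;:!?])", "\u00a0\\1", text)


def collapse_spaces(text):
    """Stand-in for a processing plugin: squeeze runs of spaces and tabs."""
    return re.sub(r"[ \t]+", " ", text)


class App:
    """Stand-in for the running pacify instance."""

    def __init__(self, language, processors):
        self.settings = {}
        self.language = language
        self.processors = processors
        language.configure(self)

    def run(self, raw_text):
        text = self.language.preprocess(raw_text)
        for process in self.processors:
            text = process(text)
        return self.language.postprocess(text)


if __name__ == "__main__":
    app = App(FrenchPlugin(), [collapse_spaces])
    print(repr(app.run("Quoi   ?")))
```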

Switching my development to Python 3 also got rid of the Unicode-related errors my pacify script previously suffered from, which were difficult to understand and (for me) seemingly impossible to correct definitively.
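
The likely reason is that Python 3 keeps bytes and text strictly apart, so decoding has to happen explicitly at the point where a file is read. A minimal sketch of that boundary follows; the function name `read_text` and the list of encodings to try are assumptions, not anything taken from pacify.py.

```python
def read_text(path, encodings=("utf-8", "cp1252")):
    """Read a file as bytes, then decode once, at the input boundary."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Last resort: latin-1 maps every possible byte to a code point,
    # so this decode cannot raise (though the text may be wrong).
    return raw.decode("latin-1")
```

Everything downstream of such a function works on `str` only, so implicit mixing of encoded and decoded text can no longer happen.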

The only point of (as yet) shame is that I have not had the fortitude to fully implement my crazy text-as-database concept yet. My formatted text string class objects are being manipulated fairly directly.

I probably should bite the bullet and take my time to figure out both the spooling (which I have, to be honest, yet to fully wrap my mind around--any good "idiot's guide" level resources you can point me toward?) and the text-as-database stuff... but I am just too impatient for practical results to do so.
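
If "spooling" here means keeping text in memory up to some size and spilling it to a temporary file on disk beyond that (an assumption; the thread may have something more elaborate in mind), the standard library's `tempfile.SpooledTemporaryFile` does exactly that. A minimal sketch, with `spool_lines` and the size limit as made-up names and values:

```python
import tempfile


def spool_lines(lines, max_in_memory=1_000_000):
    """Write lines to a spool that stays in RAM until it exceeds the limit."""
    spool = tempfile.SpooledTemporaryFile(
        max_size=max_in_memory, mode="w+", encoding="utf-8"
    )
    for line in lines:
        spool.write(line)  # rolls over to a real temp file once too large
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


if __name__ == "__main__":
    with spool_lines(["first line\n", "second line\n"]) as spool:
        for line in spool:
            print(line, end="")
```

The caller reads the spool like any other file object and never needs to know whether the data ended up in memory or on disk.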

On the upside, if and when I do get around to doing that stuff... I should be able to insert the necessary code fairly readily without having to make radical changes in too many places.

I'm not going to upload another version until it's able to produce reasonably useful output... but it's getting closer. I've decided to build categorization into the formatting stream as well... probably not incredibly efficient... but unless it starts to cause problems even with files just dozens of MB in size, I'll probably stick with it for now (and once spooling is implemented, that should take care of the problem altogether). I am also thinking of implementing footnotes/endnotes (and perhaps annotations?) in the formatting stream too... but I'm now thinking I will not bother with links at all. I cannot think of any input documents (other than of the "choose your own adventure" variety, which is fairly rare) where existing links ought necessarily be respected, instead of new links being generated as warranted by the document's structure. (Albeit perhaps in HTML, there should be some ability to interpret links as footnotes when appropriate.)

Just wanted to share where I am and what I've done.

- Ahi
