Thanks for shared insight. I will look into this matter when I start my testing-spree for realz. My parser makes loads of assumptions about the text to be parsed, for one, it expects to parse a litterary book (this will be easily extendable for a coder in need for a 'bible parsing mode', or whatever; eg., the assumptions of the logical structure of the text are [hopefully] pretty much separated from the rest of the code).
I have another reason for this agressive approach. My reader's software dictionary is rendered unusable by wrongly hyphened words, and I read in a couple of languages not my own [This is in fact the sole reason I'm writing this program. Some fucked up laws and royalties prevents me from buying e-books, and most... ehrm... less commercial books, comes as PDFs]. So rather too few hyphens than too many.
I'm not a Python programmer either, in fact this is my first Python project =)
|