Hmmm... it seems like your suggestions would take me back down the road of (effectively) a pre-calculated database being parsed from the input text. Perhaps that means there is great enough merit to the idea that it is worth the pain of implementation?
Certainly, there are processing tasks that are best handled by looping through character by character. But then, I suppose, there are tasks where going word by word or sentence by sentence really would be helpful...
1) Loop through all words to try to identify instances of words that contain an apostrophe (whether at the beginning, end, or penultimate position) but have not yet been identified as such; a rough sketch of this check follows the list below. Once this is done, and oddities like << 'Tis >> are identified, the quotation mark smartening function could be made simpler by having it ignore any single quote/apostrophe that is considered to be part of a word.
2) Loop through all words to try to find words that have been accidentally run together.
3) Loop through all sentences (in the classical sense) to identify any that end abruptly and may indicate an erroneous paragraph break.
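To make the first item concrete, here is a minimal sketch of the kind of per-word check I have in mind for the apostrophe cases (the helper name and the exact set of apostrophe characters are just illustrative assumptions, not a settled design):

# Flag tokens that carry an apostrophe at the beginning, end, or
# penultimate position, so the quote-smartening code can later ignore
# apostrophes that are considered part of a word.
APOSTROPHES = ("'", "\u2019")  # straight and curly

def apostrophe_position(word):
    """Return 'start', 'end', 'penultimate', or None for a token."""
    for apo in APOSTROPHES:
        if word.startswith(apo):
            return "start"           # e.g. 'Tis, 'twas
        if word.endswith(apo):
            return "end"             # e.g. Jones'
        if len(word) >= 2 and word[-2] == apo:
            return "penultimate"     # e.g. don't, can't
    return None

for token in ["'Tis", "don't", "Jones'", "hello"]:
    print(token, apostrophe_position(token))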
---
You are probably right about keeping line-breaks intact, instead of liberally stripping them out.
Regarding footnotes... the only misunderstanding is that I conceive of a footnote as always being tied to a single character: the "*", "**", "1", or whatever the footnote mark happens to be is an output concern, and the tied-to character is whatever precedes the footnote mark.
e.g.: in "exceptional* service", I see the "*" character as something to remove at input time and reinsert at output time, and the footnote would be actually tied to the "l" (last letter of 'exceptional') and thus appear immediately after it. This has the benefit of removing all sign of the footnote from the text stream, so it doesn't interfere with processing.
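Something along these lines is what I picture for the footnote handling, assuming a single "*" mark for simplicity; the function names here are just mine for illustration:

def strip_footnote_marks(text, mark="*"):
    """Remove footnote marks from the stream, recording the index of the
    tied-to character (the character immediately before each mark)."""
    refs, chars = [], []
    for ch in text:
        if ch == mark:
            refs.append(len(chars) - 1)
        else:
            chars.append(ch)
    return "".join(chars), refs

def reinsert_footnote_marks(text, refs, mark="*"):
    """Put the marks back immediately after their tied-to characters."""
    out = list(text)
    for index in sorted(refs, reverse=True):  # work backwards so earlier indices stay valid
        out.insert(index + 1, mark)
    return "".join(out)

clean, refs = strip_footnote_marks("exceptional* service")
print(clean)                                 # exceptional service
print(reinsert_footnote_marks(clean, refs))  # exceptional* service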
Oh, and the difficult-to-parse sentence about "link start" and "link end" was just my attempt to say that if there are redundant <a href=""> tags in an HTML document, treating them the same way I treat the formatting would automatically simplify them.
---
Regarding the internationalization... I think perhaps all I need to do is to build the skeleton in such a way that language-specific processing functions in .py files can override generic processing functions in the main file. And, obviously, ensure that when the language of a given text is known, only the correct language's processing functions are allowed to override.
Shouldn't be too difficult, thanks to Eval. Sort of a minimalist plug-in-like architecture would result, I suppose.
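Roughly like this, though in practice I might reach for importlib and getattr rather than eval; the module naming convention (lang_fr.py) and the function names below are only assumptions for the sake of the sketch:

import importlib

def smarten_quotes(text):
    """Generic (language-agnostic) implementation; placeholder behaviour."""
    return text

GENERIC_FUNCTIONS = {"smarten_quotes": smarten_quotes}

def load_language_functions(language_code):
    """Return the generic table with any language-specific overrides applied."""
    table = dict(GENERIC_FUNCTIONS)
    try:
        module = importlib.import_module("lang_" + language_code)  # e.g. lang_fr.py
    except ImportError:
        return table  # no plug-in for this language; generics only
    for name in table:
        override = getattr(module, name, None)
        if callable(override):
            table[name] = override  # the language-specific version wins
    return table

funcs = load_language_functions("fr")
print(funcs["smarten_quotes"]('"Bonjour"'))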
---
Ultimately, I'm now thinking the way to do things is to have pTome basically contain a linked list of pBlocks, the classification of which could range from 'line-break' to 'chapter title' to 'paragraph', et cetera.
Then the pBlocks in turn would have their text subdivided into one or more pParts, some of which may be classified as "sentence" or "poem line" et cetera. Each pPart would in turn contain one or more pItems (which would be, as per your own thinking, my current sort of pStrings or something like them--everything else being mostly for containment and classification/categorization), and each pItem would be classified as "space" or "punctuation" or "word" et cetera.
And when a change is made, only the specific pBlock, pPart, or pItem would have to be changed and all levels underneath regenerated via reparsing.
Does that make sense? This way, I theoretically ought to be able to loop through all words within all sentences, while still being able to query "what character comes before this word?" or "does the previous sentence end with punctuation, the way a proper sentence should?"
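Here's a bare-bones sketch of that containment hierarchy, keeping my working names; the classification strings and the little example walk are purely illustrative:

from dataclasses import dataclass, field
from typing import List

@dataclass
class pItem:             # "space", "punctuation", "word", ...
    kind: str
    text: str

@dataclass
class pPart:             # "sentence", "poem line", ...
    kind: str
    items: List[pItem] = field(default_factory=list)

@dataclass
class pBlock:            # "paragraph", "chapter title", "line-break", ...
    kind: str
    parts: List[pPart] = field(default_factory=list)

@dataclass
class pTome:
    blocks: List[pBlock] = field(default_factory=list)

    def words(self):
        """Walk every word while keeping its containing block and part in view."""
        for block in self.blocks:
            for part in block.parts:
                for item in part.items:
                    if item.kind == "word":
                        yield block, part, item

tome = pTome([pBlock("paragraph", [pPart("sentence", [
    pItem("word", "Hello"), pItem("punctuation", ","), pItem("space", " "),
    pItem("word", "world"), pItem("punctuation", ".")])])])
for block, part, word in tome.words():
    print(block.kind, part.kind, word.text)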
- Ahi