09-03-2009, 12:58 PM | #46 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Hmmm... it seems like your suggestions would take me back down the road of (effectively) a pre-calculated database being parsed from the input text. Perhaps that means there is great enough merit to the idea that it is worth the pain of implementation?
Certainly, there are processing tasks that are best handled by looping through character by character. But then, I suppose, there are tasks where going word by word or sentence by sentence really would be helpful:

1) Loop through all words to identify instances of words that contain an apostrophe (whether at the beginning, end, or penultimate position) but have not yet been identified as such. Once this is done, and oddities like << 'Tis >> are identified, the quotation-mark smartening function could be made simpler by having it ignore any single quote/apostrophe that is considered part of a word.

2) Loop through all words to find words that have been accidentally run together.

3) Loop through all sentences (in the classical sense) to identify any that end abruptly and may indicate an erroneous paragraph break.

---

You are probably right about keeping line breaks intact, instead of liberally stripping them out.

Regarding footnotes... the only misunderstanding is that I conceive of a footnote as always being tied to a single character; the "*", "**", "1", or whatever footnote mark is an output concern, and the tied-to character is whatever precedes the footnote mark. E.g.: in "exceptional* service", I see the "*" character as something to remove at input time and reinsert at output time, and the footnote would actually be tied to the "l" (last letter of "exceptional") and thus appear immediately after it. This has the benefit of removing all sign of the footnote from the text stream, so it doesn't interfere with processing.

Oh, and the difficult-to-parse sentence about "link start" and "link end" was just my attempt to say that if there are redundant <a href=""> tags in an HTML document, treating them the same way as I treat the formatting would automatically simplify them.

---

Regarding the internationalization...
I think perhaps all I need to do is build the skeleton in such a way that language-specific processing functions in .py files can override the generic processing functions in the main file, and, obviously, ensure that when the language of a given text is known, only the correct language's processing functions are allowed to override. Shouldn't be too difficult, thanks to eval. Sort of a minimalist plug-in-like architecture would result, I suppose.

---

Ultimately, I'm now thinking the way to do things is to have pTome basically contain a linked list of pBlocks, the classification of which could range from "line-break" to "chapter title" to "paragraph", et cetera. The pBlocks in turn would have their text subdivided into one or more pParts, some of which may be classified as "sentence" or "poem line" et cetera, and each of which in turn would contain one or more pItems (which would be, as per your own thinking, my current sort of pStrings or something like it--everything else being mostly for containment and classification/categorization), classified as "space" or "punctuation" or "word" et cetera. And when a change is made, only the specific pBlock, pPart, or pItem would have to be changed, with all levels underneath regenerated via reparsing. Makes sense?

This way, I theoretically ought to be able to loop through all words within all sentences, while still being able to query "what character is before this word?" or "does the sentence previous to this one end with punctuation, the way a proper sentence should?".

- Ahi |
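Not pacify's actual code, but a minimal sketch of that pTome / pBlock / pPart / pItem containment idea (class and attribute names are placeholders for illustration), just to check the shape of it:

```python
class PItem:
    """Smallest unit: a 'word', 'space', or 'punctuation' token."""
    def __init__(self, text, kind):
        self.text = text
        self.kind = kind

class PPart:
    """A 'sentence', 'poem line', etc., containing PItems."""
    def __init__(self, kind, items):
        self.kind = kind
        self.items = items

class PBlock:
    """A 'paragraph', 'chapter title', 'line-break', etc., containing PParts."""
    def __init__(self, kind, parts):
        self.kind = kind
        self.parts = parts

class PTome:
    """The whole document: an ordered list of PBlocks."""
    def __init__(self, blocks):
        self.blocks = blocks

    def words(self):
        # Loop over all words in all sentences, while keeping enough
        # context to ask "what block/sentence does this word sit in?".
        for block in self.blocks:
            for part in block.parts:
                for item in part.items:
                    if item.kind == "word":
                        yield block, part, item
```

With that containment in place, a query like "does the previous sentence end the way a proper sentence should?" becomes a matter of inspecting the last pItem of the preceding pPart.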
09-03-2009, 01:55 PM | #47 | |||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Quote:
Quote:
|
|||
|
09-03-2009, 02:04 PM | #48 | ||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Quote:
It's definitely going to be a monster of sorts... but hopefully with my ideas starting to become increasingly clear and granular, it will end up a tamable monster. Thanks for the sanity checks! I'll give you a shout when there is code! - Ahi |
||
09-12-2009, 03:44 PM | #49 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
I am beginning to make some progress...
Operator overloading is turning out to be tolerably good in Python. Any hints as to how to make a Python class immutable? My googling thus far suggests that there is no sane, simple way that is universal... and in most instances deriving your class from an already immutable one is suggested. Since I do not know everything I'd need to override, it seems saner to do things from scratch... but immutability is necessary, I think. - Ahi |
09-12-2009, 03:58 PM | #50 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Sorry, I'm no help there. I've VERY little (almost the same as 'no') experience with Python. Hopefully someone else can help with this one. (Perhaps you should send a message to Kovid... he seems to do a LOT of work in Python, and may be able to give you a quick, easy answer.)
|
|
09-12-2009, 04:33 PM | #51 |
creator of calibre
Posts: 44,017
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
http://en.wikipedia.org/wiki/Immutable_object#Python
Also, Python has a lot of immutable builtin objects, like tuple and frozenset, that can be inherited from. |
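Picking up both suggestions, a minimal sketch of the two usual approaches (the class names here are placeholders for illustration, not pacify's real classes):

```python
# Approach 1: subclass an immutable builtin (tuple), as suggested above.
# The payload is fixed at construction time via __new__, not __init__.
class PString(tuple):
    """A hypothetical immutable token: (text, kind)."""
    def __new__(cls, text, kind):
        return super(PString, cls).__new__(cls, (text, kind))

    @property
    def text(self):
        return self[0]

    @property
    def kind(self):
        return self[1]


# Approach 2: a from-scratch class that blocks attribute rebinding by
# overriding __setattr__ (the technique behind the Wikipedia example).
class FrozenToken:
    def __init__(self, text, kind):
        # Bypass our own __setattr__ exactly once, during construction.
        object.__setattr__(self, "text", text)
        object.__setattr__(self, "kind", kind)

    def __setattr__(self, name, value):
        raise AttributeError("FrozenToken instances are immutable")
```

The tuple route gets hashing and equality for free; the `__setattr__` route keeps named attributes but requires remembering that `object.__setattr__` can still mutate instances deliberately.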
09-13-2009, 03:49 PM | #52 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Thanks, Kovid. I believe I managed to solve the problem.
Do you have any resources you'd recommend relating to UTF-8 handling/character encoding conversion with Python? I keep bumping into UnicodeDecodeError-style messages... and I have yet to really grasp the elegant way to avoid them. - Ahi |
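Not from the thread, but the pattern that avoids most UnicodeDecodeErrors is to decode bytes to unicode at the input boundary, process text as unicode internally, and encode only on output. A sketch using the stdlib codecs module (function names are hypothetical):

```python
import codecs

def read_text(path, encoding="utf-8"):
    # Decode at the boundary. errors="replace" substitutes U+FFFD for
    # undecodable bytes instead of raising UnicodeDecodeError mid-read.
    with codecs.open(path, "r", encoding=encoding, errors="replace") as f:
        return f.read()

def write_text(path, text, encoding="utf-8"):
    # Encode only when writing back out; internally everything stays unicode.
    with codecs.open(path, "w", encoding=encoding) as f:
        f.write(text)
```

The errors most often come from mixing decoded and undecoded strings in one expression, so keeping the decode/encode steps at the file boundary tends to make them disappear.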
09-14-2009, 12:01 AM | #53 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Not an update/new version yet... but I'm hoping some people might be kind enough to run it on a few .txt and .rtf files and report back whether or not pacify correctly determined the presence or absence of intra-paragraph line breaks in the given file (along with the relevant portion of the log file [the file size and the whitespace-analysis numerical values below "Analyzing text"]).
Run with:

pacify.py -i input.txt -o txt

or

pacify.py -i input.rtf -o latex

- Ahi

P.S.: This is a rewrite, stopped right in the middle of work... barely usable as is, and it does not have the full functionality present in the previous version. It takes only .txt and .rtf for input, and produces only .txt or .tex for output. And there is no support for RTF footnotes. |
09-14-2009, 10:18 AM | #54 | ||||
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
First, a Project Gutenberg TXT file with line breaks:
Quote:
Quote:
Quote:
Quote:
|
||||
09-14-2009, 10:51 AM | #55 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
|
09-14-2009, 11:33 AM | #56 | ||
Addict
Posts: 304
Karma: 2454436
Join Date: Sep 2008
Device: PRS-505, PRS-650, iPad, Samsung Galaxy SII (JB), Google Nexus 7 (2013)
|
Quote:
Quote:
|
||
09-14-2009, 11:49 AM | #57 |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
|
09-14-2009, 12:13 PM | #58 | ||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
This stuff: Quote:
The first one, of course, is "single space" (word breaks). The second one is either the paragraph-break or the line-break whitespace sequence count (the latter if there are intra-paragraph line breaks), and the third one (if there are intra-paragraph line breaks) is the paragraph-break whitespace sequence count.

Usually, if the second value is above 50 and the third value above 3, it indicates the presence of intra-paragraph line breaks... but hard values are not the right way to go. I need to figure out the calculation (perhaps ratios of the whitespace sequence counts?) that yields reliable results in "all" cases, as I suspect there might easily be files out there with intra-paragraph line breaks where the third value would only come to 2.9.

What makes this hard, though, is that the second and third values vary pretty wildly... so I'm not too sure comparisons/ratios between those two are the way. - Ahi |
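Not pacify's actual code, but a sketch of how such whitespace-sequence statistics might be gathered, with a ratio-based classifier bolted on (the 3:1 cutoff is an illustrative guess, not a tuned threshold):

```python
import re
from collections import Counter

def whitespace_stats(text):
    # Count each distinct run of whitespace (" ", "\n", "\n\n", ...),
    # normalizing Windows line endings first so they don't split counts.
    text = text.replace("\r\n", "\n")
    counts = Counter(re.findall(r"\s+", text))
    single_space = counts.get(" ", 0)     # word breaks
    single_newline = counts.get("\n", 0)  # line breaks
    blank_line = counts.get("\n\n", 0)    # paragraph breaks
    return single_space, single_newline, blank_line

def looks_line_broken(text):
    # Heuristic: many single newlines relative to blank lines suggests
    # hard-wrapped (intra-paragraph) line breaks.
    _, nl, blank = whitespace_stats(text)
    return blank > 0 and nl / float(blank) >= 3.0
```

Working with ratios rather than absolute counts at least makes the verdict independent of file size, which may be why the raw second and third values vary so wildly between files.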
||
09-14-2009, 02:35 PM | #59 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Also, a 'newline' followed by a QUOTE is almost certainly the start of a paragraph. Count how many times a 'newline'-quote pair is preceded by another 'newline' (i.e., paragraphs separated by blank lines). The ratio of the number of 'newline'-'newline'-quote instances to the number of 'newline'-quote instances would give a pretty good indication of whether it's a line-break or paragraph-break file. If you use at least 2 or 3 different "statistical rulers" and they all agree (or 2 out of 3), then that's about the best indicator you're going to get. |
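A sketch of that quote-based ruler (the set of opening quote characters recognized here is an illustrative assumption):

```python
import re

# Straight double, straight single, and curly left double quote.
QUOTES = "[\"'\u201c]"

def quote_ruler(text):
    # Count newline+quote pairs, and how many of those are preceded by
    # another newline (i.e., the quote opens after a blank line).
    text = text.replace("\r\n", "\n")
    nl_quote = len(re.findall("\n" + QUOTES, text))
    nl_nl_quote = len(re.findall("\n\n" + QUOTES, text))
    if nl_quote == 0:
        return None  # ruler abstains: no quote ever starts a line
    # Near 1.0: quoted lines sit after blank lines -> paragraph-break file.
    # Near 0.0: quoted lines follow bare newlines -> line-break file.
    return nl_nl_quote / float(nl_quote)
```

Since each ruler can abstain (return None) when its pattern never occurs, a 2-out-of-3 vote over the non-abstaining rulers is straightforward to layer on top.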
|
09-14-2009, 02:38 PM | #60 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
- Ahi |
|
|