pacify.py (Text reformatter / RTF extractor) - Page 2

ahi · 09-01-2009, 09:48 AM

Quote:

Originally Posted by Jellby

No, it was an example of a likely false positive with the rules above

Granted, you can eliminate false positives with your two pass method, but there could be literally hundreds of them, often many more than real "internal dialogue" phrases.

As for false negatives, I often find dialogues (internal or not) that just omit the "he said", "she thought", etc. words. One should also look for "he said to himself" or "he wondered", or "he secretly admitted", etc.

An automated tool can be of some help, but the danger is letting the user rely solely on the tool, which can be worse than just leaving the "internal dialogues" unformatted. Similarly, when I see curly quotes wrongly oriented I would prefer they had been left as straight quotes instead.

You are completely right.

My hope is to make pacify into a tool that can alleviate a lot of eBook preparation "monkey work" from my shoulders. But yes, it's important to be clear which operations are reasonably likely to be foolproof, and which ones are sure to require thorough looking-over by a human being.

If somebody wants all internal dialogue typeset though... even fixing just 50% through poor pattern-matching/semi-automation reduces the outstanding work considerably.

Not something I'd ever do for myself, to be honest... but, so long as one understands what one is looking for, and what the limitations are, probably not hard to throw together something that would be helpful.

- Ahi

ahi · 09-01-2009, 10:18 AM

Can I run something by you, Jellby? (And by whoever else may be reading.)

A lot of what I am trying to do with pacify.py is going to be text processing... but at the same time, I do want to be able to handle some light formatting--bold, italics, maybe a bit more.

Unfortunately any obvious/straightforward way of handling formatting interferes with the straightforwardness of any text processing. e.g.: Once I have html tags, or html entities, or latex commands in there... it begins to get harder to find out what the first character of the subsequent paragraph is, for example, on account of having to escape the formatting portions.

I have a vague idea in my head about creating a class in python that would facilitate both formatting and text processing concerns, by keeping content in the following manner:

For any string of length X, it would store two strings of length X. The first stored string would be the plaintext, the second stored string would be byte-long bitfields that provide formatting information.

Or, to give a dumbed-down view, instead of:

Code:

Isn't <i>that</i> the reason we're <b>here</b>?

would be:

Code:


String 1: Isn't that the reason we're here?
String 2: 000000IIII000000000000000000BBBB0

And then any operation done on the plaintext (via the class's methods) would perform the equivalent operation on the formatting string. This way content and formatting could be dealt with separately without having to painstakingly escape formatting instructions for any text-processing operation.
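A minimal sketch of what such a class might look like, purely as an illustration of the idea. The class name `FormattedText`, the flag values, and the rule that replacement text inherits the formatting of the first replaced character are all assumptions made for this example, not anything pacify.py actually does.

```python
class FormattedText:
    """Plaintext plus a parallel list of per-character format flags."""

    ITALIC = 0b01  # hypothetical flag values
    BOLD = 0b10

    def __init__(self, text, flags=None):
        self.text = text
        self.flags = list(flags) if flags is not None else [0] * len(text)
        if len(self.flags) != len(self.text):
            raise ValueError("text and flags must be the same length")

    def __len__(self):
        return len(self.text)

    def __getitem__(self, key):
        # Indexing or slicing the text slices the formatting identically.
        if isinstance(key, slice):
            return FormattedText(self.text[key], self.flags[key])
        return FormattedText(self.text[key], [self.flags[key]])

    def __add__(self, other):
        return FormattedText(self.text + other.text, self.flags + other.flags)

    def set_flag(self, start, end, flag):
        """Turn on `flag` for characters start..end-1."""
        for i in range(start, end):
            self.flags[i] |= flag

    def replace(self, old, new):
        """Replace every occurrence of `old`. The new text inherits the flags
        of the first replaced character (an arbitrary choice for this sketch)."""
        if not old:
            raise ValueError("old must be non-empty")
        out_text, out_flags, pos = [], [], 0
        while True:
            hit = self.text.find(old, pos)
            if hit == -1:
                break
            out_text.append(self.text[pos:hit])
            out_flags.extend(self.flags[pos:hit])
            out_text.append(new)
            out_flags.extend([self.flags[hit]] * len(new))
            pos = hit + len(old)
        out_text.append(self.text[pos:])
        out_flags.extend(self.flags[pos:])
        return FormattedText("".join(out_text), out_flags)


s = FormattedText("Isn't that the reason we're here?")
s.set_flag(6, 10, FormattedText.ITALIC)   # "that"
s.set_flag(28, 32, FormattedText.BOLD)    # "here"
r = s.replace("that", "this one")         # formatting follows the text
```

Ordinary string operations such as `s.text.find(...)` or `s.text[0]` see only plaintext, so finding the first character of a paragraph needs no escaping logic.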

What do you think? Any chance that this is a better way than the obvious alternative of using HTML or something similar internally?

- Ahi

Jellby · 09-01-2009, 11:11 AM

I won't claim I have any authority in that field, because I haven't. I have no real experience in programming (other than some scientific samples in fortran). But two thoughts occur to me:

1. Isn't that roughly what the recent patent conflict with MS-Word was about?

2. Wouldn't you need a too large "byte" size for the format string? It's simple for just italic and bold, but how do you store bold-italic? How do you store bold, italic, underlined, red and large size? If your goal is supporting only basic stuff (like just bold and italic) then it's probably fine, but I suspect almost any other alternative would be equally fine...

ekaser · 09-01-2009, 11:12 AM

Quote:

Originally Posted by ahi

A lot of what I am trying to do with pacify.py is going to be text processing... but at the same time, I do want to be able to handle some light formatting--bold, italics, maybe a bit more.

Unfortunately any obvious/straightforward way of handling formatting interferes with the straightforwardness of any text processing.

I have a vague idea in my head about creating a class in python that would facilitate both formatting and text processing concerns, by keeping content in the following manner:

For any string of length X, it would store two strings of length X. The first stored string would be the plaintext, the second stored string would be byte-long bitfields that provide formatting information.

And then any operation done on the plaintext (via the class's methods) would perform the equivalent operation on the formatting string. This way content and formatting could be dealt with separately without having to painstakingly escape formatting instructions for any text-processing operation.

Are you aiming this completely at English? If not, if you think you or someone else might want to adapt it to other languages at some point, you might want to use WORD arrays from the start rather than BYTE arrays, so that UNICODE or other character sets could be adopted at some point more easily. That would also give you a few more "formatting options" with 16 flags instead of just 8.

Every data storage method has its advantages and disadvantages. For this type of data-stream/formatting combination, you've pretty much got:
1) in-stream (data and formatting mixed in same stream of bytes)
2) parallel streams (what you're considering)
3) in-stream flags (a combo of 1) and 2) with wider 'bytes' (WORDS or DWORDS) with flags in the upper bits.
4) packets (blocks of text with common formatting)
5) stream and heap-of-format-pointers

and probably several other convoluted methods. Which works best depends a great deal upon what your 'application' needs to accomplish. An application that primarily has to DISPLAY the data might work better with 4) or 5), whereas an application that does NOT need to display the data will probably work better with one of the others, and which one of them will depend upon the nature of the processing that's being done. 4) and 5) are more memory efficient, but more code complex.

For what I THINK it is you're trying to accomplish (primarily file format shifting of fairly simple text files), then what you suggest should work quite well, since memory usage is generally no longer such an issue. When memory was less ... abundant, then code complexity was often the sacrificial lamb to memory usage.
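For comparison, option 3 from the list above (in-stream flags in the upper bits of a wider cell) could be sketched roughly as follows. The layout is an assumption for illustration: Unicode code points fit in 21 bits, so a cell of at least 32 bits leaves the bits above that free for flags. The names `pack`, `unpack`, and the flag values are made up for the example.

```python
from array import array

CP_BITS = 21                    # Unicode code points (max U+10FFFF) fit in 21 bits
CP_MASK = (1 << CP_BITS) - 1
BOLD, ITALIC = 1, 2             # hypothetical flag values


def pack(text, flags):
    """Store each character and its flags in one integer cell.
    array type 'L' is at least 32 bits wide, leaving 11+ bits for flags."""
    return array("L", ((f << CP_BITS) | ord(c) for c, f in zip(text, flags)))


def unpack(cells):
    """Split the cells back into plaintext and a list of flags."""
    text = "".join(chr(c & CP_MASK) for c in cells)
    flags = [c >> CP_BITS for c in cells]
    return text, flags


cells = pack("hi\U00020000", [BOLD, 0, BOLD | ITALIC])
text, flags = unpack(cells)
```

The trade-off against parallel streams is that plain string methods no longer work directly on the stored data; the text has to be unpacked first.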

ahi · 09-01-2009, 11:18 AM

Quote:

Originally Posted by Jellby

I won't claim I have any authority in that field, because I haven't. I have no real experience in programming (other than some scientific samples in fortran). But two thoughts occur to me:

1. Isn't that roughly what the recent patent conflict with MS-Word was about?

2. Wouldn't you need a too large "byte" size for the format string? It's simple for just italic and bold, but how do you store bold-italic? How do you store bold, italic, underlined, red and large size? If your goal is supporting only basic stuff (like just bold and italic) then it's probably fine, but I suspect almost any other alternative would be equally fine...

Can you tell me more about (1)? I'm oblivious.

Regarding 2, the second "string" could be a list instead, if need be, with the number of the list item corresponding to the byte-position in the plaintext. But a single byte, used as a bitfield, is sufficient for 8 distinct on-or-off states.
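To make the bitfield point concrete, here is a small sketch using the standard library's `enum.IntFlag`. The flag names and bit assignments are assumptions chosen for illustration; combined states such as bold-italic are simply the bitwise OR of the individual flags and still fit in one byte.

```python
from enum import IntFlag


class Fmt(IntFlag):
    # Hypothetical bit assignments; six of the eight available bits are used.
    BOLD = 0x01
    ITALIC = 0x02
    SMALL_CAPS = 0x04
    COLOUR = 0x08       # "has a non-default colour"; which colour is not stored
    SIZE_SMALL = 0x10
    SIZE_LARGE = 0x20


text = "bold-italic"
fmt = bytearray(len(text))              # one format byte per character
for i in range(len(text)):
    fmt[i] = int(Fmt.BOLD | Fmt.ITALIC)  # both flags at once: value 3

first = Fmt(fmt[0])
is_bold = Fmt.BOLD in first
is_small_caps = Fmt.SMALL_CAPS in first
```

Anything that is not a simple on-or-off state, such as an exact colour value, would need either more bits per character or a side table.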

My primary aim at this time is to convert RTF into HTML or LaTeX. Given that some of those RTFs have a lot of extraneous formatting information (usually relating to minimally [and needlessly] varying font-size, and similar things) that would be literally harmful to include in the output in most cases, I would probably focus only on bold, italics, small caps, and colour. With such a combination, the output would be reasonably clean, contain no excess/disruptive (mis)formatting, and yield itself well to trying to figure out what is regular text, and what is something other than.

I should probably include font-size in the formatting list as well... but I'm almost certain I don't need exact font sizes, but rather a more fuzzy determination as to whether the font size is small, regular, or large.

- Ahi

ahi · 09-01-2009, 11:31 AM

Quote:

Originally Posted by ekaser

Are you aiming this completely at English? If not, if you think you or someone else might want to adapt it to other languages at some point, you might want to use WORD arrays from the start rather than BYTE arrays, so that UNICODE or other character sets could be adopted at some point more easily. That would also give you a few more "formatting options" with 16 flags instead of just 8.

Every data storage method has its advantages and disadvantages. For this type of data-stream/formatting combination, you've pretty much got:
1) in-stream (data and formatting mixed in same stream of bytes)
2) parallel streams (what you're considering)
3) in-stream flags (a combo of 1) and 2) with wider 'bytes' (WORDS or DWORDS) with flags in the upper bits.
4) packets (blocks of text with common formatting)
5) stream and heap-of-format-pointers

and probably several other convoluted methods. Which works best depends a great deal upon what your 'application' needs to accomplish. An application that primarily has to DISPLAY the data might work better with 4) or 5), whereas an application that does NOT need to display the data will probably work better with one of the others, and which one of them will depend upon the nature of the processing that's being done. 4) and 5) are more memory efficient, but more code complex.

For what I THINK it is you're trying to accomplish (primarily file format shifting of fairly simple text files), then what you suggest should work quite well, since memory usage is generally no longer such an issue. When memory was less ... abundant, then code complexity was often the sacrificial lamb to memory usage.

I oversimplified without saying so. I am in fact aiming this to be reasonably international... or at least have the potential to be so.

I am using UTF-8 presently... but might ultimately need to switch to literally using lists carte blanche instead of strings, as some of my processing needs relate to CJK Extension B Chinese characters, which take 4 bytes in UTF-8 to represent a single display character... and Python, even in UTF-8 mode, treats them as at least two separate characters. If I go this route, doubtless I will be choosing reliability over speed in a big way.
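A quick check of the sizes involved, using U+20000 (the first code point of CJK Unified Ideographs Extension B) as the sample character. The "two characters" behaviour matches what a narrow Python 2 build reports, since it stores such characters as a UTF-16 surrogate pair; on Python 3.3 and later a string is indexed by code point, so the character counts as one. Which build was in use in the thread is an assumption here.

```python
ch = "\U00020000"                      # U+20000, CJK Extension B

utf8_bytes = len(ch.encode("utf-8"))   # bytes needed in UTF-8
utf16_units = len(ch.encode("utf-16-le")) // 2   # 16-bit units in UTF-16
code_points = len(ch)                  # what len() reports on Python 3.3+

print(utf8_bytes, utf16_units, code_points)

# Iterating a Python 3 string yields whole code points, so a parallel
# format list can stay aligned one entry per character.
text = "a" + ch + "b"
flags = [0] * len(text)
```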

And actually, while you're here, though memory shouldn't be an issue... I have 3 GB RAM and 32 GB swap space under my Linux setup, I keep getting Python memory errors when trying to process RTF files between 400 MB - 1 GB in size.

Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it... as less than a tenth of the swap is used before Python dies... so I'd really like to force Python to process these huge behemoths... even if it slows my system to a crawl for minutes or even hours.

Any tips for me, ekaser?

- Ahi

Jellby · 09-01-2009, 11:36 AM

Quote:

Originally Posted by ahi

Can you tell me more about (1)? I'm oblivious.

I think it was this.

ahi · 09-01-2009, 11:58 AM

Quote:

Originally Posted by Jellby

I think it was this.

Hmmm... one would think it doesn't cover parallel streams in general. What do you think, ekaser?

And the formatting instructions wouldn't even map 1:1 to either HTML or XML... the latter, because, I have no plans to generate XML with pacify at all.

Certainly a bit too similar for comfort, Jellby. Thanks for pointing it out.

- Ahi

ekaser · 09-01-2009, 12:03 PM

Quote:

Originally Posted by ahi

Though memory shouldn't be an issue... I have 3 GB RAM and 32 GB swap space under my Linux setup, I keep getting Python memory errors when trying to process RTF files between 400 MB - 1 GB in size.

Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it... as less than a tenth of the swap is used before Python dies... so I'd really like to force Python to process these huge behemoths... even if it slows my system to a crawl for minutes or even hours.

Sorry, I'm not a Python expert (just starting on it, really, I'm a long-time C guy), so I can't help you with Python. But the old, tried and true method, is to not read the whole thing in all at once, read it in chunks, process that chunk, when you get "close to the end" of the chunk, move it up and refill the queue with the next chunk from the file and keep going. Of course, that works better with some 'things' than others, but I would think it would work reasonably well with .rtf text files, which are pretty linear beasts. You might have to keep around a 'stack' of "open blocks" for text that's long since been flushed from the processing queue, so that you know what's pending when you reach the end of that block in the queue, but probably not. If you make the processing queue sufficiently large (4M? 8M? 16M 32M? any of those would probably be plenty big and would avoid the "memory issues"), then you could update/refill the queue at opportune moments. In managing the queue, you can either move the unused portion up and then refill from there to the end of the queue, or just keep pointers to the start and end of the unprocessed portion, and refilling the queue then involves two reads, the first to fill the tail-end portion of the unfilled queue and the second to fill the front-end unused portion. If/when speed of processing is not an issue (which I don't think it is in your case), then move-and-fill is preferred, because it makes the rest of the code MUCH simpler. With "rotating pointers", you're constantly checking for reaching the end of the queue and whether the end pointer is greater or lesser than the start pointer and such. PITA.
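In Python, the move-and-fill approach described above might look something like the sketch below. The function name `process_in_chunks`, the `handle` callback, and the use of a newline as the safe split point are assumptions for illustration; for RTF a different boundary would be needed, and this sketch does not track open groups that span chunks.

```python
import io


def process_in_chunks(stream, handle, chunk_size=8 * 1024 * 1024, boundary="\n"):
    """Read `stream` piece by piece and pass `handle` only text that ends
    on a boundary, carrying the unprocessed tail over to the next read."""
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        cut = pending.rfind(boundary)
        if cut == -1:
            continue                      # no safe split point yet; keep filling
        end = cut + len(boundary)
        handle(pending[:end])
        pending = pending[end:]           # "move up" the unprocessed tail
    if pending:
        handle(pending)                   # whatever is left at end of file


pieces = []
process_in_chunks(io.StringIO("a\nbb\nccc"), pieces.append, chunk_size=2)
```

Only one chunk plus its carried-over tail is held in memory at a time, so the size of the input file no longer matters much.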

ahi · 09-01-2009, 12:10 PM

Quote:

Originally Posted by ekaser

Sorry, I'm not a Python expert (just starting on it, really, I'm a long-time C guy), so I can't help you with Python. But the old, tried and true method, is to not read the whole thing in all at once, read it in chunks, process that chunk, when you get "close to the end" of the chunk, move it up and refill the queue with the next chunk from the file and keep going. Of course, that works better with some 'things' than others, but I would think it would work reasonably well with .rtf text files, which are pretty linear beasts. You might have to keep around a 'stack' of "open blocks" for text that's long since been flushed from the processing queue, so that you know what's pending when you reach the end of that block in the queue, but probably not. If you make the processing queue sufficiently large (4M? 8M? 16M 32M? any of those would probably be plenty big and would avoid the "memory issues"), then you could update/refill the queue at opportune moments. In managing the queue, you can either move the unused portion up and then refill from there to the end of the queue, or just keep pointers to the start and end of the unprocessed portion, and refilling the queue then involves two reads, the first to fill the tail-end portion of the unfilled queue and the second to fill the front-end unused portion. If/when speed of processing is not an issue (which I don't think it is in your case), then move-and-fill is preferred, because it makes the rest of the code MUCH simpler. With "rotating pointers", you're constantly checking for reaching the end of the queue and whether the end pointer is greater or lesser than the start pointer and such. PITA.

I might have to resort to that... I'm just irked that I have to do any workaround for code that clearly works fine with smaller files and, even with the larger files, is not actually exceeding (or even approaching) my hardware's memory limits.

But you are right. It shouldn't be too hard to do.

- Ahi

ekaser · 09-01-2009, 12:19 PM

Quote:

Originally Posted by ahi

Hmmm... one would think it doesn't cover parallel streams in general. What do you think, ekaser?

And the formatting instructions wouldn't even map 1:1 to either HTML or XML... the latter, because, I have no plans to generate XML with pacify at all.

IANAL, but:
1) I really think it's not close enough to matter.
2) As long as you're not planning on selling it, no one cares.
3) If you do sell it, you're not going to make enough money for anyone to care.
4) If you do sell it and make as much money as Microsoft, YOU won't care.

Lose no sleep. IMHO.

(Software patents spew, but that's just my opinion...)

Jellby · 09-01-2009, 12:24 PM

I basically agree with ekaser. I didn't mention the patent issue to scare you, I just thought the relation was interesting, like a déjà vu sort of thing.

ahi · 09-01-2009, 12:28 PM

Thanks, both of you! And yes, it is interesting.

It just occurred to me that this sort of approach to formatting might also effectively simplify formatting code.

I have RTF files that seem full of unnecessary bold commands... bold being set 5-6 times in a row, for what is basically a single contiguous bolded portion. Converted into this parallel stream format, when time comes to reencode into HTML or LaTeX, that sort of a mess would only generate a single <b>...</b> or \textbf{...}.
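A rough sketch of why the redundancy disappears: once formatting is stored per character, runs of identical flags can be grouped on output, so it no longer matters how many times the source switched bold on. The flag values and the function name `to_html` are assumptions for this example.

```python
from html import escape
from itertools import groupby

BOLD, ITALIC = 1, 2   # hypothetical flag values


def to_html(text, flags):
    """Emit one tag pair per run of identical flags."""
    out, pos = [], 0
    for flag, run in groupby(flags):
        n = len(list(run))
        piece = escape(text[pos:pos + n], quote=False)
        if flag & ITALIC:
            piece = "<i>" + piece + "</i>"
        if flag & BOLD:
            piece = "<b>" + piece + "</b>"
        out.append(piece)
        pos += n
    return "".join(out)


text = "Isn't that the reason we're here?"
flags = [0] * 6 + [ITALIC] * 4 + [0] * 18 + [BOLD] * 4 + [0]
html = to_html(text, flags)
```

This is not fully minimal: a bold-italic run next to a plain bold run would close and reopen `<b>`, so a real writer would probably want to nest tags more carefully.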

Which is fortuitous, because I've also been thinking of how formatting could be refactored so as to remove redundancy.

- Ahi

ekaser · 09-01-2009, 01:29 PM

Quote:

Originally Posted by ahi

It just occurred to me that this sort of approach to formatting might also effectively simplify formatting code.

Very true, a built-in form of TIDY. It might be worthwhile to add HTML input (LaTeX, I don't care about personally, and I'm not sure how many folks really use it or would use it, but that's neither here nor there). Right now you have .TXT and .RTF input and .HTML and LaTeX output. Why not make it 'orthogonal': any of the four as input and any of the four as output? That would, in essence, make it a 'tidy' and conversion program all in one. Just a thought.

EDIT: Note, I'm not suggesting you include a 'general' HTML or LaTeX parser, anymore than you're going to have a general RTF parser, just the "basic stuff" that you want to keep and throw everything else away. Sure, some files it would make a mess of, but those files probably wouldn't be appropriate for this style of conversion either. I'm assuming this is aimed at "simple novel" types of books that don't have a lot of fancy formatting to start with. One thought: since you're taking RTF as input files, some of those will have images (covers, maps, etc), so I'm hoping that those image tags would be maintained along with the bold, italic, etc, right? That would imply the need to be able to include a "numbered mark" in the formatting string. Perhaps if the most significant bit of the formatting 'character' was set, then the lower bits are the 'number' of the image (on the "image stack") that should be inserted at that point. Of course, that then also brings up the question of image positioning: left, center, right.
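The "numbered mark" idea could be sketched like this, assuming one format byte per character: if the top bit is set, the low seven bits are an index into a list of images, otherwise the byte holds ordinary format flags. The names `IMAGE_MARK`, `decode`, and the use of U+FFFC as the placeholder character in the text stream are assumptions for illustration.

```python
IMAGE_MARK = 0x80      # top bit set: this cell refers to an image
INDEX_MASK = 0x7F      # low seven bits: index into the image list (0..127)
PLACEHOLDER = "\ufffc" # object replacement character standing in for the image


def decode(cell):
    """Return ('image', index) or ('format', flags) for one format byte."""
    if cell & IMAGE_MARK:
        return ("image", cell & INDEX_MASK)
    return ("format", cell)


images = ["cover.jpg", "map.png"]          # hypothetical image stack
text = PLACEHOLDER + "Chapter 1"
fmt = bytearray(len(text))
fmt[0] = IMAGE_MARK | 0                    # first character is image number 0

kind, value = decode(fmt[0])
```

With a single byte this caps a document at 128 images and leaves no room for positioning (left, center, right), which would need a wider cell or a side table.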

ahi · 09-01-2009, 01:42 PM

Quote:

Originally Posted by ekaser

Very true, a built-in form of TIDY. It might be worthwhile to add HTML input (LaTeX, I don't care about personally, and I'm not sure how many folks really use it or would use it, but that's neither here nor there). Right now you have .TXT and .RTF input and .HTML and LaTeX output. Why not make it 'orthogonal': any of the four as input and any of the four as output? That would, in essence, make it a 'tidy' and conversion program all in one. Just a thought.

Probably the approach I will take, ekaser.

I've just been itching to get working code and, in retrospect, have churned out messy and architecturally lazy code in order to do so.

I think I need to redo the architecture in order to have a clean "intake" portion that converts input files into the internal format, a processing portion that does whatever needs to be done, and an output portion that converts the internal format into the chosen output format.

Once I have that, a good deal of the existing code can be plugged in there without too much modification, and adding additional input or output formats can also be done without any complication of the existing code.
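One way the intake / processing / output split could be laid out, as a bare-bones sketch. The registries, the `convert` function, and the `(text, flags)` tuple as the internal format are all assumptions for illustration; the sample HTML writer ignores the flags entirely to keep the example short.

```python
from html import escape

READERS = {}   # format name -> function(raw input) -> (text, flags)
WRITERS = {}   # format name -> function(text, flags) -> output string


def register(table, name):
    """Decorator that files a reader or writer under a format name."""
    def decorator(func):
        table[name] = func
        return func
    return decorator


@register(READERS, "txt")
def read_txt(raw):
    return raw, [0] * len(raw)


@register(WRITERS, "txt")
def write_txt(text, flags):
    return text


@register(WRITERS, "html")
def write_html(text, flags):
    # Flags are ignored in this sketch; a real writer would emit tags.
    return "<p>" + escape(text, quote=False) + "</p>"


def strip_trailing_space(text, flags):
    """Example processing step: trims text and flags together."""
    kept = len(text.rstrip())
    return text[:kept], flags[:kept]


def convert(raw, src, dst, steps=()):
    text, flags = READERS[src](raw)        # intake
    for step in steps:                     # processing
        text, flags = step(text, flags)
    return WRITERS[dst](text, flags)       # output


result = convert("a < b  ", "txt", "html", steps=[strip_trailing_space])
```

Adding a new input or output format then means registering one more function, without touching the processing steps.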

And I can already think of somebody who would be very pleased to have minimalist RTFs generated from existing more complex ones.

- Ahi
