pacify.py (Text reformatter / RTF extractor) - Page 6

ahi · 09-24-2009, 10:24 AM

I suppose I could cop out of this problem by generating no braces, and using \itshape{} and \bfseries{} for the start of formatting sections and \upshape{} and \mdseries{} for the closing of formatting sections.

Not very elegant though...

- Ahi

Jellby · 09-24-2009, 10:43 AM

How would you know the "correct" output is :

Code:

This \textit{is} \textbf{\textit{in}deed} a strange idea!

and not:

Code:

This \textit{is \textbf{in}}\textbf{deed} a strange idea!

?

Of course, the real output, after a LaTeX run, would be indistinguishable. (Note that whether the whitespace must be italic or not may be debatable, but you should probably keep whatever was in the input file)

I don't know what would be the "canonical" way of dealing with this, but I'd say you'll have to check proper nesting when generating LaTeX code: Whenever a feature is deactivated, check if it's the innermost feature (the last one to have been activated); if it is, close the brace; if it isn't, close the inner features' braces, close the brace, and open the inner features again.
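The nesting check described above can be sketched in Python roughly as follows. This is only an illustration, not pacify.py's actual code: the function name `to_latex`, the `COMMANDS` mapping, and the per-character two-letter codes ("-I", "BI", "B-", "--") are assumptions made for the example, and LaTeX special characters are not escaped.

```python
# Hypothetical mapping from a one-letter feature flag to its LaTeX command.
COMMANDS = {"I": r"\textit", "B": r"\textbf"}


def to_latex(text, codes):
    """Render text whose i-th character carries the flags in codes[i].

    Codes such as "-I", "BI", "B-", "--" are assumed; "-" means "off".
    """
    out = []
    stack = []  # open features, outermost first
    for ch, code in zip(text, codes):
        active = set(code) - {"-"}
        # Stack positions of features that have just been deactivated.
        dead = [i for i, feat in enumerate(stack) if feat not in active]
        if dead:
            # Close everything from the outermost dead feature inwards...
            popped = stack[dead[0]:]
            del stack[dead[0]:]
            out.append("}" * len(popped))
            # ...then reopen the inner features that are still active.
            for feat in popped:
                if feat in active:
                    out.append(COMMANDS[feat] + "{")
                    stack.append(feat)
        # Open features that have just been activated.
        for feat in sorted(active - set(stack)):
            out.append(COMMANDS[feat] + "{")
            stack.append(feat)
        out.append(ch)
    out.append("}" * len(stack))  # close whatever is still open at the end
    return "".join(out)


text = "This is indeed a strange idea!"
codes = ["--"] * 5 + ["-I"] * 3 + ["BI"] * 2 + ["B-"] * 4 + ["--"] * 16
print(to_latex(text, codes))
```

With "is " italic, "in" bold italic and "deed" bold, this sketch produces the second of the two forms above, because the feature activated first ends up outermost.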

frabjous · 09-24-2009, 10:44 AM

Could you have the program keep track of how many open braces there are, and then, when there's any kind of change, close them all, and then reopen the ongoing ones?

To get:

This is indeed a strange idea.

Then, you would have:

Code:

This \textit{is }\textit{\textbf{in}}\textbf{deed} a strange idea.

The downside would be that for properly nested elements, e.g.:

This is a more normal sentence.

you'd get:

Code:

This \textit{is a }\textit{\textbf{more}}\textit{ normal} sentence.

rather than the more elegant:

Code:

This \textit{is a \textbf{more} normal} sentence.

But I think that's OK.

I would think you'd want to do something similar for HTML anyway, since overlapping rather than nested tags, such as:

Code:

This <i>is <b>in</i>deed</b> a strange idea.

...although some browsers may support it, is not considered proper HTML, and is definitely invalid XHTML. (Or at least W3's HTML validator says so.)

and it would be better to have:

Code:

This <i>is </i><i><b>in</b></i><b>deed</b> a strange idea.
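A rough Python sketch of the "close everything, then reopen the ongoing ones" idea for HTML output follows. It is an illustration under assumptions, not the program's code: `to_html`, `TAGS`, and the per-character codes ("-I", "BI", "B-", "--") are hypothetical names chosen for the example.

```python
from html import escape
from itertools import groupby

# Hypothetical mapping from a one-letter feature flag to an HTML tag name.
TAGS = {"I": "i", "B": "b"}


def to_html(text, codes):
    """Wrap every run of identically formatted characters in its own tags."""
    out = []
    order = []  # features of the previous run, outermost first
    pos = 0
    # Group consecutive characters that share the same set of active features.
    for active, run in groupby(codes, key=lambda c: frozenset(c) - {"-"}):
        length = len(list(run))
        chunk = text[pos:pos + length]
        pos += length
        # Ongoing features keep their previous order; new ones go innermost.
        order = [f for f in order if f in active] + sorted(active - set(order))
        out.append("".join(f"<{TAGS[f]}>" for f in order))
        out.append(escape(chunk))
        out.append("".join(f"</{TAGS[f]}>" for f in reversed(order)))
    return "".join(out)


text = "This is indeed a strange idea."
codes = ["--"] * 5 + ["-I"] * 3 + ["BI"] * 2 + ["B-"] * 4 + ["--"] * 16
print(to_html(text, codes))
```

Every change of formatting starts a fresh, fully closed group, so the output is always properly nested, at the cost of the redundant close/reopen pairs noted above for input that was already nested.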

ahi · 09-24-2009, 11:09 AM

Quote:

Originally Posted by frabjous

Could you have the program keep track of how many open braces there are, and then, when there's any kind of change, close them all, and then reopen the ongoing ones?

Yeah, that might be the trick. It still feels like a bit of a cop-out... but (EDIT->) not a terrible one.

I'm a little surprised that doing the ideal thing seems to be a fairly non-straightforward problem.

I'm happy to report though that the development version I am working on really seems to be free of unicode errors, and is shaping up to work remarkably well.

Thanks to HTML's <H1> ... <H6> tags, pacify.py should be able to convert cleanly formatted HTML files well-nigh directly into PDF via LaTeX.

I'm also on the verge of starting to add interactive processing algorithms... (which do clean-up and/or address ambiguous cases after automated processing, and which can be disabled)

The first interactive plugin (or rather interactive portion of a plugin) will be for detecting errors/problems with auto-smartened quotation marks.

i.e.: If the numbers of opening and closing quotation marks do not match [unless it's a multi-paragraph quotation], or the marks open/close incorrectly, ask the user for advice on what to do.
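As a rough illustration of that check (not the plugin's actual code), the following Python sketch flags paragraphs whose curly double quotes do not pair up. The function name `suspicious_paragraphs` is made up, nested double quotes are assumed not to occur, and a multi-paragraph quotation is assumed to follow the usual convention that each continued paragraph starts with a fresh opening mark.

```python
OPEN, CLOSE = "\u201c", "\u201d"  # curly opening and closing double quotes


def suspicious_paragraphs(paragraphs):
    """Return indices of paragraphs the user should be asked about."""
    flagged = []
    for idx, para in enumerate(paragraphs):
        is_open = False
        bad = False
        for ch in para:
            if ch == OPEN:
                if is_open:      # a quote opens inside an open quote
                    bad = True
                is_open = True
            elif ch == CLOSE:
                if not is_open:  # a quote closes without having been opened
                    bad = True
                is_open = False
        if is_open and not bad:
            # An unclosed quote is acceptable only if the next paragraph
            # continues it, i.e. starts with an opening mark.
            following = paragraphs[idx + 1] if idx + 1 < len(paragraphs) else ""
            if not following.lstrip().startswith(OPEN):
                bad = True
        if bad:
            flagged.append(idx)
    return flagged
```

An interactive plugin could then show each flagged paragraph to the user rather than guessing at a fix.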

The second one I plan to work on will try to autodetect chapter/section/et cetera headers when they are imported from RTF or plaintext files (in which cases they are not as unambiguous as when imported from HTML that uses H1 ... H6).

- Ahi

ahi · 09-24-2009, 11:19 AM

Quote:

Originally Posted by Jellby

How would you know the "correct" output is :

Code:

This \textit{is} \textbf{\textit{in}deed} a strange idea!

and not:

Code:

This \textit{is \textbf{in}}\textbf{deed} a strange idea!

?

Of course, the real output, after a LaTeX run, would be indistinguishable. (Note that whether the whitespace must be italic or not may be debatable, but you should probably keep whatever was in the input file)

Hmmm... spaces should only remain "formatted" if both the previous and the next character are formatted exactly the same way. (This is taken care of by a formatting normalization plugin... and I failed to indicate this in my much simplified code.)

My example was therefore incorrect. But the same problem exists with this slight modification that is legal/plausible within the framework of pacify.

Code:


T  h  i  s     i  s     i  n  d  e  e  d     a     s  t  r  a  n  g  e     i  d  e  a  !  
-- -- -- -- -- -I -I -I -I BI BI B- B- B- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
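The space rule can be sketched in Python as below. This is an assumption-laden illustration, not the normalization plugin itself: `normalize_spaces` is a hypothetical name, the codes are the two-letter flags shown above, and "formatted exactly the same way" is read here as "both neighbours carry the same code as the space".

```python
def normalize_spaces(text, codes):
    """Blank the formatting of spaces/newlines whose neighbours differ."""
    result = list(codes)
    for i, ch in enumerate(text):
        if ch not in " \n":
            continue
        # Compare against the original codes of the neighbouring characters.
        before = codes[i - 1] if i > 0 else None
        after = codes[i + 1] if i + 1 < len(text) else None
        if not (before == codes[i] == after):
            result[i] = "--"
    return result


# "is" italic, then a formatted space, then "in" bold italic:
print(normalize_spaces("is in", ["-I", "-I", "-I", "BI", "BI"]))
```

In this example the space sits between an italic and a bold-italic character, so its code is blanked to "--"; a space between two italic characters would keep "-I".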

Quote:

Originally Posted by Jellby

I don't know what would be the "canonical" way of dealing with this, but I'd say you'll have to check proper nesting when generating LaTeX code: Whenever a feature is deactivated, check if it's the innermost feature (the last one to have been activated); if it is, close the brace; if it isn't, close the inner features' braces, close the brace, and open the inner features again.

I might give this a try first actually. Thanks, Jellby.

- Ahi

Jellby · 09-24-2009, 12:07 PM

Note that the real intent of my example was to compare:

Code:

\textit{italic} \textbf{\textit{bold italic} bold}

with:

Code:

\textit{italic \textbf{bold italic}} \textbf{bold}

i.e., what is outside the rest? The \textit at the beginning or the \textbf at the end?

ahi · 09-24-2009, 12:29 PM

This is a substantially different example now though... with four (space-separated) holistically formatted words, instead of two words with formatting change mid-word for one.

I believe current incarnations of my program would generate:

Code:

\textit{italic} \textbf{\textit{bold italic}} \textbf{bold}

or

Code:

\textit{italic} \textit{\textbf{bold italic}} \textbf{bold}

So... neither, I guess? The formatting preprocessor would "blank" the formatting of the space after the first italic, and the space before the second bold due to a lack of uniform formatting on both sides of the space character.

- Ahi

Jellby · 09-24-2009, 01:55 PM

OK, OK... I only put the spaces there to make it clearer, but it seems I'm just messing it up. This is what I mean:

texttexttext

texttexttext

Both look the same, and share the same "formatting bits", but are coded differently (use the "quote" button to see it).

ahi · 09-24-2009, 02:05 PM

Quote:

Originally Posted by Jellby

OK, OK... I only put the spaces there to make it clearer, but it seems I'm just messing it up. This is what I mean:

texttexttext

texttexttext

Both look the same, and share the same "formatting bits", but are coded differently (use the "quote" button to see it).

The correct/ideal output would be:

\textit{text\textbf{text}}\textbf{text}

i.e.: the second one, as per your examples above.

Why? Simply because our progression is from left to right, I think.

- Ahi

ekaser · 09-24-2009, 03:48 PM

Quote:

Originally Posted by ahi

The correct/ideal output would be:

\textit{text\textbf{text}}\textbf{text}

i.e.: the second one, as per your examples above.

Why? Simply because our progression is from left to right, I think.

Which certainly generates the 'smallest' output code, and most efficient. To do that, of course, as someone else mentioned, means that you have to keep a "format stack" so that you know which format was turned on in which order, so that you can properly "back them off" in the right order.

Of course, as you're aware, there WILL be "improperly formatted" HTML/RTF input files where the italics/bold formatting overlaps and they're not turned off/on cleanly. In that case, you SHOULD convert it to clean formatting, i.e., using ()'s instead of []'s for visuals:

text(I)text(B)text(/I)text(/B) ==> texttexttexttext

should be "cleaned up" to:

text(I)text(B)text(/B)(/I)(B)text(/B) ==> texttexttexttext

At least, IMHO.

ahi · 09-24-2009, 03:57 PM

Quote:

Originally Posted by ekaser

Which certainly generates the 'smallest' output code, and most efficient. To do that, of course, as someone else mentioned, means that you have to keep a "format stack" so that you know which format was turned on in which order, so that you can properly "back them off" in the right order.

Of course, as you're aware, there WILL be "improperly formatted" HTML/RTF input files where the italics/bold formatting overlaps and they're not turned off/on cleanly. In that case, you SHOULD convert it to clean formatting, i.e., using ()'s instead of []'s for visuals:

text(I)text(B)text(/I)text(/B) ==> texttexttexttext

should be "cleaned up" to:

text(I)text(B)text(/B)(/I)(B)text(/B) ==> texttexttexttext

At least, IMHO.

Remember, ekaser, some of this will happen automagically from (1) the way I keep track of formatting [i.e.: the parallel stream simplifies stuff to begin with] and (2) the formatting normalization plugin [which simplifies stuff a bit further... mostly by blanking formatting for newline characters and spaces standing between non-same formatted other characters].

But yeah... I'll give the format stack solution a shot and see what I manage.

- Ahi

ahi · 09-28-2009, 12:09 PM

The .tex output works fine now. I'm moving on to getting the HTML output to work at least as well as the .tex one does.

After that... I want to add some minimal image handling, and some interactive chapter "detection" to help mold the output.

Once I have those, I will upload.

In the meantime, I'd welcome suggestions on how I should handle tables... keeping in mind that my internal representation is basically plaintext with formatting/classification information attached on a character-by-character basis.

- Ahi

frabjous · 09-28-2009, 12:25 PM

Can you flag a character as "beginning a new cell" and/or "beginning a new row"? (Inserting a tab before the former, and a linefeed before the latter may be sufficient for plain text output.)

I think the basic idea of the script is consistent with simply stripping things like the lines and border styles around the tables and between the cells. Losing column alignment is a bit more of a cost, but maybe that can be preserved somehow?
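A minimal Python sketch of the flagging idea for plain-text output follows. The names are hypothetical and the representation is assumed: a parallel list of marks where "R" flags a character that begins a new row (and therefore also a new cell), "C" flags one that begins a new cell, and "-" means no flag.

```python
def table_to_plaintext(text, marks):
    """Insert a tab before each new cell and a linefeed before each new row."""
    out = []
    for i, (ch, mark) in enumerate(zip(text, marks)):
        if i > 0:  # no separator before the very first cell of the table
            if mark == "R":
                out.append("\n")
            elif mark == "C":
                out.append("\t")
        out.append(ch)
    return "".join(out)


# Two rows of two cells each: "ab" | "c" and "de" | "f".
print(repr(table_to_plaintext("abcdef", ["R", "-", "C", "R", "-", "C"])))
```

Column alignment is not represented at all in this sketch; preserving it would need an extra per-cell attribute alongside the row/cell flags.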

ahi · 09-28-2009, 01:46 PM

Quote:

Originally Posted by frabjous

Can you flag a character as "beginning a new cell" and/or "beginning a new row"? (Inserting a tab before the former, and a linefeed before the latter may be sufficient for plain text output.)

I think the basic idea of the script is consistent with simply stripping things like the lines and border styles around the tables and between the cells. Losing column alignment is a bit more of a cost, but maybe that can be preserved somehow?

Alignment in general is a bit of an issue...

Bold and italic text is the sort of thing that one can reasonably assume that the source document uses "correctly" (for a reasonably broad definition of "correct"). Alignment tomfoolery, however, is used for different things that *correctly* ought to be handled in different ways.

Just in the eBooks I've been playing around with thus far...

Centred text can mean a chapter, a subtitle, a chapter summary, book metadata, et cetera.

Right-aligned text can mean an epigraph, a signature, a date, et cetera.

When outputting HTML, arguably the limitations of the output format mean that simply centering or right-aligning the text as it was in the source is good enough. But for LaTeX output, it would be much preferable to handle each of those different things correctly in terms of LaTeX's memoir class.

Admittedly perhaps cell alignment in a table is on par with bold/italic formatting in a paragraph... one can trust that it is correct as is, and needs no context-dependent special handling.

I think I need to rethink how the formatting/classification is handled. (Fortunately it won't be too much work to fix/update.)

I think I need to separate formatting from classification (and from footnotes/annotations/et cetera) like I originally intended. Formatting needs to be handled and mangled on its own, unfettered by miscellaneous non-formatting stuff.

I am actually starting to think that the power of pacify will ultimately derive from the simplicity of its approach of dealing with (mostly) one thing at a time: either the text, the formatting, or the content classification.

---

And, to answer your question, yes, marking table structure/table cells in the classification layer/stream is probably the right approach... which takes pacify toward its natural conclusion of using the text and formatting layer to generate the classification layer, but using only the text and classification layer (i.e.: not the formatting layer) for generating its output. For the simplest stuff (bold/italics) the formatting and classification layer will more or less encode the same information, but the classification layer should ultimately know even chapters, poems, et cetera from regular text.

- Ahi
