Old 09-24-2009, 10:24 AM   #76
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
I suppose I could cop out of this problem by generating no braces at all: emit \itshape{} and \bfseries{} at the start of formatted sections, and \upshape{} and \mdseries{} at the end of them.

Not very elegant though...
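
A minimal Python sketch of this brace-free approach, assuming the formatting is available as one set of flags per character ("I" for italic, "B" for bold). The function name and the data layout are hypothetical, not pacify's actual internals:

```python
def to_latex_switches(text, flags):
    """Emit LaTeX font switches instead of braced commands.

    text  -- the plain text
    flags -- one set per character, e.g. {"I"}, {"B", "I"} or set()
             (an assumed representation, for illustration only)
    """
    on = {"I": r"\itshape{}", "B": r"\bfseries{}"}
    off = {"I": r"\upshape{}", "B": r"\mdseries{}"}
    out = []
    active = set()
    for ch, want in zip(text, flags):
        # Switch off whatever is active but no longer wanted.
        for feat in sorted(active - want):
            out.append(off[feat])
        # Switch on whatever is wanted but not yet active.
        for feat in sorted(want - active):
            out.append(on[feat])
        active = set(want)
        out.append(ch)
    # Reset anything still active at the end of the text.
    for feat in sorted(active):
        out.append(off[feat])
    return "".join(out)


flags = [set("I")] * 3 + [set("BI")] * 2 + [set("B")] * 4
print(to_latex_switches("is indeed", flags))
```

Because the switches are independent of each other, no nesting has to be tracked, which is exactly why it works and also why it feels like a cop-out.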

- Ahi
Old 09-24-2009, 10:43 AM   #77
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
How would you know the "correct" output is:

Code:
This \textit{is} \textbf{\textit{in}deed} a strange idea!
and not:

Code:
This \textit{is \textbf{in}}\textbf{deed} a strange idea!
?

Of course, the real output, after a LaTeX run, would be indistinguishable. (Note that whether the whitespace must be italic or not may be debatable, but you should probably keep whatever was in the input file.)

I don't know what would be the "canonical" way of dealing with this, but I'd say you'll have to check proper nesting when generating LaTeX code: whenever a feature is deactivated, check if it's the innermost feature (the last one to have been activated). If it is, close the brace; if it isn't, close the inner features' braces, close the brace, and open the inner features again.
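
The nesting check described above can be sketched in Python roughly as follows. This assumes one set of flags per character ("I" for italic, "B" for bold); the function name and data layout are made up for illustration and are not taken from pacify:

```python
COMMANDS = {"I": r"\textit{", "B": r"\textbf{"}


def to_latex_nested(text, flags):
    """Generate properly nested \\textit/\\textbf from per-character flags."""
    out = []
    stack = []  # features in activation order; the last one is innermost
    for ch, want in zip(text, flags):
        to_close = [f for f in stack if f not in want]
        for feat in to_close:
            if feat not in stack:
                continue  # already closed while unwinding an outer feature
            reopen = []
            # Close inner features until the deactivated one is innermost.
            while stack[-1] != feat:
                reopen.append(stack.pop())
                out.append("}")
            stack.pop()
            out.append("}")
            # Open the inner features again, if they are still wanted.
            for inner in reversed(reopen):
                if inner in want:
                    stack.append(inner)
                    out.append(COMMANDS[inner])
        # Activate features that are newly wanted.
        for feat in sorted(want - set(stack)):
            stack.append(feat)
            out.append(COMMANDS[feat])
        out.append(ch)
    out.extend("}" for _ in stack)
    return "".join(out)


text = "This is indeed a strange idea!"
flags = ([set()] * 5 + [set("I")] * 3 + [set("BI")] * 2
         + [set("B")] * 4 + [set()] * len(" a strange idea!"))
print(to_latex_nested(text, flags))
# This \textit{is \textbf{in}}\textbf{deed} a strange idea!
```

Working left to right with a stack always yields the second of the two forms above, because the feature that was activated first stays outermost.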
Old 09-24-2009, 10:44 AM   #78
frabjous
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
Could you have the program keep track of how many open braces there are, and then, when there's any kind of change, close them all, and then reopen the ongoing ones?

To get (with "is in" in italics and "indeed" in bold):

This is indeed a strange idea.

Then, you would have:

Code:
This \textit{is }\textit{\textbf{in}}\textbf{deed} a strange idea.
The downside would be that for properly nested elements, e.g. (with "is a more normal" in italics and "more" also in bold):

This is a more normal sentence.

you'd get:
Code:
This \textit{is a }\textit{\textbf{more}}\textit{ normal} sentence.
rather than the more elegant:

Code:
This \textit{is a \textbf{more} normal} sentence.
But I think that's OK.

I would think you'd want to do something similar for HTML anyway, since overlapping rather than nested tags, such as:

Code:
This <i>is <b>in</i>deed</b> a strange idea.
...although some browsers may support it, is not considered proper HTML, and is definitely invalid XHTML. (Or at least the W3C's HTML validator says so.)

and it would be better to have:

Code:
This <i>is </i><i><b>in</b></i><b>deed</b> a strange idea.
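
A minimal Python sketch of this "close everything, then reopen" approach for the HTML case. It assumes one set of flags per character ("I" for italic, "B" for bold); the function name and data layout are hypothetical:

```python
from html import escape
from itertools import groupby


def to_html_runs(text, flags):
    """Wrap every run of identically formatted characters on its own."""
    out = []
    pos = 0
    for want, group in groupby(flags):
        n = len(list(group))
        chunk = escape(text[pos:pos + n])
        pos += n
        # Fixed tag order, so the same formatting is always nested the same way.
        tags = [t for t in ("i", "b") if t.upper() in want]
        opening = "".join(f"<{t}>" for t in tags)
        closing = "".join(f"</{t}>" for t in reversed(tags))
        out.append(opening + chunk + closing)
    return "".join(out)


text = "This is indeed a strange idea."
flags = ([set()] * 5 + [set("I")] * 3 + [set("BI")] * 2
         + [set("B")] * 4 + [set()] * len(" a strange idea."))
print(to_html_runs(text, flags))
# This <i>is </i><i><b>in</b></i><b>deed</b> a strange idea.
```

No stack is needed here, since every run is closed completely before the next one starts; the price is the less elegant output for input that was already properly nested.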
Old 09-24-2009, 11:09 AM   #79
ahi
Quote:
Originally Posted by frabjous View Post
Could you have the program keep track of how many open braces there are, and then, when there's any kind of change, close them all, and then reopen the ongoing ones?
Yeah, that might be the trick. It still feels like a bit of a cop-out... but (EDIT->) not a terrible one.

I'm a little surprised that doing the ideal thing seems to be a fairly non-straightforward problem.

I'm happy to report though that the development version I am working on really seems to be free of Unicode errors, and is shaping up to work remarkably well.

Thanks to HTML's <H1> ... <H6> tags, pacify.py should be able to convert cleanly formatted HTML files well-nigh directly into PDF via LaTeX.

I'm also on the verge of starting to add interactive processing algorithms... (which do clean-up and/or address ambiguous cases after automated processing, and which can be disabled)

The first interactive plugin (or rather interactive portion of a plugin) will be for detecting errors/problems with auto-smartened quotation marks.

i.e.: If the numbers of opening and closing quotation marks do not match [unless it's a multi-paragraph quotation], or the marks open/close in the wrong order, ask the user for advice on what to do.

The second one I plan to work on will try to autodetect chapter/section/et cetera headers when they are imported from RTF or plaintext files (in which cases they are not as unambiguous as when imported from HTML that uses H1 ... H6).
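
A rough Python sketch of such a check, under two assumptions that do not come from pacify itself: quotes have already been smartened to U+201C/U+201D, and a multi-paragraph quotation is recognized by the next paragraph starting with another opening mark. The function name is hypothetical:

```python
OPEN, CLOSE = "\u201c", "\u201d"


def suspicious_paragraphs(paragraphs):
    """Return indexes of paragraphs whose quotation marks look wrong."""
    problems = []
    for idx, para in enumerate(paragraphs):
        is_open = False
        ok = True
        for ch in para:
            if ch == OPEN:
                if is_open:
                    ok = False  # opened twice without closing
                is_open = True
            elif ch == CLOSE:
                if not is_open:
                    ok = False  # closed without being opened
                is_open = False
        if is_open:
            # Tolerate an unclosed quote only if the quotation appears to
            # continue in the next paragraph (assumed convention).
            nxt = paragraphs[idx + 1] if idx + 1 < len(paragraphs) else ""
            if not nxt.lstrip().startswith(OPEN):
                ok = False
        if not ok:
            problems.append(idx)
    return problems


paras = [
    "\u201cHello,\u201d she said.",
    "\u201cFirst part of a long speech.",
    "\u201cSecond part.\u201d",
    "He said \u201dwrong\u201c.",
]
print(suspicious_paragraphs(paras))  # [3]
```

Only the flagged paragraphs would need to be shown to the user in the interactive step.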

- Ahi

Last edited by ahi; 09-24-2009 at 11:37 AM.
Old 09-24-2009, 11:19 AM   #80
ahi
Quote:
Originally Posted by Jellby View Post
How would you know the "correct" output is:

Code:
This \textit{is} \textbf{\textit{in}deed} a strange idea!
and not:

Code:
This \textit{is \textbf{in}}\textbf{deed} a strange idea!
?

Of course, the real output, after a LaTeX run, would be indistinguishable. (Note that whether the whitespace must be italic or not may be debatable, but you should probably keep whatever was in the input file.)
Hmmm... spaces should only remain "formatted" if both the previous and the next character are formatted exactly the same way. (This is taken care of by a formatting normalization plugin... and I failed to indicate this in my much simplified code.)

My example was therefore incorrect. But the same problem exists with this slight modification that is legal/plausible within the framework of pacify.

Code:

T  h  i  s     i  s     i  n  d  e  e  d     a     s  t  r  a  n  g  e     i  d  e  a  !  
-- -- -- -- -- -I -I -I -I BI BI B- B- B- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
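
The space rule described above could be sketched like this in Python. It is only an illustration: the per-character sets of flags ("I", "B") and the function name are assumptions, and the actual normalization plugin may well differ:

```python
def normalize_spaces(text, flags):
    """Blank the formatting of whitespace whose neighbours differ.

    A whitespace character keeps its flags only if the character before it
    and the character after it carry exactly the same flags.
    """
    result = [set(f) for f in flags]
    for i, ch in enumerate(text):
        if not ch.isspace():
            continue
        prev_fmt = flags[i - 1] if i > 0 else None
        next_fmt = flags[i + 1] if i + 1 < len(text) else None
        if prev_fmt is None or prev_fmt != next_fmt:
            result[i] = set()
    return result


text = "a b c"
flags = [set("I"), set("I"), set("I"), set("I"), set("B")]
print(normalize_spaces(text, flags))
```

Here the first space stays italic because both of its neighbours are italic, while the second space is blanked because it sits between an italic and a bold character.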
Quote:
Originally Posted by Jellby View Post
I don't know what would be the "canonical" way of dealing with this, but I'd say you'll have to check proper nesting when generating LaTeX code: whenever a feature is deactivated, check if it's the innermost feature (the last one to have been activated). If it is, close the brace; if it isn't, close the inner features' braces, close the brace, and open the inner features again.
I might give this a try first actually. Thanks, Jellby.

- Ahi

Last edited by ahi; 09-24-2009 at 11:36 AM.
Old 09-24-2009, 12:07 PM   #81
Jellby
Note that the real intent of my example was to compare:

Code:
\textit{italic} \textbf{\textit{bold italic} bold}
with:

Code:
\textit{italic \textbf{bold italic}} \textbf{bold}
i.e., what is outside the rest? The \textit at the beginning or the \textbf at the end?
Old 09-24-2009, 12:29 PM   #82
ahi
This is a substantially different example now though... with four (space-separated) holistically formatted words, instead of two words with formatting change mid-word for one.

I believe current incarnations of my program would generate:

Code:
\textit{italic} \textbf{\textit{bold italic}} \textbf{bold}

or

Code:
\textit{italic} \textit{\textbf{bold italic}} \textbf{bold}

So... neither, I guess? The formatting preprocessor would "blank" the formatting of the space after the first italic, and the space before the second bold due to a lack of uniform formatting on both sides of the space character.

- Ahi
Old 09-24-2009, 01:55 PM   #83
Jellby
OK, OK... I only put the spaces there to make it clearer, but it seems I'm just messing it up. This is what I mean:

texttexttext

texttexttext

(In each line the first "text" is italic, the second bold italic, and the third bold.) Both look the same, and share the same "formatting bits", but are coded differently (use the "quote" button to see it).
Old 09-24-2009, 02:05 PM   #84
ahi
Quote:
Originally Posted by Jellby View Post
OK, OK... I only put the spaces there to make it clearer, but it seems I'm just messing it up. This is what I mean:

texttexttext

texttexttext

Both look the same, and share the same "formatting bits", but are coded differently (use the "quote" button to see it).
The correct/ideal output would be:

\textit{text\textbf{text}}\textbf{text}

i.e.: the second one, as per your examples above.

Why? Simply because our progression is from left to right, I think.

- Ahi
Old 09-24-2009, 03:48 PM   #85
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
Quote:
Originally Posted by ahi View Post
The correct/ideal output would be:

\textit{text\textbf{text}}\textbf{text}

i.e.: the second one, as per your examples above.

Why? Simply because our progression is from left to right, I think.
Which certainly generates the 'smallest' output code, and most efficient. To do that, of course, as someone else mentioned, means that you have to keep a "format stack" so that you know which format was turned on in which order, so that you can properly "back them off" in the right order.

Of course, as you're aware, there WILL be "improperly formatted" HTML/RTF input files where the italics/bold formatting overlaps and they're not turned off/on cleanly. In that case, you SHOULD convert it to clean formatting, i.e., using ()'s instead of []'s for visuals:
text(I)text(B)text(/I)text(/B) ==> texttexttexttext
should be "cleaned up" to:
text(I)text(B)text(/B)(/I)(B)text(/B) ==> texttexttexttext
At least, IMHO.
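
A small Python sketch of this clean-up with a "format stack", using the same toy (I)/(B) notation as the example above. The function name is hypothetical, and a real converter would work on parsed HTML/RTF tokens rather than on a regular expression:

```python
import re

# Matches (I), (/I), (B) and (/B) in the toy notation used above.
TOKEN = re.compile(r"\((/?)([IB])\)")


def clean_overlaps(markup):
    """Rewrite overlapping (I)/(B) markers so that they nest properly."""
    out = []
    stack = []  # currently open formats, innermost last
    pos = 0
    for m in TOKEN.finditer(markup):
        out.append(markup[pos:m.start()])
        pos = m.end()
        closing, tag = m.group(1) == "/", m.group(2)
        if not closing:
            stack.append(tag)
            out.append(f"({tag})")
        elif tag in stack:
            reopen = []
            # Back off the inner formats in reverse order of activation.
            while stack[-1] != tag:
                inner = stack.pop()
                out.append(f"(/{inner})")
                reopen.append(inner)
            stack.pop()
            out.append(f"(/{tag})")
            # Turn the inner formats back on.
            for inner in reversed(reopen):
                stack.append(inner)
                out.append(f"({inner})")
        # A closing marker without a matching opener is silently dropped.
    out.append(markup[pos:])
    return "".join(out)


print(clean_overlaps("text(I)text(B)text(/I)text(/B)"))
# text(I)text(B)text(/B)(/I)(B)text(/B)
```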
Old 09-24-2009, 03:57 PM   #86
ahi
Quote:
Originally Posted by ekaser View Post
Which certainly generates the 'smallest' output code, and most efficient. To do that, of course, as someone else mentioned, means that you have to keep a "format stack" so that you know which format was turned on in which order, so that you can properly "back them off" in the right order.

Of course, as you're aware, there WILL be "improperly formatted" HTML/RTF input files where the italics/bold formatting overlaps and they're not turned off/on cleanly. In that case, you SHOULD convert it to clean formatting, i.e., using ()'s instead of []'s for visuals:
text(I)text(B)text(/I)text(/B) ==> texttexttexttext
should be "cleaned up" to:
text(I)text(B)text(/B)(/I)(B)text(/B) ==> texttexttexttext
At least, IMHO.
Remember, ekaser, some of this will happen automagically from (1) the way I keep track of formatting [i.e.: the parallel stream simplifies stuff to begin with] and (2) the formatting normalization plugin [which simplifies stuff a bit further... mostly by blanking formatting for newline characters and spaces standing between non-same formatted other characters].

But yeah... I'll give the format stack solution a shot and see what I manage.

- Ahi
Old 09-28-2009, 12:09 PM   #87
ahi
The .tex output works fine now. I'm moving on to getting the HTML output to work at least as well as the .tex one does.

After that... I want to add some minimal image handling, and some interactive chapter "detection" to help mold the output.

Once I have those, I will upload.

In the meantime, if anybody has suggestions on how I should handle tables, I'd like to hear them... keeping in mind that my internal representation is basically plaintext with formatting/classification information attached on a character-by-character basis.

- Ahi
Old 09-28-2009, 12:25 PM   #88
frabjous
Can you flag a character as "beginning a new cell" and/or "beginning a new row"? (Inserting a tab before the former, and a linefeed before the latter may be sufficient for plain text output.)
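
A minimal Python sketch of that flagging idea, assuming a per-character list of marks that runs parallel to the text ("row", "cell" or None). Both the mark names and the function name are invented for illustration:

```python
def table_to_plaintext(text, marks):
    """Render flagged table text as tab- and newline-separated plain text.

    marks[i] is "row" if text[i] begins a new row, "cell" if it begins a
    new cell within the current row, and None otherwise.
    """
    out = []
    for i, (ch, mark) in enumerate(zip(text, marks)):
        if mark == "row" and i > 0:
            out.append("\n")  # linefeed before the start of a new row
        elif mark == "cell":
            out.append("\t")  # tab before the start of a new cell
        out.append(ch)
    return "".join(out)


text = "NameAgeAnn31"
marks = [None] * len(text)
marks[0] = "row"    # "Name" starts the first row
marks[4] = "cell"   # "Age" starts a new cell
marks[7] = "row"    # "Ann" starts the second row
marks[10] = "cell"  # "31" starts a new cell
print(table_to_plaintext(text, marks))
```

The character that starts a row is marked only as "row" here, so it does not get an extra tab in front of its first cell.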

I think the basic idea of the script is consistent with simply stripping things like the lines and border styles around the tables and between the cells. Losing column alignment is a bit more of a cost, but maybe that can be preserved somehow?
Old 09-28-2009, 01:46 PM   #89
ahi
Quote:
Originally Posted by frabjous View Post
Can you flag a character as "beginning a new cell" and/or "beginning a new row"? (Inserting a tab before the former, and a linefeed before the latter may be sufficient for plain text output.)

I think the basic idea of the script is consistent with simply stripping things like the lines and border styles around the tables and between the cells. Losing column alignment is a bit more of a cost, but maybe that can be preserved somehow?
Alignment in general is a bit of an issue...

Bold and italic text is the sort of thing that one can reasonably assume that the source document uses "correctly" (for a reasonably broad definition of "correct"). Alignment tomfoolery, however, is used for different things that *correctly* ought to be handled in different ways.

Just in the eBooks I've been playing around with thus far...

Centred text can mean a chapter, a subtitle, a chapter summary, book metadata, et cetera.

Right-aligned text can mean an epigraph, a signature, a date, et cetera.

When outputting HTML, arguably the limitations of the output format mean that simply centering or right-aligning the text as it was in the source is good enough. But for LaTeX output, it would be much preferable to handle each of those different things correctly in terms of LaTeX's memoir class.

Admittedly perhaps cell alignment in a table is on par with bold/italic formatting in a paragraph... one can trust that it is correct as is, and needs no context-dependent special handling.

I think I need to rethink how the formatting/classification is handled. (Fortunately it won't be too much work to fix/update.)

I think I need to separate formatting from classification (and from footnotes/annotations/et cetera) like I originally intended. Formatting needs to be handled and mangled on its own, unfettered by miscellaneous non-formatting stuff.

I am actually starting to think that the power of pacify will ultimately derive from the simplicity of its approach of dealing with (mostly) one thing at a time: either the text, the formatting, or the content classification.

---

And, to answer your question, yes, marking table structure/table cells in the classification layer/stream is probably the right approach... which takes pacify toward its natural conclusion of using the text and formatting layer to generate the classification layer, but using only the text and classification layer (i.e.: not the formatting layer) for generating its output. For the simplest stuff (bold/italics) the formatting and classification layer will more or less encode the same information, but the classification layer should ultimately know even chapters, poems, et cetera from regular text.
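
To make the layering concrete, here is a very rough Python sketch: the classification layer is derived from the formatting layer, and the output is generated from the text and classification layers only. All names (LayeredText, classify, render_latex, "emphasis", "strong") are hypothetical and only cover the bold/italic case:

```python
from dataclasses import dataclass, field
from itertools import groupby


@dataclass
class LayeredText:
    text: str
    formatting: list  # one set of flags per character, e.g. {"B"}
    classification: list = field(default_factory=list)


def classify(doc):
    """Derive the classification layer from the formatting layer."""
    mapping = {"I": "emphasis", "B": "strong"}
    doc.classification = [{mapping[f] for f in flags} for flags in doc.formatting]
    return doc


def render_latex(doc):
    """Generate output from text and classification only."""
    commands = {"emphasis": r"\emph{", "strong": r"\textbf{"}
    out = []
    pairs = zip(doc.text, doc.classification)
    for labels, group in groupby(pairs, key=lambda pair: pair[1]):
        chunk = "".join(ch for ch, _ in group)
        names = sorted(labels)
        out.append("".join(commands[n] for n in names) + chunk + "}" * len(names))
    return "".join(out)


doc = classify(LayeredText("a bold word", [set()] * 2 + [set("B")] * 4 + [set()] * 5))
print(render_latex(doc))  # a \textbf{bold} word
```

Chapters, poems, table cells and so on would simply be further labels in the classification layer, which the output stage could then map to whatever the target format offers.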

- Ahi

Last edited by ahi; 09-28-2009 at 01:50 PM.