09-24-2009, 10:24 AM | #76 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
I suppose I could cop out of this problem by generating no braces at all: \itshape{} and \bfseries{} at the start of formatting sections, and \upshape{} and \mdseries{} at their close.
Not very elegant though... - Ahi |
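The brace-free approach described above could look roughly like this. It is only a sketch: the per-character flag sets and the names `SWITCH_ON`, `SWITCH_OFF` and `to_latex_switches` are assumptions made for illustration, not pacify's actual code. One thing to keep in mind is that these declarations stay in force until they are switched off (or the enclosing group ends), and unlike \textit they add no italic correction.

```python
# Hypothetical flag names: "I" for italic, "B" for bold.
SWITCH_ON = {"I": r"\itshape{}", "B": r"\bfseries{}"}
SWITCH_OFF = {"I": r"\upshape{}", "B": r"\mdseries{}"}


def to_latex_switches(text, flags):
    """Emit font declarations instead of braces.

    `flags` holds one set of flag names per character of `text`.
    """
    out = []
    active = set()
    for char, wanted in zip(text, flags):
        # Switch off whatever is active but no longer wanted.
        for flag, command in SWITCH_OFF.items():
            if flag in active and flag not in wanted:
                out.append(command)
        # Switch on whatever is wanted but not yet active.
        for flag, command in SWITCH_ON.items():
            if flag in wanted and flag not in active:
                out.append(command)
        active = {flag for flag in wanted if flag in SWITCH_ON}
        out.append(char)
    # Switch everything off again at the end of the text.
    out.extend(cmd for flag, cmd in SWITCH_OFF.items() if flag in active)
    return "".join(out)


print(to_latex_switches("abc", [{"I"}, {"I", "B"}, {"B"}]))
# \itshape{}a\bfseries{}b\upshape{}c\mdseries{}
```

No nesting bookkeeping is needed, which is the whole appeal; the price is output that is harder to read and edit by hand.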
09-24-2009, 10:43 AM | #77 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
How would you know the "correct" output is:
Code:
This \textit{is} \textbf{\textit{in}deed} a strange idea!
Code:
This \textit{is \textbf{in}}\textbf{deed} a strange idea!
Of course, the real output, after a LaTeX run, would be indistinguishable. (Note that whether the whitespace must be italic or not may be debatable, but you should probably keep whatever was in the input file.)
I don't know what would be the "canonical" way of dealing with this, but I'd say you'll have to check proper nesting when generating LaTeX code: whenever a feature is deactivated, check if it's the innermost feature (the last one to have been activated). If it is, close the brace; if it isn't, close the inner features' braces, close the brace, and open the inner features again. |
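The nesting check described in this post can be sketched with a stack of open features. This is a minimal sketch, assuming the input is one set of flags per character; the flag names "I" and "B" and the function name `to_latex_nested` are made up for the example, and LaTeX special characters are not escaped.

```python
# Hypothetical flag names mapped to LaTeX commands; dict order decides
# which feature is opened first when several start on the same character.
LATEX_COMMANDS = {"I": r"\textit", "B": r"\textbf"}


def to_latex_nested(text, flags):
    """Emit properly nested \\textit{...}/\\textbf{...} groups.

    `flags` holds one set of flag names per character of `text`.
    """
    out = []
    stack = []  # currently open features, outermost first
    for char, wanted in zip(text, flags):
        stale = [i for i, feature in enumerate(stack) if feature not in wanted]
        if stale:
            # Close braces from the innermost feature down to the outermost
            # one that has been deactivated. Inner features that are still
            # wanted get closed too and are reopened just below.
            for _ in range(len(stack) - stale[0]):
                stack.pop()
                out.append("}")
        for feature, command in LATEX_COMMANDS.items():
            if feature in wanted and feature not in stack:
                stack.append(feature)
                out.append(command + "{")
        out.append(char)
    out.append("}" * len(stack))  # close whatever is still open
    return "".join(out)


text = "This is indeed a strange idea!"
flags = [set() for _ in text]
for i in range(5, 8):      # "is " (including the space)
    flags[i] = {"I"}
for i in range(8, 10):     # "in"
    flags[i] = {"I", "B"}
for i in range(10, 14):    # "deed"
    flags[i] = {"B"}
print(to_latex_nested(text, flags))
# This \textit{is \textbf{in}}\textbf{deed} a strange idea!
```

Working left to right, this yields the second of the two codings shown in the post; the first would need some lookahead to decide that bold should be the outer group.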
09-24-2009, 10:44 AM | #78 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Could you have the program keep track of how many open braces there are, and then, when there's any kind of change, close them all, and then reopen the ongoing ones?
To get: This is indeed a strange idea. Then, you would have:
Code:
This \textit{is }\textit{\textbf{in}}\textbf{deed} a strange idea.
This is a more normal sentence. you'd get:
Code:
This \textit{is a }\textit{\textbf{more}}\textit{ normal} sentence.
Code:
This \textit{is a \textbf{more} normal} sentence.
I would think you'd want to do something similar for HTML anyway, since overlapping rather than nested tags, such as:
Code:
This <i>is <b>in</i>deed</b> a strange idea.
and it would be better to have:
Code:
This <i>is </i><i><b>in</b></i><b>deed</b> a strange idea. |
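The "close them all, then reopen the ongoing ones" idea can be sketched for HTML output as follows. This is only an illustration under assumptions: the input is taken to be one set of tag names per character, and `TAG_ORDER` and `close_and_reopen_html` are names invented for the example.

```python
import html
from itertools import groupby

# Hypothetical fixed nesting order, outermost tag first.
TAG_ORDER = ("i", "b")


def close_and_reopen_html(text, flags):
    """Wrap every run of identically formatted characters on its own.

    `flags` holds one set of tag names per character of `text`.
    """
    pieces = []
    pairs = zip(text, flags)
    for active, group in groupby(pairs, key=lambda pair: frozenset(pair[1])):
        chunk = html.escape("".join(ch for ch, _ in group), quote=False)
        tags = [tag for tag in TAG_ORDER if tag in active]
        opening = "".join(f"<{tag}>" for tag in tags)
        closing = "".join(f"</{tag}>" for tag in reversed(tags))
        pieces.append(opening + chunk + closing)
    return "".join(pieces)


text = "This is indeed a strange idea."
flags = [set() for _ in text]
for i in range(5, 8):      # "is "
    flags[i] = {"i"}
for i in range(8, 10):     # "in"
    flags[i] = {"i", "b"}
for i in range(10, 14):    # "deed"
    flags[i] = {"b"}
print(close_and_reopen_html(text, flags))
# This <i>is </i><i><b>in</b></i><b>deed</b> a strange idea.
```

The output is always well nested because no tag ever spans a formatting change. The cost is more tags than strictly necessary, as in the "more normal sentence" example.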
09-24-2009, 11:09 AM | #79 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
I'm a little surprised that doing the ideal thing seems to be a fairly non-straightforward problem.
I'm happy to report though that the development version I am working on really seems to be free of Unicode errors, and is shaping up to work remarkably well. Thanks to HTML's <H1> ... <H6> tags, pacify.py should be able to convert cleanly formatted HTML files well-nigh directly into PDF via LaTeX.
I'm also on the verge of starting to add interactive processing algorithms (which do clean-up and/or address ambiguous cases after automated processing, and which can be disabled).
The first interactive plugin (or rather interactive portion of a plugin) will be for detecting errors/problems with auto-smartened quotation marks, i.e.: if the number of opening and closing quotation marks does not add up [unless it's a multi-paragraph quotation], or they open/close incorrectly, ask the user for advice on what to do.
The second one I plan to work on will try to autodetect chapter/section/et cetera headers when they are imported from RTF or plaintext files (in which cases they are not as unambiguous as when imported from HTML that uses H1 ... H6).
- Ahi
Last edited by ahi; 09-24-2009 at 11:37 AM. |
|
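The quotation-mark check planned in the post above might start out something like this. It is a rough sketch under assumptions: only curly double quotes are considered, a multi-paragraph quotation is recognised by the next paragraph starting with an opening mark, and the name `quote_problems` is invented here. The interactive part (asking the user) is left out.

```python
OPEN, CLOSE = "\u201c", "\u201d"  # curly double quotation marks


def quote_problems(paragraphs):
    """Return (paragraph index, message) pairs for suspicious paragraphs."""
    problems = []
    for index, paragraph in enumerate(paragraphs):
        depth = 0
        for char in paragraph:
            if char == OPEN:
                if depth:
                    problems.append((index, "opening quote inside an open quotation"))
                depth = 1
            elif char == CLOSE:
                if depth:
                    depth = 0
                else:
                    problems.append((index, "closing quote with no opening quote"))
        if depth:
            # Allowed only if the quotation carries on in the next paragraph.
            following = paragraphs[index + 1] if index + 1 < len(paragraphs) else ""
            if not following.startswith(OPEN):
                problems.append((index, "quotation never closed"))
    return problems
```

Each reported pair would be a point at which to stop and ask the user what to do.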
09-24-2009, 11:19 AM | #80 | ||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
My example was therefore incorrect. But the same problem exists with this slight modification that is legal/plausible within the framework of pacify. Code:
T  h  i  s     i  s     i  n  d  e  e  d     a     s  t  r  a  n  g  e     i  d  e  a  !
-- -- -- -- -- -I -I -I -I BI BI B- B- B- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
Quote:
- Ahi Last edited by ahi; 09-24-2009 at 11:36 AM. |
||
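The character-by-character grid above can be turned into runs of identically formatted text with a few lines of Python. This is a sketch that assumes the flag row carries exactly one two-letter cell per character (it appears to: 30 cells for the 30 characters of the sentence); `parse_flag_row` and `runs` are hypothetical names.

```python
from itertools import groupby


def parse_flag_row(row):
    """Turn cells such as '--', '-I', 'BI', 'B-' into one flag set per character."""
    return [frozenset(cell) - {"-"} for cell in row.split()]


def runs(text, flags):
    """Group consecutive characters that share the same flags."""
    if len(text) != len(flags):
        raise ValueError("need exactly one flag cell per character")
    grouped = groupby(zip(text, flags), key=lambda pair: pair[1])
    return [("".join(ch for ch, _ in group), key) for key, group in grouped]


text = "This is indeed a strange idea!"
row = "-- " * 5 + "-I " * 4 + "BI " * 2 + "B- " * 3 + "-- " * 16
for chunk, chunk_flags in runs(text, parse_flag_row(row)):
    print(repr(chunk), sorted(chunk_flags))
# 'This ' []
# 'is i' ['I']
# 'nd' ['B', 'I']
# 'eed' ['B']
# ' a strange idea!' []
```

Read this way, the italic run ends and the bold run begins mid-word, so the two formats overlap rather than nest.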
09-24-2009, 12:07 PM | #81 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Note that the real intent of my example was to compare:
Code:
\textit{italic} \textbf{\textit{bold italic} bold}
Code:
\textit{italic \textbf{bold italic}} \textbf{bold} |
09-24-2009, 12:29 PM | #82 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
This is a substantially different example now though... with four (space-separated) holistically formatted words, instead of two words with formatting change mid-word for one.
I believe current incarnations of my program would generate:
Code:
\textit{italic} \textbf{\textit{bold italic}} \textbf{bold}
or
Code:
\textit{italic} \textit{\textbf{bold italic}} \textbf{bold}
So... neither, I guess? The formatting preprocessor would "blank" the formatting of the space after the first italic, and the space before the second bold due to a lack of uniform formatting on both sides of the space character.
- Ahi |
09-24-2009, 01:55 PM | #83 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
OK, OK... I only put the spaces there to make it clearer, but it seems I'm just messing it up. This is what I mean:
texttexttext texttexttext
Both look the same, and share the same "formatting bits", but are coded differently (use the "quote" button to see it). |
09-24-2009, 02:05 PM | #84 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
\textit{text\textbf{text}}\textbf{text}
i.e.: the second one, as per your examples above. Why? Simply because our progression is from left to right, I think.
- Ahi |
|
09-24-2009, 03:48 PM | #85 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Of course, as you're aware, there WILL be "improperly formatted" HTML/RTF input files where the italics/bold formatting overlaps and they're not turned off/on cleanly. In that case, you SHOULD convert it to clean formatting, i.e., using ()'s instead of []'s for visuals:
text(I)text(B)text(/I)text(/B) ==> texttexttexttext
should be "cleaned up" to:
text(I)text(B)text(/B)(/I)(B)text(/B) ==> texttexttexttext
At least, IMHO.
|
|
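The clean-up suggested in the post above can be sketched directly on a stream of (I)/(B) style markers. This is a minimal sketch with assumptions: tags use the parenthesised notation from the post, a closing tag that was never opened is simply dropped, and `renest` is a name invented for the example.

```python
import re

# Matches the post's notation: (I), (B), (/I), (/B), ...
TOKEN = re.compile(r"\((/?)([A-Z]+)\)")


def renest(markup):
    """Rewrite overlapping tags so that they nest properly."""
    out, stack, pos = [], [], 0
    for match in TOKEN.finditer(markup):
        out.append(markup[pos:match.start()])  # plain text before the tag
        pos = match.end()
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            stack.append(name)
            out.append(f"({name})")
        elif name in stack:
            reopen = []
            # Close inner tags first, remembering them for reopening.
            while stack[-1] != name:
                inner = stack.pop()
                out.append(f"(/{inner})")
                reopen.append(inner)
            stack.pop()
            out.append(f"(/{name})")
            for inner in reversed(reopen):
                stack.append(inner)
                out.append(f"({inner})")
    out.append(markup[pos:])
    # Close anything the input left open.
    out.extend(f"(/{name})" for name in reversed(stack))
    return "".join(out)


print(renest("text(I)text(B)text(/I)text(/B)"))
# text(I)text(B)text(/B)(/I)(B)text(/B)
```

Input that is already properly nested passes through unchanged.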
09-24-2009, 03:57 PM | #86 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
But yeah... I'll give the format stack solution a shot and see what I manage. - Ahi |
|
09-28-2009, 12:09 PM | #87 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
The .tex output works fine now. I'm moving on to getting the HTML output to work at least as well as the .tex one does.
After that... I want to add some minimal image handling, and some interactive chapter "detection" to help mold the output. Once I have those, I will upload.
In the meantime, does anybody have suggestions with regard to how I should handle tables? Keep in mind that my internal representation is basically plaintext with formatting/classification information attached on a character-by-character basis.
- Ahi |
09-28-2009, 12:25 PM | #88 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Can you flag a character as "beginning a new cell" and/or "beginning a new row"? (Inserting a tab before the former, and a linefeed before the latter may be sufficient for plain text output.)
I think the basic idea of the script is consistent with simply stripping things like the lines and border styles around the tables and between the cells. Losing column alignment is a bit more of a cost, but maybe that can be preserved somehow? |
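The flagging idea in this post could be sketched like this for plain text output. The flag names "cell" and "row" and the function name are assumptions for illustration; the sketch only covers the tab/linefeed insertion and does nothing about column alignment.

```python
def table_to_plaintext(text, flags):
    """Insert a tab before each new cell and a linefeed before each new row.

    `flags` holds one set of marks per character of `text`; the first
    character of a cell carries "cell", the first of a row carries "row".
    """
    out = []
    for index, (char, marks) in enumerate(zip(text, flags)):
        if index:  # nothing to separate before the very first character
            if "row" in marks:
                out.append("\n")
            elif "cell" in marks:
                out.append("\t")
        out.append(char)
    return "".join(out)


# A 2x2 table whose cells hold "a", "b", "c" and "d".
cells = [{"row", "cell"}, {"cell"}, {"row", "cell"}, {"cell"}]
print(repr(table_to_plaintext("abcd", cells)))
# 'a\tb\nc\td'
```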
09-28-2009, 01:46 PM | #89 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Bold and italic text is the sort of thing that one can reasonably assume the source document uses "correctly" (for a reasonably broad definition of "correct"). Alignment tomfoolery, however, is used for different things that *correctly* ought to be handled in different ways.
Just in the eBooks I've been playing around with thus far... Centred text can mean a chapter, a subtitle, a chapter summary, book metadata, et cetera. Right-aligned text can mean an epigraph, a signature, a date, et cetera.
When outputting HTML, arguably the limitations of the output format mean that simply centering or right-aligning the text as it was in the source is good enough. But for LaTeX output, it would be much preferable to handle each of those different things correctly in terms of LaTeX's memoir class.
Admittedly perhaps cell alignment in a table is on par with bold/italic formatting in a paragraph... one can trust that it is correct as is, and needs no context-dependent special handling.
I think I need to rethink how the formatting/classification is handled. (Fortunately it won't be too much work to fix/update.) I think I need to separate formatting from classification (and from footnotes/annotations/et cetera) like I originally intended. Formatting needs to be handled and mangled on its own, unfettered by miscellaneous non-formatting stuff.
I am actually starting to think that the power of pacify will ultimately derive from the simplicity of its approach of dealing with (mostly) one thing at a time: either the text, the formatting, or the content classification.
---
And, to answer your question, yes, marking table structure/table cells in the classification layer/stream is probably the right approach... which takes pacify toward its natural conclusion of using the text and formatting layers to generate the classification layer, but using only the text and classification layers (i.e.: not the formatting layer) for generating its output.
For the simplest stuff (bold/italics) the formatting and classification layers will more or less encode the same information, but the classification layer should ultimately know even chapters, poems, et cetera from regular text.
- Ahi
Last edited by ahi; 09-28-2009 at 01:50 PM. |
|
|