09-01-2009, 09:48 AM | #16 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
My hope is to make pacify into a tool that can take a lot of eBook preparation "monkey work" off my shoulders. But yes, it's important to be clear about which operations are reasonably likely to be foolproof, and which ones are sure to require a thorough looking-over by a human being. If somebody wants all internal dialogue typeset, though... even fixing just 50% through poor pattern-matching/semi-automation reduces the outstanding work considerably. Not something I'd ever do for myself, to be honest... but, so long as one understands what one is looking for, and what the limitations are, it's probably not hard to throw together something that would be helpful. - Ahi
|
09-01-2009, 10:18 AM | #17 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Can I run something by you, Jellby? (And by whoever else may be reading.)
A lot of what I am trying to do with pacify.py is going to be text processing... but at the same time, I do want to be able to handle some light formatting--bold, italics, maybe a bit more. Unfortunately, any obvious/straightforward way of handling formatting interferes with the straightforwardness of the text processing. For example, once I have HTML tags, or HTML entities, or LaTeX commands in there, it gets harder to find out what the first character of the subsequent paragraph is, on account of having to skip over the formatting portions.
I have a vague idea in my head about creating a class in Python that would serve both formatting and text processing concerns, by keeping content in the following manner: for any string of length X, it would store two strings of length X. The first stored string would be the plaintext; the second would be a sequence of byte-long bitfields that provide the formatting information for each character. Or, to give a dumbed-down view, instead of: Code:
Isn't <i>that</i> the reason we're <b>here</b>?
it would store: Code:
String 1: Isn't that the reason we're here?
String 2: 000000IIII000000000000000000BBBB0
What do you think? Any chance that this is a better way than the obvious alternative of using HTML or something similar internally? - Ahi |
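A minimal sketch of the class described above, assuming one flag byte per character held in a `bytearray`. The name `FormattedText`, its methods, and the bit values are hypothetical, not pacify's actual code:

```python
class FormattedText:
    """Plain text plus a parallel, per-character formatting stream."""

    ITALIC = 0x01  # hypothetical bit assignments
    BOLD = 0x02

    def __init__(self, text):
        self.text = text
        # One zeroed flag byte per character: no formatting yet.
        self.flags = bytearray(len(text))

    def apply(self, start, end, flag):
        """Switch `flag` on for the characters in text[start:end]."""
        for i in range(start, end):
            self.flags[i] |= flag

    def debug_view(self):
        """Render the flag stream in the style of 'String 2' above."""
        symbols = {0: "0", self.ITALIC: "I", self.BOLD: "B"}
        # Any combination of flags is shown as "X" in this simple view.
        return "".join(symbols.get(f, "X") for f in self.flags)


ft = FormattedText("Isn't that the reason we're here?")
ft.apply(6, 10, FormattedText.ITALIC)   # "that"
ft.apply(28, 32, FormattedText.BOLD)    # "here"
print(ft.text)
print(ft.debug_view())
```

Text processing can then work on `ft.text` alone (for instance, looking at the first character of a paragraph) without having to step around any markup.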
|
09-01-2009, 11:11 AM | #18 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I won't claim to have any authority in that field, because I don't. I have no real experience in programming (other than some scientific samples in Fortran). But two thoughts occur to me:
1. Isn't that roughly what the recent patent conflict with MS-Word was about?
2. Wouldn't you need too large a "byte" size for the format string? It's simple for just italic and bold, but how do you store bold-italic? How do you store bold, italic, underlined, red and large size?
If your goal is supporting only basic stuff (like just bold and italic) then it's probably fine, but I suspect almost any other alternative would be equally fine...
09-01-2009, 11:12 AM | #19 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
Every data storage method has its advantages and disadvantages. For this type of data-stream/formatting combination, you've pretty much got:
1) in-stream (data and formatting mixed in the same stream of bytes)
2) parallel streams (what you're considering)
3) in-stream flags (a combo of 1) and 2), using wider 'bytes' (WORDS or DWORDS) with flags in the upper bits)
4) packets (blocks of text with common formatting)
5) stream and heap-of-format-pointers
and probably several other convoluted methods.
Which works best depends a great deal upon what your 'application' needs to accomplish. An application that primarily has to DISPLAY the data might work better with 4) or 5), whereas an application that does NOT need to display the data will probably work better with one of the others, and which one of them will depend upon the nature of the processing that's being done. 4) and 5) are more memory efficient, but more code complex.
For what I THINK it is you're trying to accomplish (primarily file format shifting of fairly simple text files), what you suggest should work quite well, since memory usage is generally no longer such an issue. When memory was less ... abundant, code complexity was often the sacrificial lamb to memory usage.
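To make the difference between options 2) and 4) concrete, here is a rough sketch that converts a parallel flag stream into packets and back. The function names and the flag value `2` for bold are illustrative assumptions only:

```python
from itertools import groupby


def to_packets(text, flags):
    """Group a parallel flag stream into (flag, substring) packets."""
    packets = []
    pos = 0
    for flag, group in groupby(flags):
        n = len(list(group))              # length of this run of equal flags
        packets.append((flag, text[pos:pos + n]))
        pos += n
    return packets


def from_packets(packets):
    """Rebuild the plain text and its parallel flag list from packets."""
    text = "".join(chunk for _, chunk in packets)
    flags = [flag for flag, chunk in packets for _ in chunk]
    return text, flags


text = "a bold word"
flags = [0] * 2 + [2] * 4 + [0] * 5       # 2 stands for "bold" here
packets = to_packets(text, flags)
print(packets)
```

The packet form stores each formatting value once per run rather than once per character, which is where the memory saving (and the extra bookkeeping) comes from.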
|
09-01-2009, 11:18 AM | #20 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Regarding 2, the second "string" could be a list instead, if need be, with the index of each list item corresponding to the character position in the plaintext. But a single byte, used as a bitfield, is sufficient for 8 distinct on-or-off states, and those can be combined freely (bold-italic is just both bits set).
My primary aim at this time is to convert RTF into HTML or LaTeX. Given that some of those RTFs have a lot of extraneous formatting information (usually relating to minimally [and needlessly] varying font-size, and similar things) that would be literally harmful to include in the output in most cases, I would probably focus only on bold, italics, small caps, and colour. With such a combination, the output would be reasonably clean, contain no excess/disruptive (mis)formatting, and lend itself well to figuring out what is regular text and what is something else.
I should probably include font-size in the formatting list as well... but I'm almost certain I don't need exact font sizes, just a fuzzier determination as to whether the font size is small, regular, or large. - Ahi
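As a sketch of how those attributes might fit in one byte, assuming one bit each for bold, italics, small caps and colour plus two bits for a fuzzy size bucket. The `Fmt` names, the 12-point base and the 15% tolerance are all made up for illustration, and a single colour bit only records that text is coloured, not which colour:

```python
from enum import IntFlag


class Fmt(IntFlag):
    """Hypothetical one-byte formatting bitfield."""
    BOLD = 0x01
    ITALIC = 0x02
    SMALL_CAPS = 0x04
    COLOUR = 0x08
    SIZE_SMALL = 0x10
    SIZE_LARGE = 0x20   # neither size bit set means "regular"


def size_flags(point_size, base=12.0, tolerance=0.15):
    """Bucket an exact font size into small / regular / large."""
    if point_size < base * (1 - tolerance):
        return Fmt.SIZE_SMALL
    if point_size > base * (1 + tolerance):
        return Fmt.SIZE_LARGE
    return Fmt(0)        # close enough to the base size: regular


bold_italic = Fmt.BOLD | Fmt.ITALIC
stream = bytearray([int(bold_italic), int(size_flags(18.0))])
print(bold_italic, list(stream))
```

With this layout, slightly varying sizes such as 11.5 pt and 12 pt collapse into the same "regular" bucket, which is the kind of noise the RTF sources tend to contain.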
|
|
09-01-2009, 11:31 AM | #21 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
I am using UTF-8 presently... but might ultimately need to switch to using lists carte blanche instead of strings, as some of my processing needs relate to CJK Extension B Chinese characters (which live outside the Basic Multilingual Plane) that take 4 bytes in UTF-8 to represent a single display-character... and Python, even in UTF-8 mode, treats them as at least two separate characters. If I go this route, doubtless I will be choosing reliability over speed in a big way.
And actually, while you're here: memory shouldn't be an issue, as I have 3 GB RAM and 32 GB swap space under my Linux setup, yet I keep getting Python memory errors when trying to process RTF files between 400 MB and 1 GB in size. Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it, as less than a tenth of the swap is used before Python dies, so I'd really like to force Python to process these behemoths... even if it slows my system to a crawl for minutes or even hours. Any tips for me, ekaser? - Ahi
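On the multi-byte character point above, a small illustration of how one Extension B character counts differently depending on the unit. This assumes Python 3.3 or later, where `str` is indexed by code point; older "narrow" builds instead expose such characters as a surrogate pair, which matches the two-character behaviour described:

```python
ch = "\U00020000"   # U+20000, the first code point of CJK Extension B

utf8_len = len(ch.encode("utf-8"))               # bytes in UTF-8
utf16_units = len(ch.encode("utf-16-le")) // 2   # 16-bit units (a surrogate pair)
code_points = len(ch)                            # str length in Python 3.3+

text = "A" + ch + "B"
# One flag per code point keeps a parallel formatting stream aligned
# with the text regardless of how many bytes each character occupies.
flags = bytearray(len(text))

print(utf8_len, utf16_units, code_points, len(flags))
```

Keeping the text as a decoded string (or a list of code points) rather than raw UTF-8 bytes is what makes "position N in the text" and "position N in the flag stream" refer to the same character.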
|
09-01-2009, 11:58 AM | #23 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
And the formatting instructions wouldn't even map 1:1 to either HTML or XML... the latter because I have no plans to generate XML with pacify at all. Still, it's certainly a bit too similar for comfort, Jellby. Thanks for pointing it out. - Ahi
|
09-01-2009, 12:03 PM | #24 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
|
|
09-01-2009, 12:10 PM | #25 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
But you are right. It shouldn't be too hard to do. - Ahi |
|
09-01-2009, 12:19 PM | #26 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
1) I really think it's not close enough to matter.
2) As long as you're not planning on selling it, no one cares.
3) If you do sell it, you're not going to make enough money for anyone to care.
4) If you do sell it and make as much money as Microsoft, YOU won't care.
Lose no sleep. IMHO. (Software patents spew, but that's just my opinion...)
|
09-01-2009, 12:24 PM | #27 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I basically agree with ekaser. I didn't mention the patent issue to scare you, I just thought the relation was interesting, like a déjà vu sort of thing.
|
09-01-2009, 12:28 PM | #28 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Thanks, both of you! And yes, it is interesting.
It just occurred to me that this sort of approach to formatting might also effectively simplify the formatting code. I have RTF files that seem full of unnecessary bold commands... bold being set 5-6 times in a row, for what is basically a single contiguous bolded portion. Converted into this parallel stream format, when the time comes to re-encode into HTML or LaTeX, that sort of mess would only generate a single <b>...</b> or \textbf{...}. Which is fortuitous, because I've also been thinking about how formatting could be refactored so as to remove redundancy. - Ahi
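A rough sketch of that re-encoding step for HTML, assuming the same hypothetical bit values as before (`ITALIC = 0x01`, `BOLD = 0x02`). Because the flag stream only records the state of each character, however many times the source set bold, each contiguous run comes out as one tag pair:

```python
from html import escape
from itertools import groupby

ITALIC, BOLD = 0x01, 0x02             # hypothetical bit assignments
TAGS = [(BOLD, "b"), (ITALIC, "i")]   # outermost tag first


def to_html(text, flags):
    """Emit one set of tags per run of identical formatting flags."""
    out = []
    pos = 0
    for flag, group in groupby(flags):
        n = sum(1 for _ in group)
        chunk = escape(text[pos:pos + n], quote=False)
        active = [tag for bit, tag in TAGS if flag & bit]
        opening = "".join(f"<{t}>" for t in active)
        closing = "".join(f"</{t}>" for t in reversed(active))
        out.append(opening + chunk + closing)
        pos += n
    return "".join(out)


text = "Isn't that the reason we're here?"
flags = bytearray(len(text))
for i in range(6, 10):
    flags[i] |= ITALIC
for i in range(28, 32):
    flags[i] |= BOLD
print(to_html(text, flags))
```

This simple version closes and reopens every tag whenever the flags change, so bold text that becomes bold-italic midway produces two adjacent `<b>` elements; merging those would need a slightly smarter writer.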
09-01-2009, 01:29 PM | #29 | |
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
|
Quote:
EDIT: Note, I'm not suggesting you include a 'general' HTML or LaTeX parser, any more than you're going to have a general RTF parser, just the "basic stuff" that you want to keep, throwing everything else away. Sure, it would make a mess of some files, but those files probably wouldn't be appropriate for this style of conversion either. I'm assuming this is aimed at "simple novel" types of books that don't have a lot of fancy formatting to start with.
One thought: since you're taking RTF as input files, some of those will have images (covers, maps, etc), so I'm hoping that those image tags would be maintained along with the bold, italic, etc, right? That would imply the need to be able to include a "numbered mark" in the formatting string. Perhaps if the most significant bit of the formatting 'character' was set, then the lower bits would be the 'number' of the image (on the "image stack") that should be inserted at that point. Of course, that then also brings up the question of image positioning: left, center, right.
Last edited by ekaser; 09-01-2009 at 01:38 PM.
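The "numbered mark" idea could look roughly like this. It assumes a one-byte formatting cell, uses U+FFFC (the Unicode object replacement character) as a placeholder in the text so both streams stay the same length, and the helper names and file names are invented for the example:

```python
IMAGE_MARK = 0x80   # most significant bit: this cell refers to an image
INDEX_MASK = 0x7F   # lower seven bits: index into the image list


def image_cell(index):
    """Build a formatting cell that points at image number `index`."""
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("image index does not fit in seven bits")
    return IMAGE_MARK | index


def decode_cell(cell):
    """Return ('image', index) or ('format', flags) for one cell."""
    if cell & IMAGE_MARK:
        return ("image", cell & INDEX_MASK)
    return ("format", cell)


images = ["cover.png", "map.png"]     # hypothetical "image stack"
text = "\uFFFCChapter 1"              # placeholder character, then text
flags = bytearray(len(text))
flags[0] = image_cell(0)              # the placeholder stands for images[0]
print(decode_cell(flags[0]), decode_cell(flags[1]))
```

Seven bits allow at most 128 images and leave no room for positioning, so left/center/right would have to be stored next to each entry in the image list rather than in the cell itself.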
|
09-01-2009, 01:42 PM | #30 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
I've just been itching to get working code and, in retrospect, have churned out messy and architecturally lazy code in order to do so. I think I need to redo the architecture in order to have a clean "intake" portion that converts input files into the internal format, a processing portion that does whatever needs to be done, and an output portion that converts the internal format into the chosen output format. Once I have that, a good deal of the existing code can be plugged in there without too much modification, and adding additional input or output formats can also be done without any complication of the existing code. And I can already think of somebody who would be very pleased to have minimalist RTFs generated from existing more complex ones. - Ahi |
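A bare-bones sketch of that intake / processing / output split, assuming the internal format is plain text plus a parallel flag stream. Every name here (`Document`, `READERS`, `WRITERS`, `convert`, the `"txt"` format) is hypothetical and only meant to show how new formats could be plugged in without touching existing code:

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical internal format: text plus a parallel flag stream."""
    text: str
    flags: bytearray = field(default_factory=bytearray)


READERS = {}   # format name -> function(raw input) -> Document
WRITERS = {}   # format name -> function(Document) -> output string


def reader(fmt):
    """Decorator that registers an intake function for `fmt`."""
    def register(func):
        READERS[fmt] = func
        return func
    return register


def writer(fmt):
    """Decorator that registers an output function for `fmt`."""
    def register(func):
        WRITERS[fmt] = func
        return func
    return register


@reader("txt")
def read_plain(raw):
    return Document(raw, bytearray(len(raw)))


@writer("txt")
def write_plain(doc):
    return doc.text


def curl_apostrophes(doc):
    """Example processing step; same-length edit keeps flags aligned."""
    return Document(doc.text.replace("'", "\u2019"), doc.flags)


def convert(raw, source, target, steps=()):
    """Intake, then each processing step in order, then output."""
    doc = READERS[source](raw)
    for step in steps:
        doc = step(doc)
    return WRITERS[target](doc)


result = convert("Isn't it?", "txt", "txt", [curl_apostrophes])
print(result)
```

An RTF reader or a LaTeX writer would then be one more registered function each, and a "minimalist RTF" output is simply an RTF writer fed from the same internal format.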
|
|