Go Back   MobileRead Forums > E-Book Formats > Workshop

Old 09-01-2009, 09:48 AM   #16
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by Jellby View Post
No, it was an example of a likely false positive with the rules above

Granted, you can eliminate false positives with your two pass method, but there could be literally hundreds of them, often many more than real "internal dialogue" phrases.

As for false negatives, I often find dialogues (internal or not) that just omit the "he said", "she thought", etc. words. One should also look for "he said to himself" or "he wondered", or "he secretly admitted", etc.

An automated tool can be of some help, but the danger is letting the user rely solely on the tool, which can be worse than just leaving the "internal dialogues" unformatted. Similarly, when I see curly quotes wrongly oriented I would prefer they had been left as straight quotes instead.
You are completely right.

My hope is to make pacify into a tool that can alleviate a lot of eBook preparation "monkey work" from my shoulders. But yes, it's important to be clear which operations are reasonably likely to be foolproof, and which ones are sure to require thorough looking-over by a human being.

If somebody wants all internal dialogue typeset though... even fixing just 50% through poor pattern-matching/semi-automation reduces the outstanding work considerably.
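
A minimal sketch of that kind of semi-automation, assuming a small hand-written cue list (the `THOUGHT_CUES` pattern and the `flag_candidates` name are made up for illustration, not part of pacify). It only produces a review list for a human and reformats nothing, so false positives cost a glance rather than a typesetting error.

```python
import re

# Hypothetical cue list; it would need extending for any real book.
THOUGHT_CUES = re.compile(
    r"\b(?:he|she|I)\s+"
    r"(?:thought|wondered|said to (?:himself|herself|myself)|secretly admitted)\b",
    re.IGNORECASE,
)

def flag_candidates(paragraphs):
    """Return (index, paragraph) pairs that contain a thought cue.

    Nothing is changed here: the result is a list for a human to review.
    """
    return [(i, p) for i, p in enumerate(paragraphs) if THOUGHT_CUES.search(p)]

sample = [
    "This will never work, he thought.",
    "He thought about the weather for a while.",  # matches, but is a false positive
    "She closed the door.",
]
candidates = flag_candidates(sample)
```

The second sample paragraph is flagged even though it is not internal dialogue, which is exactly the sort of false positive the second pass has to weed out.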

Not something I'd ever do for myself, to be honest... but, so long as one understands what one is looking for, and what the limitations are, probably not hard to throw together something that would be helpful.

- Ahi
Old 09-01-2009, 10:18 AM   #17
ahi
Wizard
Can I run something by you, Jellby? (And by whoever else may be reading.)

A lot of what I am trying to do with pacify.py is going to be text processing... but at the same time, I do want to be able to handle some light formatting--bold, italics, maybe a bit more.

Unfortunately, any obvious/straightforward way of handling formatting interferes with the straightforwardness of the text processing. For example, once I have HTML tags, HTML entities, or LaTeX commands in there, it gets harder to find out what the first character of the subsequent paragraph is, because the formatting portions have to be escaped or skipped first.

I have a vague idea in my head about creating a class in Python that would serve both formatting and text processing concerns, by keeping content in the following manner:

For any string of length X, it would store two strings of length X. The first stored string would be the plaintext, the second stored string would be byte-long bitfields that provide formatting information.

Or, to give a dumbed-down view, instead of:

Code:
Isn't <i>that</i> the reason we're <b>here</b>?
would be:

Code:

String 1: Isn't that the reason we're here?
String 2: 000000IIII000000000000000000BBBB0
And then any operation done on the plaintext (via the class's methods) would perform the equivalent operation on the formatting string. This way content and formatting could be dealt with separately without having to painstakingly escape formatting instructions for any text-processing operation.
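
A rough sketch of what such a class could look like, assuming Python 3 and a list of flag values instead of a second string. The names `Fmt` and `FormattedText` are invented here, and the policy that replacement text inherits the flags of the first replaced character is just one possible choice.

```python
from enum import IntFlag

class Fmt(IntFlag):
    NONE = 0
    ITALIC = 1
    BOLD = 2

class FormattedText:
    """Plaintext plus one formatting value per character, kept in step."""

    def __init__(self, text, flags=None):
        self.text = text
        self.flags = list(flags) if flags is not None else [Fmt.NONE] * len(text)
        if len(self.flags) != len(self.text):
            raise ValueError("text and flags must have the same length")

    def __len__(self):
        return len(self.text)

    def __getitem__(self, key):
        # Slicing the text slices the formatting identically.
        if isinstance(key, slice):
            return FormattedText(self.text[key], self.flags[key])
        return FormattedText(self.text[key], [self.flags[key]])

    def mark(self, start, end, flag):
        """Switch a flag on for the characters in [start, end)."""
        for i in range(start, end):
            self.flags[i] |= flag

    def replace(self, old, new):
        """Like str.replace; new text inherits the flags of the first
        replaced character (an assumed policy, not the only sensible one)."""
        out_text, out_flags, pos = [], [], 0
        while old:
            hit = self.text.find(old, pos)
            if hit == -1:
                break
            out_text.append(self.text[pos:hit] + new)
            out_flags.extend(self.flags[pos:hit])
            out_flags.extend([self.flags[hit]] * len(new))
            pos = hit + len(old)
        out_text.append(self.text[pos:])
        out_flags.extend(self.flags[pos:])
        return FormattedText("".join(out_text), out_flags)

line = FormattedText("Isn't that the reason we're here?")
line.mark(6, 10, Fmt.ITALIC)   # "that"
line.mark(28, 32, Fmt.BOLD)    # "here"
fixed = line.replace("Isn't", "Is not")
```

After the replacement the text is one character longer, and the italic and bold ranges have moved along with the words they belong to, without any tag escaping.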

What do you think? Any chance that this is a better way than the obvious alternative of using HTML or something similar internally?

- Ahi
Old 09-01-2009, 11:11 AM   #18
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
I won't claim I have any authority in that field, because I haven't. I have no real experience in programming (other than some scientific samples in Fortran). But two thoughts occur to me:

1. Isn't that roughly what the recent patent conflict with MS-Word was about?

2. Wouldn't you need too large a "byte" size for the format string? It's simple for just italic and bold, but how do you store bold-italic? How do you store bold, italic, underlined, red and large size? If your goal is supporting only basic stuff (like just bold and italic) then it's probably fine, but I suspect almost any other alternative would be equally fine...
Old 09-01-2009, 11:12 AM   #19
ekaser
Opinion Artiste
Posts: 301
Karma: 61464
Join Date: Mar 2009
Location: Albany, OR
Device: Nexus 5, Nexus 7, Kindle Touch, Kindle Fire
Quote:
Originally Posted by ahi View Post
A lot of what I am trying to do with pacify.py is going to be text processing... but at the same time, I do want to be able to handle some light formatting--bold, italics, maybe a bit more.

Unfortunately any obvious/straightforward way of handling formatting interferes with the straightforwardness of any text processing.

I have a vague idea in my head about creating a class in python that would facilitate both formatting and text processing concerns, by keeping content in the following manner:

For any string of length X, it would store two strings of length X. The first stored string would be the plaintext, the second stored string would be byte-long bitfields that provide formatting information.

And then any operation done on the plaintext (via the class's methods) would perform the equivalent operation on the formatting string. This way content and formatting could be dealt with separately without having to painstakingly escape formatting instructions for any text-processing operation.
Are you aiming this completely at English? If not, if you think you or someone else might want to adapt it to other languages at some point, you might want to use WORD arrays from the start rather than BYTE arrays, so that UNICODE or other character sets could be adopted at some point more easily. That would also give you a few more "formatting options" with 16 flags instead of just 8.

Every data storage method has its advantages and disadvantages. For this type of data-stream/formatting combination, you've pretty much got:
1) in-stream (data and formatting mixed in same stream of bytes)
2) parallel streams (what you're considering)
3) in-stream flags (a combo of 1 and 2: wider 'bytes' (WORDS or DWORDS) with flags in the upper bits)
4) packets (blocks of text with common formatting)
5) stream and heap-of-format-pointers

and probably several other convoluted methods. Which works best depends a great deal upon what your 'application' needs to accomplish. An application that primarily has to DISPLAY the data might work better with 4) or 5), whereas an application that does NOT need to display the data will probably work better with one of the others, and which one of them will depend upon the nature of the processing that's being done. 4) and 5) are more memory efficient, but make the code more complex.

For what I THINK it is you're trying to accomplish (primarily file format shifting of fairly simple text files), then what you suggest should work quite well, since memory usage is generally no longer such an issue. When memory was less ... abundant, then code complexity was often the sacrificial lamb to memory usage.
Old 09-01-2009, 11:18 AM   #20
ahi
Wizard
Quote:
Originally Posted by Jellby View Post
I won't claim I have any authority in that field, because I haven't. I have no real experience in programming (other than some scientific samples in Fortran). But two thoughts occur to me:

1. Isn't that roughly what the recent patent conflict with MS-Word was about?

2. Wouldn't you need too large a "byte" size for the format string? It's simple for just italic and bold, but how do you store bold-italic? How do you store bold, italic, underlined, red and large size? If your goal is supporting only basic stuff (like just bold and italic) then it's probably fine, but I suspect almost any other alternative would be equally fine...
Can you tell me more about (1)? I'm oblivious.

Regarding 2, the second "string" could be a list instead, if need be, with the number of the list item corresponding to the byte-position in the plaintext. But a single byte, used as a bitfield, is sufficient for 8 distinct on-or-off states.
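
To make the bitfield point concrete, here is a small sketch assuming one byte per character held in a `bytearray`. The bit assignments are arbitrary and chosen only for illustration. Bold-italic is simply two bits set in the same byte.

```python
# Arbitrary bit assignments for illustration; one byte holds up to 8 of them.
BOLD, ITALIC, UNDERLINE, SMALLCAPS = 0x01, 0x02, 0x04, 0x08

text = "bold-italic"
fmt = bytearray(len(text))       # one byte per character, all flags off

for i in range(len(text)):
    fmt[i] |= BOLD | ITALIC      # both bits live in the same byte

is_bold = bool(fmt[0] & BOLD)
is_underlined = bool(fmt[0] & UNDERLINE)

fmt[0] &= ~ITALIC & 0xFF         # switch italic off for the first character only
```

A value such as colour does not fit a single on/off bit, so it would need either several bits or the list-based variant mentioned above.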

My primary aim at this time is to convert RTF into HTML or LaTeX. Given that some of those RTFs have a lot of extraneous formatting information (usually relating to minimally [and needlessly] varying font-size, and similar things) that would be literally harmful to include in the output in most cases, I would probably focus only on bold, italics, small caps, and colour. With such a combination, the output would be reasonably clean, contain no excess/disruptive (mis)formatting, and lend itself well to figuring out what is regular text and what is something else.

I should probably include font-size in the formatting list as well... but I'm almost certain I don't need exact font sizes, but rather a more fuzzy determination as to whether the font size is small, regular, or large.
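
One way the fuzzy size idea might be sketched: RTF's `\fs` control word gives the size in half-points, so sizes can be compared against the body text size and bucketed. The function name, the 24 half-point (12 pt) body default, and the 15% tolerance are all assumptions picked for illustration.

```python
def size_bucket(half_points, body_half_points=24, tolerance=0.15):
    """Classify an RTF font size (in half-points) relative to the body size.

    Sizes within `tolerance` of the body size count as "regular", so the
    needless small variations disappear. The thresholds are guesses.
    """
    ratio = half_points / body_half_points
    if ratio < 1 - tolerance:
        return "small"
    if ratio > 1 + tolerance:
        return "large"
    return "regular"

buckets = [size_bucket(s) for s in (16, 23, 24, 25, 36)]
```

Three buckets fit in two bits of the formatting byte, leaving the remaining bits for on/off flags.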

- Ahi
Old 09-01-2009, 11:31 AM   #21
ahi
Wizard
Quote:
Originally Posted by ekaser View Post
Are you aiming this completely at English? If not, if you think you or someone else might want to adapt it to other languages at some point, you might want to use WORD arrays from the start rather than BYTE arrays, so that UNICODE or other character sets could be adopted at some point more easily. That would also give you a few more "formatting options" with 16 flags instead of just 8.

Every data storage method has its advantages and disadvantages. For this type of data-stream/formatting combination, you've pretty much got:
1) in-stream (data and formatting mixed in same stream of bytes)
2) parallel streams (what you're considering)
3) in-stream flags (a combo of 1 and 2: wider 'bytes' (WORDS or DWORDS) with flags in the upper bits)
4) packets (blocks of text with common formatting)
5) stream and heap-of-format-pointers

and probably several other convoluted methods. Which works best depends a great deal upon what your 'application' needs to accomplish. An application that primarily has to DISPLAY the data might work better with 4) or 5), whereas an application that does NOT need to display the data will probably work better with one of the others, and which one of them will depend upon the nature of the processing that's being done. 4) and 5) are more memory efficient, but make the code more complex.

For what I THINK it is you're trying to accomplish (primarily file format shifting of fairly simple text files), then what you suggest should work quite well, since memory usage is generally no longer such an issue. When memory was less ... abundant, then code complexity was often the sacrificial lamb to memory usage.
I oversimplified without saying so. I am in fact aiming this to be reasonably international... or at least have the potential to be so.

I am using UTF-8 presently... but might ultimately need to switch to using lists across the board instead of strings, as some of my processing needs relate to CJK Extension B Chinese characters, which take 4 bytes in UTF-8 to represent a single display character... and Python, even when working with Unicode strings, treats each of them as at least two separate characters. If I go this route, doubtless I will be choosing reliability over speed in a big way.
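
For reference, a short check of how such a character behaves, assuming Python 3.3 or later (where a `str` is a sequence of code points). Extension B lies outside the Basic Multilingual Plane, so each character is 4 bytes in UTF-8 and a surrogate pair in UTF-16; the pair is what narrow Python 2 builds exposed as two "characters".

```python
ch = "\U00020000"  # first code point of CJK Unified Ideographs Extension B

utf8_len = len(ch.encode("utf-8"))             # bytes in UTF-8
utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit units in UTF-16
code_points = len(ch)                           # what Python 3 str indexing sees

# A parallel formatting list sized by len() lines up one entry per character.
text = "a" + ch + "b"
flags = [0] * len(text)
```

Under that assumption a plain `str` plus a same-length list keeps the two streams aligned, and a switch to lists of characters is only needed on interpreters that count surrogate halves separately.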

And actually, while you're here, though memory shouldn't be an issue... I have 3 GB RAM and 32 GB swap space under my Linux setup, I keep getting Python memory errors when trying to process RTF files between 400 MB - 1 GB in size.

Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it... as less than a tenth of the swap is used before Python dies... so I'd really like to force Python to process these huge behemoths... even if it slows my system to a crawl for minutes or even hours.

Any tips for me, ekaser?

- Ahi
Old 09-01-2009, 11:36 AM   #22
Jellby
frumious Bandersnatch
Quote:
Originally Posted by ahi View Post
Can you tell me more about (1)? I'm oblivious.
I think it was this.
Old 09-01-2009, 11:58 AM   #23
ahi
Wizard
Quote:
Originally Posted by Jellby View Post
I think it was this.
Hmmm... one would think it doesn't cover parallel streams in general. What do you think, ekaser?

And the formatting instructions wouldn't even map 1:1 to either HTML or XML... the latter because I have no plans to generate XML with pacify at all.

Certainly a bit too similar for comfort, Jellby. Thanks for pointing it out.

- Ahi
Old 09-01-2009, 12:03 PM   #24
ekaser
Opinion Artiste
Quote:
Originally Posted by ahi View Post
Though memory shouldn't be an issue... I have 3 GB RAM and 32 GB swap space under my Linux setup, I keep getting Python memory errors when trying to process RTF files between 400 MB - 1 GB in size.

Yes, yes... "Duh!", I know. But is there a way to increase Python's memory limit? My system can obviously take it... as less than a tenth of the swap is used before Python dies... so I'd really like to force Python to process these huge behemoths... even if it slows my system to a crawl for minutes or even hours.
Sorry, I'm not a Python expert (just starting on it, really, I'm a long-time C guy), so I can't help you with Python. But the old, tried and true method is to not read the whole thing in all at once: read it in chunks, process that chunk, and when you get "close to the end" of the chunk, move it up, refill the queue with the next chunk from the file and keep going. Of course, that works better with some 'things' than others, but I would think it would work reasonably well with .rtf text files, which are pretty linear beasts.

You might have to keep around a 'stack' of "open blocks" for text that's long since been flushed from the processing queue, so that you know what's pending when you reach the end of that block in the queue, but probably not. If you make the processing queue sufficiently large (4M? 8M? 16M? 32M? any of those would probably be plenty big and would avoid the "memory issues"), then you could update/refill the queue at opportune moments.

In managing the queue, you can either move the unused portion up and then refill from there to the end of the queue, or just keep pointers to the start and end of the unprocessed portion. In the second case, refilling the queue involves two reads: the first to fill the tail-end portion of the unfilled queue and the second to fill the front-end unused portion. If/when speed of processing is not an issue (which I don't think it is in your case), then move-and-fill is preferred, because it makes the rest of the code MUCH simpler. With "rotating pointers", you're constantly checking for reaching the end of the queue and whether the end pointer is greater or lesser than the start pointer and such. PITA.
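
The move-and-fill approach described above might be sketched in Python as follows. The function name, the 8 MB default, and the choice of a newline as the "safe" cut point are assumptions; a real RTF reader would need a cut point that respects open `{...}` groups, which is where the 'stack' of open blocks would come in.

```python
import io

def process_in_chunks(stream, handle, chunk_size=8 * 1024 * 1024, boundary="\n"):
    """Feed `stream` to `handle` piece by piece instead of reading it whole.

    Each piece ends on `boundary`; the unfinished tail is carried over and
    the next read is appended to it (move-and-fill).
    """
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        cut = pending.rfind(boundary)
        if cut == -1:
            continue  # no safe cut point yet, keep filling
        handle(pending[:cut + len(boundary)])
        pending = pending[cut + len(boundary):]
    if pending:
        handle(pending)  # whatever is left after the last boundary

pieces = []
source = io.StringIO("one\ntwo\nthree\nfour")
process_in_chunks(source, pieces.append, chunk_size=5)
```

With a real file the same call would take an open file object, for example one returned by `open(path, encoding=...)`, so only about one chunk plus the carried-over tail is held in memory at a time.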
Old 09-01-2009, 12:10 PM   #25
ahi
Wizard
Quote:
Originally Posted by ekaser View Post
Sorry, I'm not a Python expert (just starting on it, really, I'm a long-time C guy), so I can't help you with Python. But the old, tried and true method, is to not read the whole thing in all at once, read it in chunks, process that chunk, when you get "close to the end" of the chunk, move it up and refill the queue with the next chunk from the file and keep going. Of course, that works better with some 'things' than others, but I would think it would work reasonably well with .rtf text files, which are pretty linear beasts. You might have to keep around a 'stack' of "open blocks" for text that's long since been flushed from the processing queue, so that you know what's pending when you reach the end of that block in the queue, but probably not. If you make the processing queue sufficiently large (4M? 8M? 16M 32M? any of those would probably be plenty big and would avoid the "memory issues"), then you could update/refill the queue at opportune moments. In managing the queue, you can either move the unused portion up and then refill from there to the end of the queue, or just keep pointers to the start and end of the unprocessed portion, and refilling the queue then involves two reads, the first to fill the tail-end portion of the unfilled queue and the second to fill the front-end unused portion. If/when speed of processing is not an issue (which I don't think it is in your case), then move-and-fill is preferred, because it makes the rest of the code MUCH simpler. With "rotating pointers", you're constantly checking for reaching the end of the queue and whether the end pointer is greater or lesser than the start pointer and such. PITA.
I might have to resort to that... I'm just irked that I have to do any workaround for code that clearly works fine with smaller files, and that even with the larger files is not actually exceeding (or even approaching) my hardware's memory limits.

But you are right. It shouldn't be too hard to do.

- Ahi
Old 09-01-2009, 12:19 PM   #26
ekaser
Opinion Artiste
Quote:
Originally Posted by ahi View Post
Hmmm... one would think it doesn't cover parallel streams in general. What do you think, ekaser?

And the formatting instructions wouldn't even map 1:1 to either HTML or XML... the latter because I have no plans to generate XML with pacify at all.
IANAL, but:
1) I really think it's not close enough to matter.
2) As long as you're not planning on selling it, no one cares.
3) If you do sell it, you're not going to make enough money for anyone to care.
4) If you do sell it and make as much money as Microsoft, YOU won't care.

Lose no sleep. IMHO.

(Software patents spew, but that's just my opinion...)
Old 09-01-2009, 12:24 PM   #27
Jellby
frumious Bandersnatch
I basically agree with ekaser. I didn't mention the patent issue to scare you, I just thought the relation was interesting, like a déjà vu sort of thing.
Old 09-01-2009, 12:28 PM   #28
ahi
Wizard
Thanks, both of you! And yes, it is interesting.

It just occurred to me that this sort of approach to formatting might also effectively simplify formatting code.

I have RTF files that seem full of unnecessary bold commands... bold being set 5-6 times in a row, for what is basically a single contiguous bolded portion. Converted into this parallel stream format, when time comes to reencode into HTML or LaTeX, that sort of a mess would only generate a single <b>...</b> or \textbf{...}.
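
A small sketch of why the redundancy disappears, assuming per-character integer flags and HTML output (the `to_html` name and bit values are illustrative only). However many times the source set bold, the flags end up identical for the whole stretch, so grouping equal neighbours yields one run.

```python
from html import escape
from itertools import groupby

BOLD, ITALIC = 1, 2  # illustrative bit values

def to_html(text, flags):
    """Emit one tag pair per run of identical formatting."""
    out, pos = [], 0
    for flag, group in groupby(flags):
        n = len(list(group))                 # length of this run
        piece = escape(text[pos:pos + n])
        if flag & ITALIC:
            piece = "<i>" + piece + "</i>"
        if flag & BOLD:
            piece = "<b>" + piece + "</b>"
        out.append(piece)
        pos += n
    return "".join(out)

text = "a bold move"
flags = [0, 0] + [BOLD] * 4 + [0] * 5   # "bold" is bold, however often RTF said so
html_out = to_html(text, flags)
```

A LaTeX writer would follow the same loop and wrap runs in `\textbf{...}` and `\textit{...}` instead.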

Which is fortuitous, because I've also been thinking of how formatting could be refactored so as to remove redundancy.

- Ahi
Old 09-01-2009, 01:29 PM   #29
ekaser
Opinion Artiste
Quote:
Originally Posted by ahi View Post
It just occurred to me that this sort of approach to formatting might also effectively simplify formatting code.
Very true, a built-in form of TIDY. It might be worthwhile to add HTML input (LaTeX, I don't care about personally, and I'm not sure how many folks really use it or would use it, but that's neither here nor there). Right now you have .TXT and .RTF input and .HTML and LaTeX output. Why not make it 'orthogonal': any of the four as input and any of the four as output? That would, in essence, make it a 'tidy' and conversion program all in one. Just a thought.

EDIT: Note, I'm not suggesting you include a 'general' HTML or LaTeX parser, any more than you're going to have a general RTF parser, just the "basic stuff" that you want to keep and throw everything else away. Sure, some files it would make a mess of, but those files probably wouldn't be appropriate for this style of conversion either. I'm assuming this is aimed at "simple novel" types of books that don't have a lot of fancy formatting to start with.

One thought: since you're taking RTF as input files, some of those will have images (covers, maps, etc), so I'm hoping that those image tags would be maintained along with the bold, italic, etc, right? That would imply the need to be able to include a "numbered mark" in the formatting string. Perhaps if the most significant bit of the formatting 'character' was set, then the lower bits are the 'number' of the image (on the "image stack") that should be inserted at that point. Of course, that then also brings up the question of image positioning: left, center, right.
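
The "most significant bit" idea could be sketched like this, assuming a one-byte formatting value. The names are invented, and which placeholder character sits in the plaintext at that position is left open. Seven remaining bits allow 128 images; positioning would need extra storage.

```python
IMAGE_MARK = 0x80   # most significant bit of the formatting byte
INDEX_MASK = 0x7F   # remaining seven bits: image number 0..127

def image_marker(index):
    """Build a formatting byte that points at entry `index` of the image stack."""
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("image index must fit in 7 bits")
    return IMAGE_MARK | index

def decode(fmt_byte):
    """Tell an image marker apart from ordinary formatting flags."""
    if fmt_byte & IMAGE_MARK:
        return ("image", fmt_byte & INDEX_MASK)
    return ("format", fmt_byte)

marker = image_marker(5)
```

The cost is that ordinary formatting is left with seven flag bits instead of eight.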

Last edited by ekaser; 09-01-2009 at 01:38 PM.
Old 09-01-2009, 01:42 PM   #30
ahi
Wizard
Quote:
Originally Posted by ekaser View Post
Very true, a built-in form of TIDY. It might be worthwhile to add HTML input (LaTeX, I don't care about personally, and I'm not sure how many folks really use it or would use it, but that's neither here nor there). Right now you have .TXT and .RTF input and .HTML and LaTeX output. Why not make it 'orthogonal': any of the four as input and any of the four as output? That would, in essence, make it a 'tidy' and conversion program all in one. Just a thought.
Probably the approach I will take, ekaser.

I've just been itching to get working code and, in retrospect, have churned out messy and architecturally lazy code in order to do so.

I think I need to redo the architecture in order to have a clean "intake" portion that converts input files into the internal format, a processing portion that does whatever needs to be done, and an output portion that converts the internal format into the chosen output format.
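
As a rough illustration of that three-part split, here is a sketch with reader and writer registries around a shared (text, flags) internal format. All names are hypothetical and the txt/html handlers are deliberately trivial; this is not pacify's actual code.

```python
from html import escape

READERS = {}   # format name -> function(data) -> (text, flags)
WRITERS = {}   # format name -> function(text, flags) -> output string

def reader(name):
    def register(func):
        READERS[name] = func
        return func
    return register

def writer(name):
    def register(func):
        WRITERS[name] = func
        return func
    return register

@reader("txt")
def read_txt(data):
    return data, [0] * len(data)          # plain text carries no formatting

@writer("txt")
def write_txt(text, flags):
    return text

@writer("html")
def write_html(text, flags):
    return "<p>" + escape(text) + "</p>"  # flags ignored in this toy writer

def trim(text, flags):
    """Example processing step: drop trailing whitespace, keep streams aligned."""
    stripped = text.rstrip()
    return stripped, flags[:len(stripped)]

def convert(data, source, target, steps=()):
    text, flags = READERS[source](data)    # intake
    for step in steps:                     # processing
        text, flags = step(text, flags)
    return WRITERS[target](text, flags)    # output

result = convert("Hello  ", "txt", "html", steps=[trim])
```

Adding a format then means registering one more reader or writer, with no change to `convert` or to the processing steps.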

Once I have that, a good deal of the existing code can be plugged in there without too much modification, and adding additional input or output formats can also be done without any complication of the existing code.

And I can already think of somebody who would be very pleased to have minimalist RTFs generated from existing more complex ones.

- Ahi
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.