View Full Version : RTF conversion.


Riocaz
07-21-2006, 08:19 AM
Because it bugs me...

Anyone know of a tool which will strip font and size tags from an rtf file, but leaves the bold and italic tags in place?

It would aid me in converting my RTF files to HTML, as they bloat the end file if you convert straight. (For example, I stripped 1.4MB of unnecessary crud from one I was playing with yesterday and dropped the file size from 2.5MB to 1.1MB.)

I know I saw such a tool when I was searching for an rtf/html conversion tool, but unfortunately I didn't grab it at the time, and now cannot find it.

ElaHuguet
07-21-2006, 08:24 AM
There's a tool called... Tidy something-or-other (TidyHTML? TidyUI? I have it at home), which is really cool for cleaning up code, also works great for cleaning up general MS Word crud. It's free, so you can check it out. Just google it, I found it that way.

Riocaz
07-21-2006, 08:38 AM
Doh! Why didn't I think of that? I was so focussed on finding something to "fix" the input file, I didn't think of "fixing" the output (even though I was trying to do that manually).

That looks perfect, I will have a play and tell you all how it does.

ElaHuguet
07-21-2006, 08:55 AM
Hehehe... you're welcome, I found it easy to use.

meisterz
07-24-2006, 05:37 PM
Where can I find this Tidy?? program?

Thanks

branko
07-24-2006, 06:07 PM
It's the first link in Google. Do you know what Google is?

meisterz
07-25-2006, 09:27 AM
By the sarcasm I assume it is either tidyhtml http://www.tucows.com/preview/206197 or tidy ui http://www.forums.devnetwork.net/viewtopic.php?p=156958&sid=d0c4504a83e8d6475be6b53b800f978c

branko
07-25-2006, 01:59 PM
Sarcasm?

yokos
08-01-2006, 11:37 AM
If you are a fan of almighty LaTeX give rtf2LaTeX a try. It works fine. http://sourceforge.net/projects/rtf2latex2e

rlauzon
08-01-2006, 11:47 AM
Anyone know of a tool which will strip font and size tags from an rtf file, but leaves the bold and italic tags in place?

I usually use OpenOffice.

1. I convert the RTF into an HTML file.
2. Reload the HTML file back into OpenOffice.
3. I use the source view to do a Find/Replace on all the offending tags.
4. Then I convert the HTML into a regular OpenOffice file to save it.
5. And finally, I export to PDF to put it on my iLiad.
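
For anyone who would rather script the Find/Replace in step 3, here is a minimal Python sketch. It assumes the offending markup is plain <font ...> and </font> tags, which may not match what your converter emits, and the function name strip_font_tags is made up for illustration. Bold and italic tags are left alone.

```python
import re

# Matches opening and closing <font> tags, with or without attributes.
# Assumption: font and size information lives in <font> tags, not in CSS.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)

def strip_font_tags(html: str) -> str:
    """Remove <font ...> and </font> tags but keep their contents."""
    return FONT_TAG.sub("", html)

if __name__ == "__main__":
    sample = '<p><font face="Arial" size="3"><b>Hi</b> <i>there</i></font></p>'
    print(strip_font_tags(sample))
```

A regex is fine for a one-off cleanup like this, but it is not a real HTML parser, so check the result before throwing away the original file.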

jsc
08-02-2006, 06:13 PM
I just got rtf2latex2e compiled for OS X and used it to convert a Baen RTF. It helped some, but I have to say it's simply not very good. I had to deal with a great number of unbalanced environment tags (italics started but never ended) and a large section of boldface which was not visible in the RTF. This may be because the RTF file included some formatting badness, but I think the above suggestion to use OpenOffice to convert to xhtml is better.

Unfortunately, the xhtml generated by OpenOffice uses CSS heavily, so it's not always obvious what markup to substitute. Italics is not done with an i tag, it is a p tag with a CSS class. Still, it puts it in a format that's at least workable. The final issue is to replace double and single quotes with appropriate text quotes, which I'm working on a script to do heuristically (you can't just count on there being left-right quote pairs, since multiple paragraphs in quotes are traditionally started with but not ended by a text quote).
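
Once you have looked in the stylesheet and worked out which class means italic, the substitution itself can be scripted. A rough Python sketch, assuming the paragraphs look like <p class="NAME">...</p> with class as the only attribute; the class name "P3" and the function name italicize_class are placeholders, not anything OpenOffice is guaranteed to produce.

```python
import re

def italicize_class(xhtml: str, css_class: str) -> str:
    """Rewrite <p class="css_class">text</p> as <p><i>text</i></p>.

    Assumption: the p tag carries only the class attribute, and
    css_class is the class you have identified as italic by hand.
    """
    pattern = re.compile(
        r'<p\s+class="' + re.escape(css_class) + r'"\s*>(.*?)</p>',
        re.DOTALL,  # let a paragraph span several lines
    )
    return pattern.sub(r"<p><i>\1</i></p>", xhtml)

if __name__ == "__main__":
    sample = '<p class="P3">Hello</p><p class="P1">World</p>'
    print(italicize_class(sample, "P3"))
```

Paragraphs with any other class are left untouched, so you can run it once per class you want to convert.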

NatCh
08-02-2006, 06:33 PM
The final issue is to replace double and single quotes with appropriate text quotes
Okay, I'm not trying to be obnoxious, really I'm not, but I can't think of another way to ask this. :blink:

Why is this such a deal? It doesn't bother me at all if it's a "" instead of “” -- either way I get that it's a quote.... Is it just a matter of preference, or am I missing something here?


As a suggestion to address this, wouldn't it be a “ if it has a non-whitespace character after it, and a ” otherwise? Maybe that helps with the find/replace.

I think I'd try searching for "<whitespace> and replace all those with ”<whitespace>, and then search for all the remaining " and replace with “
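
The two-pass search and replace above, sketched in Python. This is only the simple rule as described (double quotes only), not a complete solution, and the function name curl_double_quotes is made up. The one addition is that a quote at the very end of the text is also treated as closing, since nothing follows it.

```python
import re

def curl_double_quotes(text: str) -> str:
    """Turn straight double quotes into curly ones with a two-pass rule."""
    # Pass 1: a quote followed by whitespace (or the end of the text) closes.
    text = re.sub(r'"(?=\s|$)', "\u201d", text)
    # Pass 2: every quote still left is treated as opening.
    return text.replace('"', "\u201c")

if __name__ == "__main__":
    print(curl_double_quotes('"Hello," she said. "Bye."'))
```

Note that a closing quote followed directly by punctuation or a dash is not caught by pass 1 and ends up as an opening quote, so the output still needs a look by eye.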

Riocaz
08-03-2006, 06:51 AM
JSC: I would suspect the original file. I had similar problems with size/justification/etc when converting them.

NatCh: If you are missing something then so am I. I find "zzz" ''zzz'' “zzz” almost indistinguishable. So it's a matter of personal preference IMO.

NatCh
08-03-2006, 12:48 PM
NatCh: If you are missing something then so am I. I find "zzz" ''zzz'' “zzz” almost indistinguishable. So it's a matter of personal preference IMO.
Thanks, Riocaz, that's the conclusion I was drawing too. Guess it comes down to "I likes what I likes," which is as it should be. :happy2:

jsc
08-08-2006, 12:28 PM
NatCh, your suggestion about spaces before or after is correct, and I've been doing that, but there are situations where it does not apply. Especially with the first author I've been working at converting, who tends to use a lot of m-dashes to interject comments within speeches, you get a lot of ---"text and text"--- and you cannot just assume that the text is quote or commentary.

Why bother? Well, I brought up textquotes specifically in relation to the use of rtf2latex2e. If anyone is going to bother using LaTeX, then there is a higher probability that they have a higher interest in the niggling details of fine typography, making ebooks look like books and not just text files. And text quotes is just one such detail, along with the proper use of hyphens, n-dashes, and m-dashes, ligatures, proper spacing after sentences but not abbreviations, non-breaking spaces, widows and orphans, etc. Thankfully, LaTeX takes care of most of those things automatically, but not the quotes thing.

That's useful only if one cares. I'm not a type-fascist myself, but the iLiad screen is so nice, I thought I might expend the effort for at least a few books just so I have something worthy of the screen. But I find just the plain PDF output from OO reads just fine as well. And the manybooks.net output for iLiad looks really very good.

NatCh
08-08-2006, 12:55 PM
Why bother? Well, I brought up textquotes specifically in relation to the use of rtf2latex2e. If anyone is going to bother using LaTeX, then there is a higher probability that they have a higher interest in the niggling details of fine typography, making ebooks look like books and not just text files. And text quotes is just one such detail, along with the proper use of hyphens, n-dashes, and m-dashes, ligatures, proper spacing after sentences but not abbreviations, non-breaking spaces, widows and orphans, etc.

Well, that makes sense, jsc. Please let me reiterate -- I wasn't trying to get anybody to change what they're doing, just trying to understand. Thanks for explaining. :happy2:

Is it just the hyphens that that particular author runs up against the quote marks? It's probably occurred to you that you could do the same search for hyphens as spaces. ( '-"' and '"-' ) Unless of course the joker is also starting his quotations with a hyphen: ---"-- quotation"---.

Strikes me as odd that the copy-eds don't rake him over for that. (shrug)

I certainly agree with wanting the text to look the best it can, especially on an e-ink screen (they really are amazing, I can't get over it), adds to the ease of reading at the very least.