View Full Version : RTF conversion.
Riocaz 07-21-2006, 08:19 AM Because it bugs me...
Anyone know of a tool which will strip font and size tags from an rtf file, but leave the bold and italic tags in place?
It would aid me in converting my rtf files to html (as they bloat the end file if you convert straight). For example, I stripped 1.4MB of unnecessary crud from one I was playing with yesterday and dropped the file size from 2.5MB to 1.1MB.
I know I saw such a tool when I was searching for an rtf/html conversion tool, but unfortunately I didn't grab it at the time, and now cannot find it.
ElaHuguet 07-21-2006, 08:24 AM There's a tool called... Tidy something-or-other (TidyHTML? TidyUI? I have it at home), which is really cool for cleaning up code, also works great for cleaning up general MS Word crud. It's free, so you can check it out. Just google it, I found it that way.
Riocaz 07-21-2006, 08:38 AM Doh! Why didn't I think of that? I was so focussed on finding something to "fix" the input file, I didn't think of "fixing" the output (even though I was trying to do that manually).
That looks perfect, I will have a play and tell you all how it does.
ElaHuguet 07-21-2006, 08:55 AM Hehehe... you're welcome, I found it easy to use.
meisterz 07-24-2006, 05:37 PM Where can I find this Tidy?? program?
Thanks
branko 07-24-2006, 06:07 PM It's the first link in Google. Do you know what Google is?
meisterz 07-25-2006, 09:27 AM By the sarcasm I assume it is either tidyhtml http://www.tucows.com/preview/206197 or tidy ui http://www.forums.devnetwork.net/viewtopic.php?p=156958&sid=d0c4504a83e8d6475be6b53b800f978c
branko 07-25-2006, 01:59 PM Sarcasm?
yokos 08-01-2006, 11:37 AM If you are a fan of almighty LaTeX give rtf2LaTeX a try. It works fine. http://sourceforge.net/projects/rtf2latex2e
rlauzon 08-01-2006, 11:47 AM Anyone know of a tool which will strip font and size tags from an rtf file, but leave the bold and italic tags in place?
I usually use OpenOffice.
1. I convert the RTF into an HTML file.
2. Reload the HTML file back into OpenOffice.
3. I use the source view to do a Find/Replace on all the offending tags.
4. Then I convert the HTML into a regular OpenOffice file to save it.
5. And finally, I export to PDF to put it on my iLiad.
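The Find/Replace in step 3 can also be scripted. Here is a minimal Python sketch, assuming the offending tags are plain `<font ...>` tags in the exported HTML (check your own file first, since the export may differ). The name `strip_font_tags` is made up for illustration.

```python
import re

# Matches opening and closing font tags, whatever attributes they carry.
# Assumption: the unwanted markup is <font ...>...</font>; bold and italic
# tags are left alone because the pattern never matches them.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)

def strip_font_tags(html: str) -> str:
    """Remove font tags but keep the text and other tags inside them."""
    return FONT_TAG.sub("", html)

sample = '<p><FONT FACE="Arial" SIZE=3>Plain <b>bold</b> and <i>italic</i></FONT></p>'
print(strip_font_tags(sample))
```

This only deletes the tags themselves, so it will not help with formatting that lives in `style` attributes or CSS classes.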
I just got rtf2latex2e compiled for OS X and used it to convert a Baen RTF. It helped some, but I have to say it's simply not very good. I had to deal with a great number of unbalanced environment tags (italics started but never ended) and a large section of boldface which was not visible in the RTF. This may be because the RTF file included some formatting badness, but I think the above suggestion to use OpenOffice to convert to xhtml is better.
Unfortunately, the xhtml generated by OpenOffice uses CSS heavily, so it's not always obvious what markup to substitute. Italics are not done with an i tag but with a p tag that has a CSS class. Still, it puts the text in a format that's at least workable. The final issue is to replace double and single quotes with appropriate text quotes. I'm working on a script to do that heuristically (you can't just count on there being left-right quote pairs, since multiple paragraphs in quotes are traditionally started with but not ended by a text quote).
NatCh 08-02-2006, 06:33 PM The final issue is to replace double and single quotes with appropriate text quotes
Okay, I'm not trying to be obnoxious, really I'm not, but I can't think of another way to ask this. :blink:
Why is this such a deal? It doesn't bother me at all if it's a "" instead of “” -- either way I get that it's a quote.... Is it just a matter of preference, or am I missing something here?
As a suggestion to address this, wouldn't it be a “ if it has a non-whitespace character after it, and a ” otherwise? Maybe that helps with the find/replace.
I think I'd try searching for "<whitespace> and replace all those with ”<whitespace>, and then search for all the remaining " and replace with “
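A rough Python sketch of that two-pass find/replace, just to show the idea (the function name `curl_double_quotes` is made up, and it only handles double quotes):

```python
import re

def curl_double_quotes(text: str) -> str:
    """Two-pass replacement: closing quotes first, then the rest as opening."""
    # Pass 1: a straight quote followed by whitespace (or the end of the
    # text) is treated as a closing quote.
    text = re.sub(r'"(?=\s|$)', "\u201d", text)
    # Pass 2: every straight quote still left is treated as an opening quote.
    return text.replace('"', "\u201c")

print(curl_double_quotes('He said "hello there" and left.'))
```

It will guess wrong whenever a closing quote is followed by something other than whitespace, such as a period or a dash, so the result still needs a look by eye.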
Riocaz 08-03-2006, 06:51 AM JSC: I would suspect the original file. I had similar problems with size/justification/etc when converting them.
NatCh: If you are missing something then so am I. I find "zzz" ''zzz'' “zzz” almost indistinguishable. So it's a matter of personal preference IMO.
NatCh 08-03-2006, 12:48 PM NatCh: If you are missing something then so am I. I find "zzz" ''zzz'' “zzz” almost indistinguishable. So it's a matter of personal preference IMO.
Thanks, Riocaz, that's the conclusion I was drawing too. Guess it comes down to "I likes what I likes," which is as it should be. :happy2:
NatCh, your suggestion about spaces before or after is correct, and I've been doing that, but there are situations where it does not apply. Especially with the first author I've been working at converting, who tends to use a lot of m-dashes to interject comments within speeches, you get a lot of ---"text and text"--- and you cannot tell from the punctuation alone whether the text is quote or commentary.
Why bother? Well, I brought up text quotes specifically in relation to the use of rtf2latex2e. If anyone is going to bother using LaTeX, then there is a higher probability that they have a higher interest in the niggling details of fine typography, making ebooks look like books and not just text files. And text quotes are just one such detail, along with the proper use of hyphens, n-dashes, and m-dashes, ligatures, proper spacing after sentences but not abbreviations, non-breaking spaces, widows and orphans, etc. Thankfully, LaTeX takes care of most of those things automatically, but not the quotes thing.
That's useful only if one cares. I'm not a type-fascist myself, but the iLiad screen is so nice, I thought I might expend the effort for at least a few books just so I have something worthy of the screen. But I find just the plain PDF output from OO reads just fine as well. And the manybooks.net output for iLiad looks really very good.
NatCh 08-08-2006, 12:55 PM Why bother? Well, I brought up textquotes specifically in relation to the use of rtf2latex2e. If anyone is going to bother using LaTeX, then there is a higher probability that they have a higher interest in the niggling details of fine typography, making ebooks look like books and not just text files. And text quotes is just one such detail, along with the proper use of hyphens, n-dashes, and m-dashes, ligatures, proper spacing after sentences but not abbreviations, non-breaking spaces, widows and orphans, etc.
Well, that makes sense, jsc. Please let me reiterate -- I wasn't trying to get anybody to change what they're doing, just trying to understand. Thanks for explaining. :happy2:
Is it just the hyphens that that particular author runs up against the quote marks? It's probably occurred to you that you could do the same search for hyphens as for spaces ( '-"' and '"-' ). Unless of course the joker is also starting his quotations with a hyphen: ---"-- quotation"---.
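Something like this Python sketch is what I have in mind: treat hyphens and dashes the same as whitespace when deciding which way a quote faces, and leave alone any quote that has such a character on both sides (or on neither) so it can be fixed by hand. The names `BOUNDARY` and `curl_quotes_with_dashes` are made up for illustration.

```python
# Characters treated as a "boundary" next to a quote mark: whitespace,
# the plain hyphen, and the en and em dash characters.
BOUNDARY = set(" \t\n-\u2013\u2014")

def curl_quotes_with_dashes(text):
    """Return (converted_text, positions_of_ambiguous_quotes)."""
    out = []
    ambiguous = []
    for i, ch in enumerate(text):
        if ch != '"':
            out.append(ch)
            continue
        # Start and end of the text count as boundaries too.
        before = i == 0 or text[i - 1] in BOUNDARY
        after = i == len(text) - 1 or text[i + 1] in BOUNDARY
        if before and not after:
            out.append("\u201c")   # boundary on the left only: opening quote
        elif after and not before:
            out.append("\u201d")   # boundary on the right only: closing quote
        else:
            out.append(ch)         # cannot tell: keep it straight and report it
            ambiguous.append(i)
    return "".join(out), ambiguous

print(curl_quotes_with_dashes('---"text and text"---'))
print(curl_quotes_with_dashes('---"-- quotation"---'))
```

The second call is the awkward case: the first quote has hyphens on both sides, so it comes back unconverted with its position in the list.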
Strikes me as odd that the copy-eds don't rake him over for that. (shrug)
I certainly agree with wanting the text to look the best it can, especially on an e-ink screen (they really are amazing, I can't get over it). It adds to the ease of reading at the very least.