RTF vs HTML---best way to convert my files?

ficbot · 04-23-2010, 11:25 PM

I have been using HTML for my converted secure eReader files and this has lately been problematic. The HTML is very messy and has required numerous conversions---what was fine on the Sony was not fine on the Kindle, which was not fine on the Libre etc. etc. etc. I just want one basic file I can re-convert to any future format and read on all devices now. Presently, I want to convert to mobi and have an error-free file.

After going through a dozen HTML files, I found issues with line breaks, straight vs curly quotes and numerous inconsistencies. Every time I think everything is fixed, I find some other error. I am running each file through KompoZer, copying and pasting the result from Firefox into a clean file, and I guess I just don't know enough about which problems to catch. I am wondering if it might be better to just copy the HTML into an RTF file and convert THAT in the future?

So what should I do? Copy and paste from firefox into Word and make them all RTF files, or develop some sort of HTML checklist I can use to verify---once and for all---the perfection of my files and then make HTML my archival format? I no longer buy secure eReader but I have about 200 files already and just don't have the heart to keep going through them all again every time I want to use a different reader (I review for Teleread and often test new ones). I just want one base file which is fine that I can re-convert forever and ever.

ficbot · 04-24-2010, 12:38 AM

Just replying to say that I have done some experimenting and both are imperfect. I think I am going to stick with HTML unless anyone has any better ideas. I have tried Sigil too and it didn't really help. Can someone point me toward a checklist of things I need to run a find and replace on? So far, I have figured out I need to replace double line breaks with proper <p> and </p> paragraphing, and also check apostrophes, em-dashes and curly quotes and do find and replaces for those. What else? My goal is to go through all these files---thoroughly---one more time and make sure they are as perfect as possible. I will be VERY dismayed to do all that checking on 267 files and then find something else I need to fix! I am at my wit's end here. I am not prepared to buy all these books again just to get a proper epub. I regret ever getting involved in the secure eReader racket. I just want files that will display on all my devices with proper paragraphs and without funny symbols where the quote marks should be. I am prepared to spend some time on the fixing, but not if I get a new reader down the road and will have to do it all again!
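The two fixes named in this post (double line breaks to paragraphs, and typographic characters that turn into funny symbols) could be scripted rather than done by hand. The following is only a minimal sketch: it assumes the "double line breaks" are pairs of `<br>` tags in the HTML body, and the function names `breaks_to_paragraphs` and `to_entities` are made up for illustration. Replacing curly quotes and dashes with numeric character references is one common workaround, not something the thread itself prescribes.

```python
import re

# Typographic characters mapped to numeric character references.
# References are plain ASCII, so they survive a wrong encoding guess.
ENTITY_TABLE = str.maketrans({
    "\u2018": "&#8216;",  # left single quote
    "\u2019": "&#8217;",  # right single quote / apostrophe
    "\u201c": "&#8220;",  # left double quote
    "\u201d": "&#8221;",  # right double quote
    "\u2013": "&#8211;",  # en dash
    "\u2014": "&#8212;",  # em dash
    "\u2026": "&#8230;",  # ellipsis
})

# Two or more consecutive <br>, <br/> or <br /> tags, in any letter case.
DOUBLE_BREAK = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)


def breaks_to_paragraphs(fragment):
    """Split a body fragment on double line breaks and wrap each chunk in <p>."""
    chunks = DOUBLE_BREAK.split(fragment)
    # Collapse internal whitespace and drop chunks that end up empty.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs if p)


def to_entities(text):
    """Replace curly quotes, dashes and ellipses with numeric references."""
    return text.translate(ENTITY_TABLE)


if __name__ == "__main__":
    sample = "It\u2019s late.<br><br>\u201cGo,\u201d she said\u2014twice."
    print(to_entities(breaks_to_paragraphs(sample)))
```

Single `<br>` tags are left alone on purpose, since they may be deliberate line breaks inside a paragraph (verse, addresses).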

pepak · 04-24-2010, 05:00 AM

Quote:

Originally Posted by ficbot

I am wondering if it might be better to just copy the HTML into an RTF file and convert THAT in the future?

Certainly not. RTF is, if anything, even harder to get into a "clean state" than HTML. What I would recommend is to stick to XHTML (because its extra restrictions over basic HTML allow for easier automatic conversions), and to a very limited subset of it: the fewer tags you use, the less likely you are to encounter problems. If you want some inspiration, download this:
http://www.pepak.net/files/e-books/u...ble_people.zip
It produces a file which is reasonably simple to convert to any format I tried.
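One way to act on the "fewer tags" advice is to audit each file for tags outside a small allowed set. The sketch below uses Python's standard `html.parser`; the `ALLOWED_TAGS` set is a hypothetical example, not the tag list used in the demo file linked above, and would need adjusting to whatever subset you settle on.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical whitelist: an assumption for illustration only.
ALLOWED_TAGS = {
    "html", "head", "title", "meta", "link", "body",
    "h1", "h2", "h3", "p", "em", "strong", "br", "img", "a",
}


class TagCounter(HTMLParser):
    """Count every opening (or self-closing) tag seen in the markup."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lower-cased.
        self.counts[tag] += 1


def unexpected_tags(markup, allowed=ALLOWED_TAGS):
    """Return {tag: count} for tags that are not in the allowed set."""
    parser = TagCounter()
    parser.feed(markup)
    parser.close()
    return {tag: n for tag, n in parser.counts.items() if tag not in allowed}


if __name__ == "__main__":
    sample = '<p>Hi <font size="2">there</font> <span>x</span></p>'
    print(unexpected_tags(sample))
```

An empty result does not prove a file is clean, but a non-empty one points straight at the editor-generated clutter (`font`, `span`, `div` and so on) worth removing first.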

Quote:

So what should I do? Copy and paste from firefox into Word and make them all RTF files, or develop some sort of HTML checklist I can use to verify---once and for all---the perfection of my files and then make HTML my archival format?

I would recommend HTML. Unfortunately, you will need to get down to the source and do all the coding yourself; if you use a graphical editor, the result will likely be poor.

Quote:

I just want one base file which is fine that I can re-convert forever and ever.

HTML+CSS is a good solution, as is, with certain limitations, XML+XSLT. You may also want to look at my H2LRF, which I use for precisely the task you want - one source format which I can easily (read: changing one parameter of the command, or changing one source file for all books) convert to any format.

pepak · 04-24-2010, 05:06 AM

Also, this article of mine might be of interest:
http://www.pepak.net/e-books/vycisteni-html-knihy/
It deals with cleaning up HTML source (from FineReader) to the state you see in that Unspeakable People demo using regular expressions. Unfortunately, it is written in Czech, but you may be OK with Google Translate. A quick look reveals gems such as "Cutting off heads" (="Remove headers"), but it will give you an idea (you MUST combine it with the Czech version, though, because Google Translate destroys all CODE blocks) and besides, regular expressions and HTML are the same in all languages. Also, I provide ZIPped source files before and after each cleanup step, which will guide you a bit more.

If there is enough interest, I may be willing to translate the article into English eventually.

roger64 · 04-24-2010, 05:59 AM

I read Czech a little but so poorly that I gave up on your interesting article some time ago.

A translation in English would indeed be very much appreciated.

ficbot · 04-24-2010, 08:49 AM

So what would be your checklist of eventual things to fix? So far I have found issues with curly quotes and apostrophes, so I went through and fixed them and then had trouble with em-dashes. I tried saving as plain text and they didn't convert to regular ones. So if I have to do a manual find and replace, I need a comprehensive list of what to look for so I only have to go through this once.

frabjous · 04-24-2010, 09:30 AM

I agree that HTML is the best source file format these days for easy conversion to others.

Have you tried just saving the HTML file directly from inside Firefox?

I wouldn't go through KompoZer or any other WYSIWYG editor if I could possibly help it. In fact, the constant trouble I had with KompoZer screwing up my HTML files was the reason I finally decided to ditch WYSIWYG editors altogether.

And the terrible quality of Word-generated HTML files is legendary.

If you really must use a Word Processor, I've found that AbiWord tends to generate somewhat-decent HTML output for converting.

The issues with quotation marks and en/em-dashes are probably a matter of saving the file in the wrong character encoding. I would think that saving the HTML file through Firefox itself would keep it in its original encoding. I guess you could do it manually by looking at what encoding Firefox is using (under View > Character Encoding while viewing the page), then copying and pasting the source code into a sophisticated text editor (NOT something like Notepad!... but maybe, e.g., Notepad++), and making sure it saves it in the same encoding. (I don't really know what the good editors are for Windows or Mac, since I use Linux.) But I would hope Firefox would take care of that for you if you just File > Save Page As...

But if you're really interested in this stuff, consider learning HTML/CSS yourself. The tutorials at w3schools.com are quick, free, and probably thorough enough for your purposes.
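The encoding point above can also be handled outside a text editor. This is a rough sketch under stated assumptions: it guesses that a file is either UTF-8 or Windows-1252 (a common pairing for files that have passed through Word, but still a guess), and the names `decode_best_effort` and `resave_as_utf8` are invented for this example.

```python
from pathlib import Path


def decode_best_effort(raw):
    """Return (text, encoding_used), trying UTF-8 first, then Windows-1252.

    Assumption: the file is one of these two encodings. Anything else
    would need a real detection step.
    """
    try:
        # "utf-8-sig" also strips a leading byte-order mark if present.
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Bytes undefined in Windows-1252 become U+FFFD instead of failing.
        return raw.decode("cp1252", errors="replace"), "cp1252"


def resave_as_utf8(src, dst):
    """Read src in its guessed encoding and write dst as UTF-8."""
    text, used = decode_best_effort(Path(src).read_bytes())
    Path(dst).write_text(text, encoding="utf-8")
    return used


if __name__ == "__main__":
    # Curly quotes as Windows-1252 bytes are not valid UTF-8.
    print(decode_best_effort(b"\x93hi\x94"))
```

After re-saving, the charset declared in the file's `<meta>` tag (if any) should also say UTF-8, otherwise a converter may still decode the text with the old setting.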

ficbot · 04-24-2010, 09:53 AM

Tried saving from Firefox and got no line breaks on the Libre and the same issue with things like em-dashes. I think what I need is a checklist like replace smart quotes, replace apostrophe, replace em-dash etc. but I don't know what else to add. It's frustrating because some of these looked fine on the Kindle and I read them there so I don't want to spend precious reading time re-reading them line by line on another device just to check them all when I have so much else to read. I just want to know my source files are in order for future conversions and want to get them in order once and for all.
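Since the open question here is what else belongs on the checklist, one option is to let each file report its own special characters instead of guessing. The sketch below lists every non-ASCII character in a piece of text with its Unicode name and count; `non_ascii_inventory` is a hypothetical helper name, and the text is assumed to have been decoded correctly already.

```python
import unicodedata
from collections import Counter


def non_ascii_inventory(text):
    """Return (code point, Unicode name, count) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = "\u201cHi\u201d \u2014 it\u2019s \u201cfine\u201d"
    for code, name, n in non_ascii_inventory(sample):
        print(code, name, n)
```

Running this over the whole collection and merging the results gives a list of characters that actually occur. An entry for U+FFFD (REPLACEMENT CHARACTER) would suggest a file was already damaged by an earlier bad conversion.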

ficbot · 04-24-2010, 10:37 AM

Update: someone suggested I save to mobi from epub instead of HTML and it looks like that solved all the problems. But I don't know what's going on behind the scenes. Is the resulting epub and/or mobi file 'clean' now and can it be my master file? I am just so sick of dealing with all of this. I don't buy this format anymore but am not prepared to throw away the books I have already. Can converting to epub and then using the mobi from that really solve all my problems? If so---

1) Do I still need to keep the original HTML?
2) If not, can I ditch the epub too and convert from mobi in the future?
3) Or should I save the epub (converted from HTML) for some other reason?
4) Will the epub or mobi master be better than the original HTML for future use?
5) Anything going on behind the scenes with these files which might be a problem later?

HarryT · 04-24-2010, 12:58 PM

I'd suggest keeping the ePub as the "master" file. ePub is easily edited, and easily converted to other formats. Additionally, compared to HTML, it packages everything together into a single file - text, images, metadata, etc.

pepak · 04-24-2010, 01:32 PM

Unfortunately, the current EPUB-generating tools leave a LOT to be desired. For example, Calibre-generated EPUB files are OK for display but almost useless for conversion as they contain too much junk.

frabjous · 04-24-2010, 03:18 PM

Sounds to me like there's something wrong with the Libre's ePub rendering. It's hard to understand why you'd get such bad results with it. Do the same ePubs look OK in Adobe Digital Editions?

ficbot · 04-24-2010, 05:37 PM

The epubs look fine, but I prefer not to use epubs since the page-turning button on the right side does not work with the epub files, only with mobi. It is the mobi files I am having trouble with. For example:

- Standard HTML converted to mobi (fine on the Kindle) had no page line breaks
- RTF converted to mobi (terrible on both) lots of errors for em-dashes and such
- RTF saved to HTML and then converted to mobi (fine on Kindle) had line breaks but also had formatting glitches

Best so far has been HTML converted to epub and then the epub converted to mobi (i.e. not converting the HTML to mobi but using the epub file as the source). This will take a while though, since Calibre is slow in doing conversions for me. So before I go ahead and do them all, I want to make sure nothing is going on under the hood that will force me to re-do all of this later for some reason.
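The two-step route described above (HTML to epub, then epub to mobi) could be batch-run with Calibre's `ebook-convert` command-line tool rather than through the GUI. This is only a sketch: it assumes `ebook-convert` is on the PATH, uses default conversion settings, and the function names and the `books` folder are placeholders.

```python
import subprocess
from pathlib import Path


def conversion_commands(html_path):
    """Build the two ebook-convert calls: HTML -> EPUB, then EPUB -> MOBI."""
    src = Path(html_path)
    epub = src.with_suffix(".epub")
    mobi = src.with_suffix(".mobi")
    return [
        ["ebook-convert", str(src), str(epub)],
        ["ebook-convert", str(epub), str(mobi)],
    ]


def convert_all(folder):
    """Run both steps for every .html file in folder (not recursive)."""
    for html_file in sorted(Path(folder).glob("*.html")):
        for cmd in conversion_commands(html_file):
            # check=True stops the batch at the first failed conversion.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in conversion_commands("books/title.html"):
        print(" ".join(cmd))
```

The original HTML files are left in place, so the batch can be repeated later with different settings or a different target format.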

pepak · 04-25-2010, 12:19 AM

It seems to me you are looking for some magical conversion tool that will take a crappy input and produce a clean and easy-to-convert output. Unfortunately, there is no such tool. All converters try to handle such a situation, but the results are mixed and usually much worse than you would get by hand-editing. Sure, hand-editing is a lot of work and needs a lot of knowledge, but it can be done, and once you try it a few times, the process is quite easy and straightforward. Unfortunately, it is not something that could be summarized into a "replace A with B" list - a lot of the necessary steps are done in an "I look at it and see the solution right away" way.

frabjous · 04-25-2010, 12:33 AM

Try playing around with HTML Tidy - it does a lot of these things, but it may have a steep learning curve... not really sure; haven't played around with it enough myself.
