04-23-2010, 11:25 PM | #1 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
RTF vs HTML---best way to convert my files?
I have been using HTML for my converted secure eReader files and this has lately been problematic. The HTML is very messy and has required numerous conversions---what was fine on the Sony was not fine on the Kindle, which was not fine on the Libre etc. etc. etc. I just want one basic file I can re-convert to any future format and read on all devices now. Presently, I want to convert to mobi and have an error-free file.
After going through a dozen HTML files, I found issues with line breaks, straight vs curly quotes and numerous inconsistencies. Every time I think all is fixed, I find some other error. I am running each file through KompoZer, copying and pasting the result from Firefox into a clean file, and I guess I just don't know enough about which problems to catch. I am wondering if it might be better to just copy the HTML into an RTF file and convert THAT in the future? So what should I do? Copy and paste from Firefox into Word and make them all RTF files, or develop some sort of HTML checklist I can use to verify---once and for all---the perfection of my files and then make HTML my archival format? I no longer buy secure eReader but I have about 200 files already and just don't have the heart to keep going through them all again every time I want to use a different reader (I review for Teleread and often test new ones). I just want one base file which is fine that I can re-convert forever and ever. |
04-24-2010, 12:38 AM | #2 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
Just replying to say that I have done some experimenting and both are imperfect. I think I am going to stick with HTML unless anyone has any better ideas. I have tried Sigil too and it didn't really help. Can someone point me toward a checklist of things I need to run a find and replace on? So far, I have figured out I need to replace double line breaks with proper <p> and </p> paragraphing, and also check apostrophes, em-dashes and curly quotes and do find and replaces for those. What else? My goal is to go through all these files---thoroughly---one more time and make sure they are as perfect as possible. I will be VERY dismayed to do all that checking on 267 files and then find something else I need to fix! I am at my wit's end here, I am not prepared to buy all these books again just to get a proper epub. I regret ever getting involved in the secure eReader racket. I just want files that will display on all my devices with proper paragraphs and without funny symbols where the quote marks should be. I am prepared to spend some time on the fixing, but not if I get a new reader down the road and will have to do it all again!
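For anyone who would rather script the two replacements mentioned above than run them by hand, here is a minimal Python sketch. It assumes the body text separates paragraphs with two or more consecutive <br> tags and contains no other block-level markup; the names `br_to_paragraphs`, `ascii_typography` and the `TYPOGRAPHY` table are made up for this illustration and are not part of any tool discussed in the thread.

```python
import re

# Two or more consecutive <br> tags (any spelling) mark a paragraph boundary.
DOUBLE_BREAK = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)

# Typographic characters mapped to numeric character references, so the
# saved file stays pure ASCII whatever encoding a converter assumes.
TYPOGRAPHY = {
    "\u2018": "&#8216;",  # left single quote
    "\u2019": "&#8217;",  # right single quote / apostrophe
    "\u201c": "&#8220;",  # left double quote
    "\u201d": "&#8221;",  # right double quote
    "\u2013": "&#8211;",  # en dash
    "\u2014": "&#8212;",  # em dash
    "\u2026": "&#8230;",  # ellipsis
}


def br_to_paragraphs(fragment: str) -> str:
    """Wrap each run of text between double line breaks in <p>...</p>."""
    paragraphs = []
    for chunk in DOUBLE_BREAK.split(fragment):
        chunk = " ".join(chunk.split())  # collapse stray whitespace
        if chunk:
            paragraphs.append(f"<p>{chunk}</p>")
    return "\n".join(paragraphs)


def ascii_typography(text: str) -> str:
    """Replace curly quotes, dashes and ellipses with numeric references."""
    for char, reference in TYPOGRAPHY.items():
        text = text.replace(char, reference)
    return text
```

A single <br> inside a paragraph is left alone, so verse or addresses keep their line breaks. Whether numeric references or real UTF-8 characters are the better archival choice is a judgment call; references are simply harder to corrupt.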
|
04-24-2010, 05:00 AM | #3 | |||
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
http://www.pepak.net/files/e-books/u...ble_people.zip It produces a file which is reasonably simple to convert to any format I tried.
|
04-24-2010, 05:06 AM | #4 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Also, this article of mine might be of interest:
http://www.pepak.net/e-books/vycisteni-html-knihy/ It deals with cleaning up HTML source (from FineReader) to the state you see in that Unspeakable People demo using regular expressions. Unfortunately, it is written in Czech, but you may be OK with Google Translate. A quick look reveals gems such as "Cutting off heads" (="Remove headers"), but it will give you an idea (you MUST combine it with the Czech version, though, because Google Translate destroys all CODE blocks) and besides, regular expressions and HTML are the same in all languages. Also, I provide ZIPped source files before and after each cleanup step, which will guide you a bit more. If there is enough interest, I may be willing to translate the article to English eventually. |
04-24-2010, 05:59 AM | #5 |
Wizard
Posts: 2,608
Karma: 3000161
Join Date: Jan 2009
Device: Kindle PW3 (wifi)
|
I read Czech a little but so poorly that I gave up on your interesting article some time ago.
A translation in English would indeed be very much appreciated. |
04-24-2010, 08:49 AM | #6 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
So what would be your checklist of eventual things to fix? So far I have found issues with curly quotes and apostrophes, so I went through and fixed those and then had trouble with em-dashes. I tried saving as plain text and they didn't convert to regular ones. So if I have to do a manual find and replace, I need a comprehensive list of what to look for so I only have to go through this once.
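One way to build such a list from the files themselves, instead of guessing, is to count every non-ASCII character a file contains. The following is only a sketch under the assumption that the file has already been read into a Python string; `audit_non_ascii` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def audit_non_ascii(text):
    """Return (code point, Unicode name, count) for each non-ASCII character,
    most frequent first."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    sample = "\u201cHi\u201d \u2014 it\u2019s \u201cok\u201d"
    for code_point, name, count in audit_non_ascii(sample):
        print(code_point, name, count)
```

Anything the report shows beyond quotes, dashes and ellipses is a candidate for the checklist. If U+FFFD (REPLACEMENT CHARACTER) or sequences beginning with U+00E2 and U+20AC turn up, the file was probably read or saved in the wrong encoding, which is a different problem from the characters themselves.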
|
04-24-2010, 09:30 AM | #7 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
I agree that HTML is the best source file format these days for easy conversion to others.
Have you tried just saving the HTML file directly from inside Firefox? I wouldn't go through KompoZer or any other WYSIWYG editor if I could possibly help it. In fact, the constant trouble I had with KompoZer screwing up my HTML files was the reason I finally decided to ditch WYSIWYG editors altogether. And the terrible quality of Word-generated HTML files is legendary. If you really must use a word processor, I've found that AbiWord tends to generate somewhat-decent HTML output for converting. The issues with quotation marks and en/em-dashes are probably a matter of saving the file in the wrong character encoding. I would think that saving the HTML file through Firefox itself would keep it in its original encoding. I guess you could do it manually by looking at what encoding Firefox is using (under View>Character encoding while viewing the page), and then copy and paste the source code into a sophisticated text editor (NOT something like Notepad!... but maybe, e.g., Notepad++), and then make sure it saves it in the same encoding. (I don't really know what the good editors are for Windows or Mac, since I use Linux.) But I would hope Firefox would take care of that for you if you just File > Save Page As... But if you're really interested in this stuff, consider learning HTML/CSS yourself. The tutorials at w3schools.com are quick, free, and probably thorough enough for your purposes. Last edited by frabjous; 04-24-2010 at 09:33 AM. |
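To make the encoding point concrete, here is a small Python sketch of re-saving a file as UTF-8 once its real encoding is known. The function names and the choice of cp1252 (Windows-1252) in the usage note are assumptions for illustration; the actual source encoding has to be checked per file, for example with the Firefox menu mentioned above.

```python
from pathlib import Path


def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes using the encoding the file was really saved in,
    then re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # raises UnicodeDecodeError on a mismatch
    return text.encode("utf-8")


def convert_file(source: Path, target: Path, source_encoding: str) -> None:
    """Write a UTF-8 copy of `source` to `target` (hypothetical helper)."""
    target.write_bytes(transcode_to_utf8(source.read_bytes(), source_encoding))
```

The "funny symbols" usually come from the opposite mistake: UTF-8 bytes being read as a single-byte encoding. In Python, `"\u201c".encode("utf-8").decode("cp1252")` yields the three characters `â€œ` in place of one opening curly quote. After transcoding, any charset declaration inside the HTML should be changed to match the new encoding.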
04-24-2010, 09:53 AM | #8 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
Tried saving from Firefox and got no line breaks on the Libre and the same issue with things like em-dashes. I think what I need is a checklist like replace smart quotes, replace apostrophe, replace em-dash etc. but I don't know what else to add. It's frustrating because some of these looked fine on the Kindle and I read them there so I don't want to spend precious reading time re-reading them line by line on another device just to check them all when I have so much else to read. I just want to know my source files are in order for future conversions and want to get them in order once and for all.
|
04-24-2010, 10:37 AM | #9 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
Update: someone suggested I save to mobi from epub instead of HTML and it looks like that solved all the problems. But I don't know what's going on behind the scenes. Is the resulting epub and/or mobi file 'clean' now and can it be my master file? I am just so sick of dealing with all of this. I don't buy this format anymore but am not prepared to throw away the books I have already. Can converting to epub and then using the mobi from that really solve all my problems? If so---
1) Do I still need to keep the original HTML?
2) If not, can I ditch the epub too and convert from mobi in the future?
3) Or should I save the epub (converted from HTML) for some other reason?
4) Will the epub or mobi master be better than the original HTML for future use?
5) Anything going on behind the scenes with these files which might be a problem later? |
04-24-2010, 12:58 PM | #10 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
I'd suggest keeping the ePub as the "master" file. ePub is easily edited, and easily converted to other formats. Additionally, compared to HTML, it packages everything together into a single file - text, images, metadata, etc.
|
04-24-2010, 01:32 PM | #11 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
Unfortunately, the current EPUB-generating tools leave a LOT to be desired. For example, Calibre-generated EPUB files are OK for display but almost useless for conversion as they contain too much junk.
|
04-24-2010, 03:18 PM | #12 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Sounds to me like there's something wrong with the Libre's ePub rendering. It's hard to understand why you'd get such bad results with it. Do the same ePubs look OK in Adobe Digital Editions?
|
04-24-2010, 05:37 PM | #13 |
Wizard
Posts: 2,409
Karma: 4132096
Join Date: Sep 2008
Device: Kindle Paperwhite/iOS Kindle App
|
The epubs look fine but I prefer not to use epubs since the page turning button on the right side does not work with the epub files, only with mobi. It is the mobi files I am having trouble with. For example:
- Standard HTML converted to mobi (fine on the Kindle) had no page line breaks
- RTF converted to mobi (terrible on both) had lots of errors for em-dashes and such
- RTF saved to HTML and then converted to mobi (fine on Kindle) had line breaks but also had formatting glitches
Best so far has been HTML converted to epub and then the epub converted to mobi (i.e. not converting the HTML to mobi but using the epub file as the source). This will take a while though since Calibre is slow in doing conversions for me. So before I go ahead and do them all, I want to make sure nothing is going on under the hood that will force me to re-do all of this later for some reason. |
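Since the two-step route (HTML to epub, then epub to mobi) has to be repeated for a couple of hundred files, it can be batched with Calibre's command-line converter, `ebook-convert`, which takes an input file and an output file. The Python below is only a sketch: the folder layout and the function names are assumptions, and no conversion options are passed, so Calibre's defaults apply.

```python
import subprocess
from pathlib import Path


def conversion_commands(html_file: Path, out_dir: Path) -> list:
    """Build the two ebook-convert calls for one book: HTML -> epub -> mobi."""
    epub = out_dir / (html_file.stem + ".epub")
    mobi = out_dir / (html_file.stem + ".mobi")
    return [
        ["ebook-convert", str(html_file), str(epub)],
        ["ebook-convert", str(epub), str(mobi)],
    ]


def convert_all(source_dir: Path, out_dir: Path) -> None:
    """Convert every .html file in source_dir; stop at the first failure."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for html_file in sorted(source_dir.glob("*.html")):
        for command in conversion_commands(html_file, out_dir):
            subprocess.run(command, check=True)
```

Calling `convert_all(Path("html_masters"), Path("converted"))` would leave the original HTML untouched and put both outputs in a separate folder, so whichever file ends up being the master, nothing is overwritten along the way.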
04-25-2010, 12:19 AM | #14 |
Guru
Posts: 610
Karma: 4150
Join Date: Mar 2008
Device: Sony Reader PRS-T3, Kobo Libra H2O
|
It seems to me you are looking for some magical conversion tool that will take a crappy input and produce a clean and easy-to-convert output. Unfortunately, there is no such tool. All converters try to handle such a situation, but the results are mixed and usually much worse than you would get by hand-editing. Sure, hand-editing is a lot of work and needs a lot of knowledge, but it can be done and once you try it a few times, the process is quite easy and straightforward. Unfortunately, it is not something that could be summarized into a "replace A with B" list - a lot of the necessary steps are done in an "I look at it and see the solution right away" way.
|
04-25-2010, 12:33 AM | #15 |
Wizard
Posts: 1,213
Karma: 12890
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
|
Try playing around with HTML Tidy - it does a lot of these things, but it may have a steep learning curve... not really sure; haven't played around with it enough myself.
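As a starting point for experimenting, this is a hedged Python sketch of calling the `tidy` command-line program on one file. It assumes `tidy` is installed and on the PATH and that the input is already UTF-8; the flags used (`-q` for quiet, `-asxhtml` for XHTML output, `-utf8` for input and output encoding, `-o` for the output file) and the exit codes are as I understand them from the Tidy manual, so check them against the installed version. The function names are invented for this example.

```python
import subprocess
from pathlib import Path


def tidy_command(source: Path, cleaned: Path) -> list:
    """Build a tidy call that writes a cleaned XHTML copy of `source`."""
    return ["tidy", "-q", "-asxhtml", "-utf8", "-o", str(cleaned), str(source)]


def run_tidy(source: Path, cleaned: Path) -> int:
    """Run tidy; exit code 1 is treated as warnings only, 2 as a real error."""
    result = subprocess.run(tidy_command(source, cleaned))
    if result.returncode >= 2:
        raise RuntimeError(f"tidy reported errors in {source}")
    return result.returncode
```

Writing to a separate output file keeps the original intact, which matters when the goal is a master copy that never has to be redone.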
|