Hi PhyrePhox, Me again ...
Quote:
Originally Posted by PhyrePhox
It appears that the original HTML was produced by Word, which has a reputation for producing gnarly code. A poor source indeed!
|
In my opinion, MSWord only produces poor HTML if you let it. The HTML output can be greatly improved by:
- Using MSWord styles correctly.
- Removing any incorrect hard line breaks before saving.
- Making sure the file is saved as type "Web Page, Filtered" to get simpler HTML without some of the MS "excess baggage".
Quote:
Originally Posted by PhyrePhox
Is there a summary somewhere here of what html tags are meaningful for ebooks? Also, how can I feed the resulting html back into Calibre to convert to epub?
|
I'm afraid I know nothing about editing on a Mac as I have a PC/Windows setup, but if you used MSWord as your editor of choice, these are the steps I'd take. Perhaps some of them can be "translated" into Mac steps.
- Open a new blank Word doc and import the Calibre-output HTML file you've already got.
- Try to remove the hard line breaks using the editor's Find-and-Replace for mass changes. If you're lucky, the "real" end-of-paragraphs may have a blank line immediately following, or the "real" start-of-paragraphs may have some leading blank spaces. I could elaborate on this if it's relevant to your particular file.
- Use one (or more) of the Word built-in Heading styles (e.g. Heading 2) to mark your chapter headings. Any paragraphs styled as "Heading 2" in Word are created with
<h2> ... </h2> tags in the HTML output.
Similarly, "Heading 1" creates <h1>...</h1> tags etc. Calibre can use these <h1>, <h2> etc tags during conversion to EPUB to specify the TOC.
Any paragraph styled as "Normal" in Word outputs as
<p class=MsoNormal>...</p> in the HTML output.
Any paragraph styled as "Normal (Web)" in Word outputs as
<p>...</p> in the HTML output.
Any paragraph styled as "Plain Text" in Word outputs as
<p class="MsoPlainText">...</p> in the HTML output -- which you've already come across. I'd restyle all of these as "Normal" or "Normal (Web)"
Any text marked as Italic or Bold in Word is output as
<i>...</i> or <b>...</b> in the HTML output.
I tend to use <h1> for Book Title and Author and <h2>, <h3> for Chapters, Sub-titles.
- Save the doc as HTML (as detailed above)
- If you're proficient with CSS, I'd then open the HTML file in a text editor, remove everything between the <style>...</style> tags, and put in a link to an external CSS file containing all the styling I wanted, e.g. lines like:
body {font-size: 100%; font-family: serif; ... ...}
h1 {...}
h2 {...}
p {...}
.MsoNormal {text-indent: 1.5em; ...}
If you're not good with CSS then leave the HTML alone.
- Once you're happy with the HTML, re-import it into Calibre by drag-and-drop in the normal way, or via the Edit-Metadata feature if you've already set up the book's metadata. Calibre will zip up the HTML file with any linked CSS file and/or images.
- Convert away ... Don't forget to specify the appropriate h1 h2 h3 levels in the "Structure detection" option.
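If you'd rather script two of the steps above than do them by hand, here is a rough Python sketch. It isn't part of Word or Calibre, and the function names and the default "style.css" filename are just my own placeholders. The first function only covers the lucky case where a blank line follows each "real" paragraph; the second swaps the <style>...</style> block for a link to an external CSS file.

```python
import re


def unwrap_hard_breaks(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: "real" paragraphs are separated by at least one blank
    (or whitespace-only) line. The leading-spaces case is not handled.
    """
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split wherever one or more blank lines occur.
    blocks = re.split(r"\n\s*\n", text.strip())
    # Inside each block, collapse every run of whitespace (including
    # the hard line breaks) into a single space.
    return "\n\n".join(" ".join(block.split()) for block in blocks)


def link_external_css(html: str, href: str = "style.css") -> str:
    """Replace the <style>...</style> block(s) with one <link> tag.

    'href' is a hypothetical filename; use whatever your CSS file is called.
    """
    link = f'<link rel="stylesheet" type="text/css" href="{href}">'
    style_block = re.compile(r"<style\b.*?</style>", re.IGNORECASE | re.DOTALL)
    # Put the link where the first style block was...
    html, replaced = style_block.subn(lambda _match: link, html, count=1)
    if replaced == 0:
        return html  # no style block found, leave the HTML untouched
    # ...then drop any further style blocks.
    return style_block.sub("", html)
```

You'd run unwrap_hard_breaks on the plain text before styling it in Word, and link_external_css on the HTML that Word saves. Keep a copy of the original file, as a regex like this is a blunt tool and won't understand unusual markup.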
Anyway, that's enough from me for the time being. I don't know how much is relevant for your circumstances but feel free to ask if you think I could help.
Happy New Year