#1 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
Table Of Contents Beyond TOC3
I am using Calibre 4.6.0 to convert a Word .docx file to .html for a web-based user manual.
The conversion path is DOCX > HTMLZ > ZIP. The source .docx file has a table of contents built by Word with four levels extracted from Word's Heading1 - Heading4 styles, e.g.:

1 This is a First Level Chapter Title
1.1 This is a Second Level Chapter Title
1.1.1 This is a Third Level Chapter Title
1.1.1.1 This is a Fourth Level Chapter Title

Calibre does a good job of performing the conversion using the TOC from the Word document source: it correctly extracts the four levels into the TOC and hyperlinks them to the target headings in the text body. A couple of minor .css tweaks correct some text misalignment issues at the target headings.

One issue: the fourth-level target does not show the chapter ID at the target. For example, "1.1.1.1 This is a Fourth Level Chapter Title" shows correctly in the TOC, but without the chapter ID of 1.1.1.1 at the target text. TOC1-TOC3 show the chapter ID correctly both in the TOC and at the target text.

How do I configure Calibre to show the chapter ID for TOC level 4? |
#2 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
ToC entry text comes from whatever is included in the ToC entry in the docx file; it has nothing to do with levels.
|
#3 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
OK.
Attached are screenshots from a simplified four-level docx file. The first shows the Table of Contents that Word generates from the body text that follows. The body text headings use the "Heading 1" - "Heading 4" styles. The results files show the .html and EPUB output from Calibre.

In the body text, the numbering on the level 4 heading is missing and the margin offset for the level 3 title is incorrect. The .html shows that the first three body text headings are generated as divs, but the level 4 heading is generated as an H4 tag.

So the Table of Contents generated by Calibre is correct, but the body text does not follow the docx source. How can this be resolved? |
#4 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
#5 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
As requested, attached is a zip archive with the source docx file, Calibre log files from DOCX>HTMLZ, DOCX>EPUB, HTMLZ>ZIP conversions and the Calibre conversion results files.
|
#6 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
As a point of reference, the HTML as generated by Word for the test.docx file.
Screenshot from Firefox and the generated HTML. |
#7 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That file has really weird markup. How are you generating it?
Open the docx file in Word, right-click on the Heading 3 style, go to numbering and then change list level, and you will see there is a huge indent on that style. That is why you see the indent in the conversion output. Fix that and you will be fine.

As for Heading 4 not getting numbering, it is because its numbering style inherits the lvlid from a parent style, something I have never seen and didn't know was possible. Anyway, the next release of calibre will handle that. |
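The style inheritance described above can be inspected directly, because a .docx file is a zip archive whose word/styles.xml holds the style definitions. The following is a minimal Python sketch, not calibre's actual implementation. It assumes that a numbering value missing from a style (the list id `w:numId` or the level `w:ilvl`) is taken from the nearest parent in the `w:basedOn` chain, and that the heading styles use ids such as "Heading3" and "Heading4"; both are assumptions to check against the real file.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/styles.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(elem, path):
    """Return the w:val attribute of the first match for path, or None."""
    found = elem.find(path, NS)
    return None if found is None else found.get(f"{{{W}}}val")


def load_styles(styles_xml):
    """Map styleId -> (basedOn, numId, ilvl) exactly as written in styles.xml."""
    root = ET.fromstring(styles_xml)
    styles = {}
    for style in root.findall("w:style", NS):
        style_id = style.get(f"{{{W}}}styleId")
        styles[style_id] = (
            _val(style, "w:basedOn"),
            _val(style, "w:pPr/w:numPr/w:numId"),
            _val(style, "w:pPr/w:numPr/w:ilvl"),
        )
    return styles


def resolve_numbering(styles, style_id):
    """Walk the basedOn chain, keeping the nearest numId and ilvl found."""
    num_id = ilvl = None
    seen = set()  # guards against a circular basedOn chain
    while style_id in styles and style_id not in seen:
        seen.add(style_id)
        based_on, own_num_id, own_ilvl = styles[style_id]
        num_id = own_num_id if num_id is None else num_id
        ilvl = own_ilvl if ilvl is None else ilvl
        style_id = based_on
    return num_id, ilvl


def read_styles_xml(docx_path):
    """Read word/styles.xml out of a .docx archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/styles.xml")
```

A call such as `resolve_numbering(load_styles(read_styles_xml("test.docx")), "Heading4")` would then show which list id and level the style ends up with, and comparing that with the raw tuple in `load_styles(...)["Heading4"]` shows which of the two values is inherited rather than set on the style itself.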
#8 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
Thanks for the clarifications. I will look into this.
The HTML files contained in the file ExcelWeb.zip were generated via Word's "Save As HTML". There are two formats: Unfiltered and Filtered. The one that was sent was the Unfiltered format, which has all the funky conditionals encoded in it: really horrendous markup and over 200KB! Attached is the generated file in the "Filtered" format and a screenshot of the results in Firefox. The end result is the same as Unfiltered, with all headings in the body area displayed correctly. |
#9 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
Thanks, I will look into the Heading3 style changes.
The HTML was generated via Word's "Save as HTML". There are two variants:

Unfiltered: This has the funky mso markup. This is the format that was sent, with the horrendous markup.
Filtered: This is a relatively clean HTML version.

Attached is the output in the filtered format and a screenshot of it in Firefox. The results are the same, with the body headings rendered correctly. |
#10 |
creator of calibre
Posts: 45,345
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
No, I meant: how are you generating the docx file?
|
#11 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
The docx file was composed in Microsoft Office 2019 Word 2019 (version 16.0.12228.20100, 32-bit) on Windows 10. |
#12 |
null operator (he/him)
Posts: 21,723
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@aap - the zips you posted have MacOS artefacts. Are you using calibre on MacOS or Windows? It should not make any difference, though.
Can you post the DOCX you saved from Windows MS Word? You'll need to put it in a zip to post it here, or upload it to Dropbox/OneDrive/wherever and post the link. BR |
#13 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
BetterRed, sorry about that.
I work mostly in an OSX environment (High Sierra), but with various MS Windows and Windows Office versions under Parallels Desktop. The Word and Calibre work is done under Windows 10. I might have used the OSX compress utility to create the .zip file in some of the previous postings. Attached is the .docx file compressed under Windows 10. |
#14 |
null operator (he/him)
Posts: 21,723
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
I printed all the styles, see attached PDF. It's a very long list because of all those xnnn styles that I suspect come from Excel.
One thing I notice is that Heading styles 1, 2 and 3 are based on Normal, whilst Heading style 4 is based on Heading style 3. The TOC styles are similar: some are based on TOC 1 and others on Normal. Which leads me to ask - why? BR |
#15 |
Member
Posts: 10
Karma: 10
Join Date: Dec 2019
Device: HTML
|
I'm not sure I can explain definitively why that is the case.
The origins of this particular file go back several years, and it has been edited in several versions of MS Word, both on Mac and PC. The basic chapter framework and automatic TOC have remained the same at four numbered levels. The file is now over 150 pages. I recall that the original chapter heading and TOC styles were "standard" Word styles. However, it is possible that the styles got modified/corrupted over time.

That said, the Heading and TOC elements do get displayed correctly in Word on-screen, printed, and via Word's HTML generation capability. I prefer the HTML output from Calibre (as opposed to Word) as it is much cleaner and easier to fold into my HTML5 and CSS3 page templates. It's a trivial hack of the Calibre css to get the headings to align properly, but adding the missing Heading4 numbering by hand is a formidable editing task.

Any ideas on how to modify the source Word document and/or styles for a 4-level chapter and TOC structure would be appreciated. |
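Until the source styles are repaired or a fixed calibre release is available, one possible stopgap is to compute the missing numbers by script rather than by hand. The sketch below is only an illustration of the dotted numbering scheme used in this thread (1, 1.1, 1.1.1, 1.1.1.1). The function name `outline_numbers` is hypothetical, and extracting the heading levels from the generated HTML and writing the numbers back is not shown, since that depends on markup not reproduced here.

```python
def outline_numbers(levels):
    """Yield dotted outline numbers for a sequence of heading levels (1-based)."""
    counters = []
    for level in levels:
        if level < 1:
            raise ValueError("heading levels start at 1")
        # Pad with zeros if a level is skipped, then drop counters
        # for levels deeper than the current heading.
        counters += [0] * (level - len(counters))
        del counters[level:]
        counters[level - 1] += 1
        yield ".".join(str(n) for n in counters)
```

For example, `list(outline_numbers([1, 2, 3, 4, 4, 2]))` gives `['1', '1.1', '1.1.1', '1.1.1.1', '1.1.1.2', '1.2']`, so the level-4 entries can be prefixed with the numbers that Word shows in its own TOC.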
Tags |
"table of contents", docx input, html conversion |
|