Add title="" to h* based on existing TOC -- suggestion for new feature (or plugin?)

Mister L · 06-20-2020, 11:42 AM

How easy / possible would it be to reverse engineer an existing TOC and add the existing titles as they appear, to a title="" in an h* tag at the appropriate point in the book? If the file is really badly made and the chapter titles are in some random tag like p or div it might be necessary to add a blank h* with a display:none to it.

I'm not sure whether this is better suited to a feature in the Tools > Table of Contents menu or as a plugin (super subtle WINK to any plugin coders who are bored and looking for a new challenge...) but it would be an amazing tool to have.

Use cases:

1. you need to combine several epubs into a "collected works" file, or

2. you need to separate a "collected works" file into its individual books and make independent epubs of each,

and

the original epubs you are given have chapter headings in two (or more) parts, and/or with extraneous code in them which will make regenerating the TOC complicated, for instance:

Code:

    <h1 epub:type="title" class="part_n"><span>4</span></h1>

    <h1 epub:type="title" class="part_tit"><span>The#160;Whale speaks of#160;what#160;she has#160;learned about#160;humans</span></h1>

(note I deleted the "&" to avoid the & #160's being parsed)

Existing (desired) TOC entry:
4. The Whale speaks of what she has learned about humans

Or (even worse...)

Code:

    <h1 id="toc_marker-26">21</h1>

    <h2><span class="Cap">E</span><span class="SmallCap">N CHEMIN POUR</span> <span class="Cap">S</span><span class="SmallCap">HADAR</span> <span class="Cap">L</span><span class="SmallCap">OGOTH</span></h2>

Existing (desired) TOC entry:
21. En chemin pour Shadar Logoth

(Note, just to be PERFECTLY CLEAR, I had absolutely nothing to do with making these monstrosities originally, or I wouldn't have this problem.)

Those examples are taken straight from actual books I'm working on. Last week I had to deal with case 2, and this week I've got to tackle case 1: 14 epubs, not a single one of which has chapter titles that will facilitate regenerating the TOC once I've pulled them all into the collection. That's a lot of fiddly regex-ing and/or hand-coding, one by one, to copy the TOC entries into title="". (That's what I did last week because I couldn't think of a better solution, and it was pretty damned annoying on just one book, let alone 14.) This isn't the first time and certainly won't be the last, so I'm hoping that by the next time I have to deal with this there will be a better way.

If there already is a better way and I just don't know about it (I did go through the plugin index just in case...), by all means PLEASE tell me.

Turtle91 · 06-20-2020, 03:29 PM

Quote:

Originally Posted by Mister L

...
If there already is a better way and I just don't know about it (I did go through the plugin index just in case...), by all means PLEASE tell me.

Unfortunately there are so many different examples of how people do them badly...it would be very difficult to encompass all cases. I usually just use regex.

DiapDealer · 06-20-2020, 03:34 PM

The problem is that Sigil already does the exact reverse of what you want. The title attribute of an h tag (if present) is used by Sigil to generate the text of the ToC. That's what allows users to generate ToCs that have different text than what's between the h tags.
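The precedence rule described here (title attribute first, heading text otherwise) can be illustrated with a small Python sketch. This is not Sigil's actual code; the function name `toc_text` is made up for the illustration, and it assumes each heading is a well-formed XML fragment.

```python
import xml.etree.ElementTree as ET

def toc_text(heading_xml):
    """Return the TOC label for one heading: its title attribute if present, else its text."""
    elem = ET.fromstring(heading_xml)
    title = elem.get("title")
    if title:
        return title
    # No title attribute: flatten nested inline tags and collapse whitespace.
    return " ".join("".join(elem.itertext()).split())

print(toc_text('<h1 title="21. En chemin pour Shadar Logoth">21</h1>'))  # 21. En chemin pour Shadar Logoth
print(toc_text('<h2>Article <span>Title</span></h2>'))                   # Article Title
```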

Mister L · 06-21-2020, 08:53 AM

Quote:

Originally Posted by DiapDealer

The problem is that Sigil already does the exact reverse of what you want. The title attribute of an h tag (if present) is used by Sigil to generate the text of the ToC. That's what allows users to generate ToCs that have different text than what's between the h tags.

Yes, my point exactly.

That is precisely what makes me think it should be possible to add a variation of that feature.

When the TOC is made, Sigil already knows what each TOC entry should say and what part of the document it is linked to (whether it harvested the info from title attributes or otherwise). It can assemble this info into a new file (nav.xhtml and toc.ncx). It puts it together using appropriate tags.
From there, it should be possible to ask it to redistribute the same elements in the opposite direction: the nav is the source of the information rather than the destination and each title is copied back to its destination. Sigil will either have a toc id there, or an h* (with or without a title=""), or nothing if the link just goes to the file. If there is already a title="" overwriting it could be useful if you've made some changes directly in the nav. If there is no title, it can be added.

If you are worried about potential conflict with existing code, rather than asking it to add this to a title="" it can be added as an html comment or some other code that seems appropriate to you; maybe something like <section title="Text of title" /> or <a title="Text of title" /> or anything else. From there it would be fairly trivial to regex the text into a title="" and be able to easily regenerate the TOC as needed.

Obviously this would be a separate feature to generating the toc even if it's closely related, just like there is a separate "epub3 tools" menu to generate the ncx from the nav, and it wouldn't be necessary for every book, but it would be useful in a lot of cases, and when it's useful it's REALLY useful. I frequently have requests to modify files made by someone else, for example the cases I mentioned above or things like adding a preview of the next book at the end of a book that's already published or a new introduction or something like that. This almost always requires some intervention in the TOC and I have never once seen a book that wasn't my own that made it easy to modify the TOC.

Quote:

Originally Posted by Turtle91

Unfortunately there are so many different examples of how people do them badly...it would be very difficult to encompass all cases. I usually just use regex.

Ah, so you also have dealt with this mess. Yes, I use regex too, but it can be really time consuming because of all the variations and ultimately it's always necessary to do some of it by hand. Wouldn't it be nice if you could just grab all the correct titles from the toc and turn them into references and then just click "Generate TOC", instead of messing about with regex?

KevinH · 06-21-2020, 10:07 AM

A plugin might be best for this case.

Just so that everyone is on the same page ...

It would take an existing nav or ncx, follow the links back to the target file and element, and add a title attribute to it (remembering to html escape any text) based on the current TOC. If an existing link is to the top of a file, inject a new h1 tag with nodisplay set on it with that title.

The idea is that after running this plugin, you should be able to regenerate the TOC from h tags in Sigil and get something very very close to the original TOC back.

Is that correct?

KevinH
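The steps summarised above can be sketched in plain Python. This is only a rough illustration, not an existing plugin: the function names are hypothetical, the keys of `files` are assumed to match the hrefs exactly as written in the nav, and `style="display:none"` stands in for whatever "nodisplay" mechanism would really be used. In a real Sigil plugin the reading and writing of files would go through the plugin's book container object, which is left out here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote

XHTML = "http://www.w3.org/1999/xhtml"
EPUB = "http://www.idpf.org/2007/ops"
ET.register_namespace("", XHTML)      # serialise XHTML without ns0: prefixes
ET.register_namespace("epub", EPUB)

def nav_entries(nav_xhtml):
    """Yield (file, fragment, label) for each link inside <nav epub:type="toc">."""
    root = ET.fromstring(nav_xhtml)
    for nav in root.iter(f"{{{XHTML}}}nav"):
        if nav.get(f"{{{EPUB}}}type") != "toc":
            continue  # skip landmarks, page-list, etc.
        for a in nav.iter(f"{{{XHTML}}}a"):
            href = a.get("href")
            if not href:
                continue
            path, _, frag = unquote(href).partition("#")
            label = " ".join("".join(a.itertext()).split())
            yield path, frag, label

def add_titles(nav_xhtml, files):
    """Return a copy of files (href -> XHTML text) with TOC labels copied into title attributes."""
    trees = {}
    for path, frag, label in nav_entries(nav_xhtml):
        if path not in files:
            continue  # link points at a file we were not given
        if path not in trees:
            trees[path] = ET.fromstring(files[path])
        root = trees[path]
        target = None
        if frag:
            target = next((e for e in root.iter() if e.get("id") == frag), None)
        if target is not None:
            # Limitation: if the id sits on a non-heading tag, this title
            # would not be picked up when regenerating the TOC from h tags.
            target.set("title", label)
            continue
        # Link goes to the top of the file (or the id is missing):
        # inject a hidden, empty h1 that carries the label.
        body = root.find(f"{{{XHTML}}}body")
        if body is None:
            continue
        h1 = ET.Element(f"{{{XHTML}}}h1", {"title": label, "style": "display:none"})
        body.insert(0, h1)
    result = dict(files)
    for path, root in trees.items():
        # ElementTree escapes attribute text on output; the XML declaration
        # and doctype of the original file are not preserved by this sketch.
        result[path] = ET.tostring(root, encoding="unicode")
    return result
```

Letting the XML serialiser write the attribute takes care of the escaping step automatically, which is the main reason this sketch parses the files instead of patching them with string replacements.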

Doitsu · 06-21-2020, 10:19 AM

Quote:

Originally Posted by Mister L

When the TOC is made, Sigil already knows what each TOC entry should say and what part of the document it is linked to (whether it harvested the info from title attributes or otherwise).

The problem is that heading formats aren't predictable. You yourself gave two examples. In the first example, the heading consisted of two <h1> tags and in the second example it consisted of <h1> and <h2> tags.

BTW, both problems can be easily fixed with the right regular expressions. For example, you could use the following expressions to merge the two <h1> tags:

Find: <h1 epub:type="title" class="part_n"><span>(\d+)</span></h1>\s+<h1 epub:type="title" class="part_tit"><span>(.*?)</span></h1>
Replace: <h1 epub:type="title" class="part_n" title="\1: \2"><span>\1</span><br /><span class="part_tit">\2</span></h1>

If you process the first heading format with it and then generate the TOC, Sigil will add the following entry:

4: The Whale speaks of what she has learned about humans
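For anyone who wants to try the expression outside Sigil, here is the same find/replace run through Python's `re` module. The sample heading is shortened from the first post, with the `&` of the `&#160;` entity restored; the variable names are only for this illustration.

```python
import re

FIND = (r'<h1 epub:type="title" class="part_n"><span>(\d+)</span></h1>\s+'
        r'<h1 epub:type="title" class="part_tit"><span>(.*?)</span></h1>')
REPLACE = (r'<h1 epub:type="title" class="part_n" title="\1: \2">'
           r'<span>\1</span><br /><span class="part_tit">\2</span></h1>')

sample = '''<h1 epub:type="title" class="part_n"><span>4</span></h1>

<h1 epub:type="title" class="part_tit"><span>The&#160;Whale speaks</span></h1>'''

# Merge the two headings into one and copy "number: text" into title="".
merged = re.sub(FIND, REPLACE, sample)
print(merged)
```

One caveat: the heading text is copied into the attribute as-is, so a heading that contains inline tags or straight double quotes would produce invalid markup and would need extra handling.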

Mister L · 06-21-2020, 01:43 PM

Quote:

Originally Posted by KevinH

A plugin might be best for this case.

Just so that everyone is on the same page ...

It would take an existing nav or ncx, follow the links back to the target file and element, and add a title attribute to it (remembering to html escape any text) based on the current TOC. If an existing link is to the top of a file, inject a new h1 tag with nodisplay set on it with that title.

The idea is that after running this plugin, you should be able to regenerate the TOC from h tags in Sigil and get something very very close to the original TOC back.

Is that correct?

KevinH

Yes that is exactly right. Would it be difficult to make a plugin for that?

Quote:

Originally Posted by Doitsu

The problem is that heading formats aren't predictable. You yourself gave two examples. In the first example, the heading consisted of two <h1> tags and in the second example it consisted of <h1> and <h2> tags.

BTW, both problems can be easily fixed with the right regular expressions. For example, you could use the following expressions to merge the two <h1> tags:

Find: <h1 epub:type="title" class="part_n"><span>(\d+)</span></h1>\s+<h1 epub:type="title" class="part_tit"><span>(.*?)</span></h1>
Replace: <h1 epub:type="title" class="part_n" title="\1: \2"><span>\1</span><br /><span class="part_tit">\2</span></h1>

If you process the first heading format with it and then generate the TOC, Sigil will add the following entry:

4: The Whale speaks of what she has learned about humans

Yes, so far I have been relying on regex for these cases (and if you have a regex for the fake smallcaps example I'd love to know it, I did that one last week and ended up just copying over the titles by hand). But precisely because they are not predictable, I have to figure out a new regex every time depending on the specific characteristics of the file rather than just having a saved search I can run, and it can be very time-consuming (especially as my regex skills are somewhat limited), and it's a bit frustrating knowing that the exact information needed is already in the book but not easily exploited. I've had a whole series of these cases recently (and another very big one to do this week) which is why I started to think there must be a better way to do it.
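For the fake small-caps heading from the first post, one possible approach (a Python sketch, not a regex, and not tested on real books) is to use the class names themselves: text in a "SmallCap" span is lowercased and text in a "Cap" span is kept as typed. The class names come from that one example, so other files would need different ones, and the function name is invented for the illustration.

```python
import xml.etree.ElementTree as ET

def smallcaps_to_text(heading_xml, small_class="SmallCap"):
    """Flatten a Cap/SmallCap heading into normally cased text."""
    elem = ET.fromstring(heading_xml)
    parts = [elem.text or ""]
    for child in elem:
        text = "".join(child.itertext())
        if child.get("class") == small_class:
            text = text.lower()  # small caps were typed in capitals
        parts.append(text)
        parts.append(child.tail or "")  # keeps the spaces between spans
    return " ".join("".join(parts).split())

h2 = ('<h2><span class="Cap">E</span><span class="SmallCap">N CHEMIN POUR</span> '
      '<span class="Cap">S</span><span class="SmallCap">HADAR</span> '
      '<span class="Cap">L</span><span class="SmallCap">OGOTH</span></h2>')
print("21. " + smallcaps_to_text(h2))  # 21. En chemin pour Shadar Logoth
```

This only works because the original markup happens to encode which letters are real capitals; a heading typed entirely in capitals would not carry that information.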

Doitsu · 06-21-2020, 02:02 PM

Quote:

Originally Posted by Mister L

[...] and if you have a regex for the fake smallcaps example I'd love to know it, I did that one last week and ended up just copying over the titles by hand).

Why don't you post your fake smallcaps question in the Regex subforum?

Mister L · 06-22-2020, 07:09 AM

Quote:

Originally Posted by Doitsu

Why don't you post your fake smallcaps question in the Regex subforum?

Sure, why not, for my own education.

But to be clear, that question is independent of my real question here, because I really do believe there is a better way to handle this specific problem than regex.

Mister L · 06-23-2020, 03:12 PM

Just curious, should I give up on this or does anyone with the skills to make a plugin think it's a good idea? (One day I want to learn to code plugins myself but I do not currently have those skills and not the time to learn them right now).

In case anyone is half-convinced of the usefulness of this hypothetical plugin:

I'm working on the giant "collected works" book right now and preparing the headings. Yet another example of why regex is not the answer when it comes to redoing the TOC: all the parts of the books that are in the TOC but have no title in the page at all (I put a nodisplay h1 in those cases) or have a different title in the TOC to the one displayed in the page, such as the portrait of the author, the copyright page, the cover, the title page which is called "Title page" in the TOC but obviously not in the page, the "By the same author" / bibliography page, Publisher catalogue page... No choice for those cases but to do it all by hand.

Tex2002ans · 06-23-2020, 03:40 PM

Quote:

Originally Posted by Mister L

[...] or have a different title in the TOC to the one displayed in the page, such as the portrait of the author, the copyright page, the cover, the title page which is called "Title page" in the TOC but obviously not in the page, the "By the same author" / bibliography page, Publisher catalogue page...

The real question is: Does this belong in a TOC at all?

I would strongly lean towards No.

Quote:

Originally Posted by Mister L

No choice for those cases but to do it all by hand.

Sometimes that's what you have to do. Especially if you get some hideous code that's inconsistent spaghetti gobbledeegook like you brought up in this thread.

I'm going to pull a JSWolf and say clean the code up and make it consistent first, then your life will be much easier with the Regex going forward.

* * *

On your Title Casing problem. There are a few solutions, but I've found almost all to be suboptimal and have their own issues on edge cases.

Back in 2014, I used this Regex:

https://www.mobileread.com/forums/sh...53#post2930153
https://www.mobileread.com/forums/sh...d.php?t=233018

(I still use similar nowadays.)

Calibre introduced a "Function Mode" and even has an entire section dedicated in the manual for it, "Automatically fixing the case of headings in the document".

But most of the solutions I've come across over the years don't take into account the nuances needed for proper Title Casing (different Style Guides require different rules).

This is the site I use:

https://capitalizemytitle.com/

It handles title casing better than many of the other tools I've run across over the years... and it does handle edge cases like caps after : or EM DASH.

But you always get stuff like: DNA, RNA, mRNA, First/Last names (DeSanto, McDonald), etc.

Mister L · 06-23-2020, 04:03 PM

Quote:

Originally Posted by Tex2002ans

The real question is: Does this belong in a TOC at all?

I would strongly lean towards No.

Well, you might lean towards no, but sometimes the publisher disagrees, and in many cases I also disagree.

Either way, to be honest, at this point I would prefer for the real question of this thread to be whether or not there is any hope of seeing a plugin just as we've described. Everything else, at this point, is sort of extraneous to the discussion.

Quote:

Originally Posted by Tex2002ans

Sometimes that's what you have to do. Especially if you get some hideous code that's inconsistent spaghetti gobbledeegook like you brought up in this thread.

I'm going to pull a JSWolf and say clean the code up and make it consistent first, then your life will be much easier with the Regex going forward.

Trust me, I don't need to be told to clean up the code.

But part of doing a good job is having the right tools for the job. As I've already said, I DO USE REGEX to clean up the code including to prepare the TOC generation, however I am convinced there is a better way to do that specific thing. It's a separate question to cleaning up the code. Unfortunately I'm not (currently) capable of making the tool I need. If no-one else is interested, that's fine, I'm not going to keep pushing it, it just seemed to me that we got a bit distracted by discussions about regex so I wanted to check where things stood.

Quote:

Originally Posted by Tex2002ans

On your Title Casing problem. There are a few solutions, but I've found almost all to be suboptimal and have their own issues on edge cases.

Back in 2014, I used this Regex:

https://www.mobileread.com/forums/sh...53#post2930153
https://www.mobileread.com/forums/sh...d.php?t=233018

(I still use similar nowadays.)

Calibre introduced a "Function Mode" and even has an entire section dedicated in the manual for it, "Automatically fixing the case of headings in the document".

But most of the solutions I've come across over the years don't take into account the nuances needed for proper Title Casing (different Style Guides require different rules).

This is the site I use:

https://capitalizemytitle.com/

It handles title casing better than many of the other tools I've run across over the years... and it does handle edge cases like caps after : or EM DASH.

But you always get stuff like: DNA, RNA, mRNA, First/Last names (DeSanto, McDonald), etc.

You're kind of making my point for me here to be honest... Do you see how complicated this is, when the work has already been done and the correct titles are already in the file?? Wouldn't it be easier to just click on a plugin and copy them over to each chapter?

Thanks for those suggestions though, and if I get stuck on something in future I will take a look. The small-caps titles were last week so I don't need that at the moment, I'm working on a different project now. Either way, I don't really want to fiddle around with a different site to fix the cases of 2 words every 3 chapters because frankly at that point it's just faster to do it by hand. Plus it looks like that site is in English, most of the books I work on are in French.

Like I said, I do know how to do this the hard way, I am trying to find a better way.

Tex2002ans · 06-24-2020, 02:48 AM

Oh jeeze, I completely misread Mister L's and the other posts. I thought Title Casing methods were being discussed like:

Code:

<h2>TEXT</h2> -> <h2 title="Text">TEXT</h2>
<h2 title="Text">T<small>EXT</small></h2> -> <h2>Text</h2>

so I typed up the ultimate "Title Casing: Everything You Didn't Know You Ever Wanted to Know" post.

After rereading the entire thread, I see Mister L meant the EPUB's TOC (nav/NCX) already had the chapters capitalized the way he wanted.

I'll do very minor answers here, then toss the enormous tangent in the Workshop in a few days.

Quote:

Originally Posted by Mister L

Do you see how complicated this is, when the work has already been done and the correct titles are already in the file??

Well, the 2nd example you gave in Post #1 wasn't correct in the file... so those recommendations were mostly geared towards cleaning types like that.

But now I see what you mean by "correct in the file".

Quote:

Originally Posted by Mister L

Plus it looks like that site is in English, most of the books I work on are in French.

Yeah, French title casing probably brings in its own issues like lowercase l’ before words, or keeping "pour" lowercase.

I definitely don't know of any title casing tool that handles French exceptions. I've only seen American English ones. (More details and edge cases will be in a forthcoming topic.)

Mister L · 06-24-2020, 08:32 AM

Quote:

Originally Posted by Tex2002ans

Oh jeeze, I completely misread Mister L's and the other posts. I thought Title Casing methods were being discussed like:

Code:

<h2>TEXT</h2> -> <h2 title="Text">TEXT</h2>
<h2 title="Text">T<small>EXT</small></h2> -> <h2>Text</h2>

so I typed up the ultimate "Title Casing: Everything You Didn't Know You Ever Wanted to Know" post.

Heh. Yes I did have the impression we had wandered a little bit.

Quote:

Originally Posted by Tex2002ans

After rereading the entire thread, I see Mister L meant the EPUB's TOC (nav/NCX) already had the chapters capitalized the way he wanted.

Correct.

In fact it's not limited to questions of case, it can also be the presentation of chapter number + title (with a line break or in 2 separate tags in the html, but separated with a point or a dash in the TOC...) or something else. Either way the point is they have already been correctly formatted for the TOC but it's impossible to retrieve that information easily so if you have to modify the TOC you have to re-do all the work which has already been done (in addition to all the work of fixing someone else's terrible code). And as my examples show there is no "one size fits all" solution when you start with the xhtml files so there aren't even any shortcuts, it's really inefficient and frustrating.

Quote:

Originally Posted by Tex2002ans

Well, the 2nd example you gave in Post #1 wasn't correct in the file... so those recommendations were mostly geared towards cleaning types like that.

But now I see what you mean by "correct in the file".

"For certain values of 'file'"

(But yes obviously I would never consider those title formats "correct" in the html files. I think we agree on that question. It's astonishing the terrible state of some files made by so-called "professionals" who have charged for their services. These are books made for publishers and on sale in bookstores.)

Quote:

Originally Posted by Tex2002ans

Yeah, French title casing probably brings in its own issues like lowercase l’ before words, or keeping "pour" lowercase.

French I think is simpler than English. The first word of the title is capitalised, and if that word is "The" then the second word generally is as well, but the rest is lower case, except for proper nouns, just like in a sentence. Obviously there can be other exceptions which further complicate the question for regex purposes (roman numerals, acronyms...).

Tex2002ans · 06-24-2020, 03:32 PM

Quote:

Originally Posted by Mister L

It's astonishing the terrible state of some files made by so-called "professionals" who have charged for their services. These are books made for publishers and on sale in bookstores.

Yeah, that's also why I was downplaying wanting to go from their NCX backwards into the HTML itself.

For the most part, the NCX is messed up and I actually want to overwrite it with my clean, beautiful code!

* * *

Another case which might also be helpful is:

Original TOC:

Code:

“Article Title” by Author Last

Original HTML:

Code:

<h2>Article Title</h2>
<p class="author">Author Last</p>

"Proper" Sigil HTML:

Code:

<h2 title="“Article Title” by Author Last">Article Title</h2>
<p class="author">Author Last</p>

99% of the time you want to go HTML->NCX (thus the Sigil Generate TOC), but 1% of the time, you might want to go backwards.
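The usual HTML->NCX direction for this article/author case could be prepared with something like the Python sketch below. It assumes plain-text headings (no inline tags) directly followed by a `<p class="author">`, exactly as in the example; the function name is invented for the illustration.

```python
import re

# Assumed markup: <h2>plain text</h2> immediately followed by <p class="author">plain text</p>.
PATTERN = re.compile(r'<h2>([^<]*)</h2>(\s*<p class="author">([^<]*)</p>)')

def add_article_titles(xhtml):
    """Give each matching <h2> a title attribute of the form “Heading” by Author."""
    def repl(match):
        heading, author_block, author = match.group(1), match.group(2), match.group(3)
        # The captured text is already XHTML source, so only straight
        # double quotes need escaping before it goes into an attribute.
        label = f"“{heading}” by {author}".replace('"', "&quot;")
        return f'<h2 title="{label}">{heading}</h2>{author_block}'
    return PATTERN.sub(repl, xhtml)

print(add_article_titles('<h2>Article Title</h2>\n<p class="author">Author Last</p>'))
```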

Quote:

Originally Posted by Mister L

French I think is simpler than English. The first word of the title is capitalised, and if that word is "The" then the second word generally is as well, but the rest is lower case, except for proper nouns, just like in a sentence. Obviously there can be other exceptions which further complicate the question for regex purposes (roman numerals, acronyms...).

Oh, I have it all written down... I have it all...

And French with their "XIVth Century" stuff, or their little superscript e.

Side Note: One of my favorite games, Europa Universalis IV, takes place during the ~1450s-1850s, and has fans from around the world who are super into history. When discussing history on forums, since most are ESL (English as Second Language), they bring in all these quirky language styles from around the world.
