Add title="" to h* based on existing TOC -- suggestion for new feature (or plugin?)

Mister L · 06-24-2020, 04:14 PM

Quote:

Originally Posted by Tex2002ans

Yeah, that's also why I was downplaying wanting to go from their NCX backwards into the HTML itself.

For the most part, the NCX is messed up and I actually want to overwrite with my clean, beautiful code!

It's funny, I find that most often, no matter how abominable the xhtml code is, the ncx / toc is usually okay. Weird eh? I do wish I could magically overwrite all the html and css with my own clean, beautiful code though.

Quote:

Originally Posted by Tex2002ans

Another case which might also be helpful is:

Original TOC:

Code:

“Article Title” by Author Last

Original HTML:

Code:

<h2>Article Title</h2>
<p class="author">Author Last</p>

"Proper" Sigil HTML:

Code:

<h2 title="“Article Title” by Author Last">Article Title</h2>
<p class="author">Author Last</p>

99% of the time you want to go HTML->NCX (thus the Sigil Generate TOC), but 1% of the time, you might want to go backwards.

See, now you're getting it.

I don't know how many % it is, certainly I wouldn't need it every day, but nonetheless it would be useful pretty often. Just in the past two weeks, I've had to work on files I didn't originally make for:
- splitting a book into individual tomes
- grouping 14 books into 1 (current project)
- add the first chapter of a new book (which I did make), to 2 other previously published books (which I didn't make), as a preview
- add a new introduction to a book I didn't originally make.

The worst is definitely the current 14-book collection if only for the sheer volume, but the splitting project was annoying as well (that was the one with the fake smallcaps and also I had to renumber all the chapters of the second book after the split to start from 1 instead of 35 or whatever it was; thank god for the AddIDs plugin which at least made that part easy). But I would definitely use a backwards-ncx plugin often enough to make me really really wish I knew how to code it myself.

Quote:

Originally Posted by Tex2002ans

Oh, I have it all written down... I have it all...

And French with their "XIVth Century" stuff, or their little superscript e.

Side Note: One of my favorite games, Europa Universalis IV, takes place during the ~1450s-1850s, and has fans from around the world who are super into history. When discussing history on forums, since most are ESL (English as Second Language), they bring in all these quirky language styles from around the world.

Haaa, okay, I understand better how you know all this stuff.

Doitsu · 06-25-2020, 12:36 AM

Quote:

Originally Posted by Mister L

Just curious, should I give up on this or does anyone with the skills to make a plugin think it's a good idea?

As a stop-gap, I created a quick & dirty BeautifulSoup-based plugin that adds title attributes to h1..h6 entries. (It doesn't check the TOC and it doesn't merge successive h1..h6 entries.)

It changes:

Code:

<h1 id="toc_marker-26">21</h1>

    <h2><span class="Cap">E</span><span class="SmallCap">N CHEMIN POUR</span> <span class="Cap">S</span><span class="SmallCap">HADAR</span> <span class="Cap">L</span><span class="SmallCap">OGOTH</span></h2>

to:

Code:

 <h1 id="toc_marker-26" title="21">21</h1>
  <h2 title="En chemin pour shadar logoth"><span class="Cap">E</span><span class="SmallCap">N CHEMIN POUR</span> <span class="Cap">S</span><span class="SmallCap">HADAR</span> <span class="Cap">L</span><span class="SmallCap">OGOTH</span></h2>

You can download it from my Dropbox.

The plugin code is:

Spoiler:

For English books, change the following section:

Code:

            title = title.lower().capitalize() # ABC DEF > Abc def
            #title = title.title() # ABC DEF > Abc Def

to:

Code:

            #title = title.lower().capitalize() # ABC DEF > Abc def
            title = title.title() # ABC DEF > Abc Def
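For readers who want to see the idea without the download, here is a minimal sketch of the transformation described above. It is not the plugin's code (the spoiler contents are not reproduced here), and it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, which assumes the fragment is well-formed XHTML. The function name `add_heading_titles` and the `english` switch are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def add_heading_titles(fragment, english=False):
    """Return the fragment (wrapped in <body>) with title= set on every h1..h6."""
    # Assumption: the fragment is well-formed and carries no XML namespace.
    root = ET.fromstring(f"<body>{fragment}</body>")
    for el in root.iter():
        if el.tag not in HEADINGS:
            continue
        # Join the text of nested spans, then collapse runs of whitespace.
        text = re.sub(r"\s+", " ", "".join(el.itertext())).strip()
        if not text:
            continue
        # Same two casing choices as in the post: sentence case or title case.
        el.set("title", text.title() if english else text.lower().capitalize())
    return ET.tostring(root, encoding="unicode")


sample = (
    '<h1 id="toc_marker-26">21</h1>'
    '<h2><span class="Cap">E</span><span class="SmallCap">N CHEMIN POUR</span> '
    '<span class="Cap">S</span><span class="SmallCap">HADAR</span> '
    '<span class="Cap">L</span><span class="SmallCap">OGOTH</span></h2>'
)
result = add_heading_titles(sample)
```

Like the plugin, this reads only the heading itself: it does not consult the TOC and it does not merge the `h1` and `h2` into one entry.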

DiapDealer · 06-25-2020, 09:04 AM

I'm not interested. There's just too many finicky bits to overcome when going ncx/nav to html (as opposed to going html to ncx/nav). Mainly the situation where the h tag doesn't exist and needs to be created. What level does it get created as (3,4,5,6)? Where in the html will it go (outside/inside an adjacent div/span)? Can't put it at the top because not all chapters begin with a brand-new file. Does it need a class name for styling purposes (display: none)? If so, will I have to parse all css classnames (after determining which external css files are linked) to determine I'm not reusing an existing class? I'd also need to check any css that might be included in the header of the html file to be sure. Otherwise, I'd have to resort to using uuids for class names (after making sure they start with a valid character). And in my opinion, that gets uglier than the issue said plugin is trying to fix--only very occasionally.

Breakage is not an option if I'm to create a plugin. And unfortunately, there's too much that can go wrong when attempting to reverse the automatic toc generation process. Especially if doing so requires generating html that doesn't exist. There's waaay too much to account for (properly) for all the more it would be used.

Someone's free to use all of the problem areas I mentioned to work on it if they like, though.
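To make one of those finicky bits concrete, here is a rough sketch of just the class-name check: collect the class names already present in the book's CSS, then fall back to a suffixed name on a collision. Everything here is an assumption for illustration (the helper names, the regex, the example class `toc-title-hidden`); it does not touch the harder placement questions above.

```python
import re
import uuid

# A class selector: a dot followed by a CSS identifier that does not start with a digit.
CLASS_SELECTOR = re.compile(r"\.(-?[_a-zA-Z][-\w]*)")


def existing_classes(css_text):
    """Collect names that look like class selectors in a stylesheet."""
    without_comments = re.sub(r"/\*.*?\*/", "", css_text, flags=re.S)
    # This over-matches (e.g. the "css" in url(a.css)), which only makes
    # the collision check more conservative.
    return set(CLASS_SELECTOR.findall(without_comments))


def unique_class(preferred, taken):
    """Return preferred if it is free, otherwise preferred plus a uuid suffix."""
    if preferred not in taken:
        return preferred
    while True:
        # preferred starts with a letter, so the result is still a valid class name.
        candidate = f"{preferred}-{uuid.uuid4().hex[:8]}"
        if candidate not in taken:
            return candidate


css = ".toc-title-hidden { display: none; } p.author { font-style: italic; }"
taken = existing_classes(css)
name = unique_class("toc-title-hidden", taken)
```

A real plugin would still have to gather the CSS from every linked stylesheet and from any `<style>` block in the file head before calling this, which is the part that makes the job tedious.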

Mister L · 06-25-2020, 09:32 AM

Quote:

Originally Posted by Doitsu

As a stop-gap, I created a quick & dirty BeautifulSoup-based plugin that adds title attributes to h1..h6 entries. (It doesn't check the TOC and it doesn't merge successive h1..h6 entries.)

Thank you Doitsu, I will try it out.

Quote:

Originally Posted by DiapDealer

I'm not interested. There's just too many finicky bits to overcome when going ncx/nav to html (as opposed to going html to ncx/nav).

No worries, I'm grateful to have Sigil already.

Those are all really good questions so I'll try to answer them the way *I* think is logical, in case anyone does want to try their hand at this.

1. Mainly the situation where the h tag doesn't exist and needs to be created. What level does it get created as (3,4,5,6)?
=> Keep it simple: all toc entries are created as h1 and any modification of levels, if necessary, can be done by hand later on. This seems like a completely reasonable solution to me especially since most situations where the h tag doesn't exist would be "top level" pages anyway rather than subsections, and the level of the h is much less important than the text of the title.

Edit 3: if you are a coding psychopath who enjoys pain you could match h levels to ol levels of the toc. But that would definitely not be *necessary* for the plugin to be useful.

Edit 4(?): Or leave that choice to the user as well, and in that case you could even propose that the title would be inserted in an html comment rather than a tag; then you are safe from any conflict with the existing code but the title text is still available in the file to be added via regex to whatever is appropriate.

2. Where in the html will it go (outside/inside an adjacent div/span)? Can't put it at the top because not all chapters begin with a brand-new file.
=> I would say put it exactly where the link sends it: above (outside) the element containing the toc id if there is one, or at the top of the html file if there isn't an id. I don't expect this to be the end of the process, just a way to easily take care of several annoying and very time-consuming intermediary steps.

3. Does it need a class name for styling purposes (display: none)?
=> Maybe an option to choose between naked h1 and <h1 style="display:none;">, or adding a class chosen by the user. Adding styles seems way beyond the scope of this plugin. Alternatively it could use a style like the split markers use, "sigil_toc_marker" or something; that is very unlikely to clash with local styles.

4. If so, will I have to parse all css classnames (after determining which external css files are linked) to determine I'm not reusing an existing class? I'd also need to check any css that might be included in the header of the html file to be sure. Otherwise, I'd have to resort to using uuids for class names (after making sure they start with a valid character). And in my opinion, that gets uglier than the issue said plugin is trying to fix--only very occasionally.
=> Definitely not. Any styles should be handled by the person making the book, either via an option in the plugin or afterwards.
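As a starting point for anyone who wants to try, the answers above all depend on first reading the TOC entries and where they point. A minimal sketch for an epub2 `toc.ncx`, using only the standard library: the function name `walk_ncx` and the sample entries ("Part One", `part1.xhtml`) are hypothetical. The nesting depth is returned as well, in case someone does want to match h levels to TOC levels.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def walk_ncx(ncx_xml):
    """Yield (depth, label, file, fragment) for every navPoint, in document order."""
    root = ET.fromstring(ncx_xml)
    nav_map = root.find("ncx:navMap", NCX_NS)

    def visit(parent, depth):
        for point in parent.findall("ncx:navPoint", NCX_NS):
            label = point.findtext("ncx:navLabel/ncx:text", default="", namespaces=NCX_NS)
            content = point.find("ncx:content", NCX_NS)
            src = content.get("src", "") if content is not None else ""
            # "file.xhtml#id" -> ("file.xhtml", "id"); fragment is "" for file-only links.
            file_name, _, fragment = src.partition("#")
            yield depth, label.strip(), file_name, fragment
            yield from visit(point, depth + 1)

    if nav_map is not None:
        yield from visit(nav_map, 1)


sample_ncx = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="p1" playOrder="1">
      <navLabel><text>Part One</text></navLabel>
      <content src="part1.xhtml"/>
      <navPoint id="p2" playOrder="2">
        <navLabel><text>1. Le lion sur la colline</text></navLabel>
        <content src="9782820516909-5.xhtml#toc_marker-6"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>"""
entries = list(walk_ncx(sample_ncx))
```

An empty fragment corresponds to the "top of the html file" case; a non-empty one is the id the link sends the reader to.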

Quote:

Originally Posted by DiapDealer

Breakage is not an option if I'm to create a plugin. And unfortunately, there's too much that can go wrong when attempting to reverse the automatic toc generation process. Especially if doing so requires generating html that doesn't exist. There's waaay too much to account for (properly) for all the more it would be used.

Someone's free to use all of the problem areas I mentioned to work on it if they like, though.

Understood.

As I'm not a dev I realise I may not be fully understanding how complicated this could be... I do think sticking to the absolute simplest solutions would limit the possibilities for breakage though. Thanks for the discussion points anyway, they are helpful.

Edit: Doitsu, I tried to give you karma for the plugin, but it says I must "spread it around" first, so instead I'll just repeat publicly my gratitude.

Edit 2:

Quote:

Originally Posted by Mister L

It's funny, I find that most often, no matter how abominable the xhtml code is, the ncx / toc is usually okay. Weird eh? I do wish I could magically overwrite all the html and css with my own clean, beautiful code though.

Also, thinking about this, it actually makes sense to me. The TOC is visible to the publisher, and so they check it and will complain if there is a problem. Whereas most of them have no idea what good code is, never mind semantic anything, and wouldn't know how to look at it at all or what to look for, so they never say anything about that unless it's so bad it actually breaks somehow.

DiapDealer · 06-25-2020, 04:15 PM

You misunderstood me. I wasn't really asking for human answers to most of those questions. I was asking what kind of spaghetti logic would be required within a plugin to get the plugin to always make the "right" decisions on its own? A human can look at the code and easily (sometimes) see where the new tag should be created. A script has to parse all of the html (and make informed guesses about where it makes sense to put it semantically speaking) to avoid creating malformed or improperly nested tags.

Mister L · 06-29-2020, 11:05 AM

Quote:

Originally Posted by DiapDealer

You misunderstood me. I wasn't really asking for human answers to most of those questions. I was asking what kind of spaghetti logic would be required within a plugin to get the plugin to always make the "right" decisions on its own? A human can look at the code and easily (sometimes) see where the new tag should be created. A script has to parse all of the html (and make informed guesses about where it makes sense to put it semantically speaking) to avoid creating malformed or improperly nested tags.

I do see your point, but I think there might be solutions to avoid making things more complicated than they need to be. In many cases the toc will link to the file. So the added tag would be the first tag in that file. No need to worry about nesting problems or malformed tags in that case.

If the link goes to somewhere inside that file, there will be an id, and the id is necessarily already in a tag of some kind; many times it will be an h tag already. Regardless, I think the easiest solution there is just to add the title directly to the existing tag with the id, even if it's not an h*:
<h1 id="tocid01"> becomes <h1 id="tocid01" title="Title of this toc item">
or
<a id="tocid01"> becomes <a id="tocid01" title="Title of this toc item">
If necessary it can then be added to / turned into a proper h tag by the user. In this case also, no problems with nesting or malformed tags.
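A minimal sketch of that "add the title to whatever tag carries the id" step, assuming the file is well-formed XHTML. The function name `title_on_id` is made up, and the standard library's ElementTree is used here only for illustration: it re-serializes the whole file and drops the DOCTYPE, so a real plugin would need something gentler.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # keep the default namespace unprefixed on output


def title_on_id(xhtml, fragment, title):
    """Set title= on the element whose id equals fragment; return None if absent."""
    root = ET.fromstring(xhtml)
    for el in root.iter():
        if el.get("id") == fragment:
            el.set("title", title)  # works for h*, a, p, div alike
            return ET.tostring(root, encoding="unicode")
    return None  # no such id: the caller should leave the file alone


page = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p><a id="tocid01"></a>Bla</p>'
    "</body></html>"
)
updated = title_on_id(page, "tocid01", "Title of this toc item")
```

Because only an attribute is added, no new element is created and the nesting of the file is unchanged.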

If adding an h tag above the id is more complicated than it seems (and I'm willing to accept that it is if you say so), then I think there are safer options that still get the job done; at worst, the absolutely fail-safe one being to just add the text inside an html comment immediately before the tag containing the id, whatever it may be, without trying to figure out where it's safe to add an h tag:

Code:

<some random h*, p, or div opening tag which may or may not be broken by inserting an h tag after it>
<!-- Title of the toc item -->
<a id="tocid01"></a>
Bla
</close random tag>

Code:

<!-- Title of the toc item -->
<h id="tocid01">

The point of the plugin would really be just to get the text of the titles from the toc back into the html files where it is useful and can from there be regexed (or at worst copy-pasted) into whatever level of h tag is appropriate. Like I said, it would be one part of a process, much like the "AddIDs" plugin is one step of something larger. No CSS, minimal (potentially none) HTML.
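The comment fallback can be sketched with a plain text substitution, so the rest of the file stays byte-for-byte as it was. This is an assumption-laden sketch, not a finished tool: `comment_before_id` is a made-up name, and the regex only recognises double-quoted `id="..."` attributes.

```python
import re


def comment_before_id(xhtml, fragment, title):
    """Insert <!-- title --> immediately before the opening tag carrying id=fragment."""
    # "--" is not allowed inside an XML comment, so break it up first.
    safe = title.replace("--", "- -")
    opening = re.compile(
        r'<[A-Za-z][^<>]*?\sid="' + re.escape(fragment) + r'"[^<>]*>'
    )
    match = opening.search(xhtml)
    if match is None:
        return xhtml  # id not found: return the text unchanged
    return xhtml[:match.start()] + f"<!-- {safe} -->\n" + xhtml[match.start():]


page = '<p>\n<a id="tocid01"></a>\nBla\n</p>'
result = comment_before_id(page, "tocid01", "Title of the toc item")
```

A comment is legal between any two tags, which is why this variant avoids the nesting question altogether; the title text can then be moved into a real tag with a regex pass.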

slowsmile · 06-30-2020, 09:07 AM

I've also had a go at trying to create a plugin for your problem. I doubt whether it will satisfy all your requirements but it might be useful to you.

For this plugin I've tried to imagine worst case scenarios where you either have bad formatting or you have to split an ebook or you have to add some files in Sigil.

This plugin only deals with the main chapter headings for each file that you select in Sigil's Book Browser. It does not deal with h2, h3, h4 etc. That's really up to you. It also doesn't matter what tag is used on the chapter heading in the file because the plugin will always attempt to find the topmost line containing text in every file(which, fingers crossed, should be the main heading). Thus the topmost line with text in any selected file is assumed to be the chapter heading for that file.

With regard to 'proper' title case, I've also included a JSON file which has already been pre-populated to allow certain words to be in lower case like 'of', 'from', 'by', 'to' etc in the title attribute. The JSON file only contains one item -- called 'ignore-titlecase' -- which contains a space delimited string. You can also add your own words to this string to avoid title case and keep lower case so that you can more easily control title case outcomes in the 'title' attributes for the headings.

To use the plugin you must first click and select all the files in the Book Browser that you want processed by the plugin. You can also use Shift-Click to select a group of files. Then just run the plugin and the title attribute with current heading name will be added to the topmost html tag for each selected file.

Also note that this plugin will not work well if fake smallcaps are used in the headings. See link below:

[Original Plugin has been removed] -- see updated plugin in my post below
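Since the plugin itself is no longer attached, here is a small sketch of how a space-delimited 'ignore-titlecase' string could drive the title casing. It is not the plugin's code: the function names, the extra words beyond 'of', 'from', 'by', 'to', and the rule that the first word is always capitalised are all assumptions.

```python
import json


def load_ignore_words(json_text):
    """Read the space-delimited 'ignore-titlecase' string into a set of words."""
    return set(json.loads(json_text)["ignore-titlecase"].lower().split())


def title_case(text, ignore):
    """Capitalise each word except the ones listed in ignore."""
    out = []
    for i, word in enumerate(text.lower().split()):
        # Assumption: the first word is capitalised even if it is in the list.
        if i > 0 and word in ignore:
            out.append(word)
        else:
            out.append(word[:1].upper() + word[1:])
    return " ".join(out)


settings = '{"ignore-titlecase": "of from by to the a an and in"}'
ignore = load_ignore_words(settings)
heading = title_case("THE LORD OF THE RINGS", ignore)
```

Adding a word to the string in the JSON file is then enough to keep it lower case in every generated title attribute.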

DiapDealer · 06-30-2020, 11:32 AM

Quote:

Originally Posted by Mister L

I do see your point, but I think there might be solutions to avoid making things more complicated than they need to be. In many cases the toc will link to the file. So the added tag would be the first tag in that file. No need to worry about nesting problems or malformed tags in that case.

"In many cases" is not useful programmatically. Not when trying to create a script that must work in ALL cases (if harm is to be avoided).

Mister L · 06-30-2020, 09:18 PM

Quote:

Originally Posted by slowsmile

I've also had a go at trying to create a plugin for your problem. I doubt whether it will satisfy all your requirements but it might be useful to you.

For this plugin I've tried to imagine worst case scenarios where you either have bad formatting or you have to split an ebook or you have to add some files in Sigil.

This plugin only deals with the main chapter headings for each file that you select in Sigil's Book Browser. It does not deal with h2, h3, h4 etc. That's really up to you. It also doesn't matter what tag is used on the chapter heading in the file because the plugin will always attempt to find the topmost line containing text in every file(which, fingers crossed, should be the main heading). Thus the topmost line with text in any selected file is assumed to be the chapter heading for that file.

With regard to 'proper' title case, I've also included a JSON file which has already been pre-populated to allow certain words to be in lower case like 'of', 'from', 'by', 'to' etc in the title attribute. The JSON file only contains one item -- called 'ignore-titlecase' -- which contains a space delimited string. You can also add your own words to this string to avoid title case and keep lower case so that you can more easily control title case outcomes in the 'title' attributes for the headings.

To use the plugin you must first click and select all the files in the Book Browser that you want processed by the plugin. You can also use Shift-Click to select a group of files. Then just run the plugin and the title attribute with current heading name will be added to the topmost html tag for each selected file.

Also note that this plugin will not work well if fake smallcaps are used in the headings. See link below:

Thanks very much, I will try this one out too. If I understand correctly, both your and Doitsu's plugins are starting from the title tags already in the files, rather than starting from the toc; while it's not the solution I'm looking for to my original problem I do think it can be useful when you are making the toc for the first time and maybe save a little bit of regex work so I'm glad to have both of them.

Quote:

Originally Posted by DiapDealer

"In many cases" is not useful programmatically. Not when trying to create a script that must work in ALL cases (if harm is to be avoided).

I was trying to distinguish between cases where there is a toc id on a tag somewhere inside the file, and cases where the toc points just to the file, and since Sigil understands those cases (and can in fact generate the necessary ID's when making the toc) I would think it should be possible for the plugin to understand them too, if it's starting from the toc.

Either way, as I have already mentioned, there is a solution which works for ALL cases unless I'm missing something, which is to insert the text in an html comment. Does what I am after (the user can then easily grab the text and stick it into whatever tag is appropriate with a regex), causes no harm, simplest possible solution that I can think of.

slowsmile · 06-30-2020, 09:25 PM

Quote:

Originally posted by Mister L:
"Thanks very much, I will try this one out too. If I understand correctly, both your and Doitsu's plugins are starting from the title tags already in the files, rather than starting from the toc; while it's not the solution I'm looking for to my original problem I do think it can be useful when you are making the toc for the first time and maybe save a little bit of regex work so I'm glad to have both of them."

As I see it, there is really no need to create the TOC with the plugin because surely it's easier just to add a title attribute(with heading names) to each chapter heading which would then allow you to manually create another TOC in seconds using Generate TOC then Create TOC in Sigil.

Quote:

Originally posted by Mister L:
"If I understand correctly, both your and Doitsu's plugins are starting from the title tags already in the files"

No, my plugin does not assume a title attribute is already in the chapter headings. My plugin always creates and inserts new title attributes with chapter heading values into each chapter heading tag for each selected file. As well, it is also not necessary for the chapter heading to have an h1 tag -- it can have any tag you like -- since my plugin actually finds the chapter headings by choosing the first html line with text in it from the start of each file. As I've already mentioned, I've tried to design the plugin to accommodate and handle worst case file scenarios including files added to your epub that are a mess. So simply assuming that all your selected files will already be using an h1 tag seems a bit short sighted to me, especially if, as you say, you're dealing with either adhoc added files or you're dealing with added files that are complete trainwrecks where absolutely nothing is guaranteed and where nothing should really be expected or assumed.

Thinking about it further in regard to you creating a new TOC, it would also seem sensible to change each discovered chapter heading tag to an h1 tag since, when I tested it, Sigil's Generate TOC does not seem to recognize or find chapter headings with a 'title' attribute value. But if you changed all the chapter heading tags to h1 they will be found when you open the Generate TOC dialog in Sigil. And within Sigil's Generate TOC dialog you can re-type and change heading names, exclude headings and vary indents to your heart's content. In other words there's really no need for you to use regex or anything else to generate and create your new TOC -- you should be able to do it all easily and quickly using Sigil's Generate/Create TOC facility after running the new plugin below.

Anyway, I've modified the original plugin to also rename the html tags of all selected chapter headings to h1 in the new version of the plugin(see below), which I think will be more useful to you.

Mister L · 07-02-2020, 02:45 PM

Quote:

Originally Posted by slowsmile

As I see it, there is really no need to create the TOC with the plugin because surely it's easier just to add a title attribute(with heading names) to each chapter heading which would then allow you to manually create another TOC in seconds using Generate TOC then Create TOC in Sigil.

Yes, absolutely. My question was about where the text of the title was being copied, eg from the toc or from the html file, sorry I was unclear.

Quote:

Originally Posted by slowsmile

No, my plugin does not assume a title attribute is already in the chapter headings. My plugin always creates and inserts new title attributes with chapter heading values into each chapter heading tag for each selected file. As well, it is also not necessary for the chapter heading to have an h1 tag -- it can have any tag you like -- since my plugin actually finds the chapter headings by choosing the first html line with text in it from the start of each file. As I've already mentioned, I've tried to design the plugin to accommodate and handle worst case file scenarios including files added to your epub that are a mess. So simply assuming that all your selected files will already be using an h1 tag seems a bit short sighted to me, especially if, as you say, you're dealing with either adhoc added files or you're dealing with added files that are complete trainwrecks where absolutely nothing is guaranteed and where nothing should really be expected or assumed.

Ok, good to know. I am not sure how often I will need this but I will try it out.

Quote:

Originally Posted by slowsmile

Thinking about it further in regard to you creating a new TOC, it would also seem sensible to change each discovered chapter heading tag to an h1 tag since, when I tested it, Sigil's Generate TOC does not seem to recognize or find chapter headings with a 'title' attribute value. But if you changed all the chapter heading tags to h1 they will be found when you open the Generate TOC dialog in Sigil.

I am not sure about this. If I have a book which has parts and then chapters, the chapters will not be in h1 tags, and if the correct h tags are already present and only the presentation of the title needs to be fixed (for instance, fake small caps, or all caps when I want lower case, or number + title with some added punctuation for the toc, or a toc entry different to what is displayed in the page...), I do not want the plugin to change them. This is actually somewhat outside the scope of what I am looking for.

Quote:

Originally Posted by slowsmile

And within Sigil's Generate TOC dialog you can re-type and change heading names, exclude headings and vary indents to your heart's content. In other words there's really no need for you to use regex or anything else to generate and create your new TOC -- you should be able to do it all easily and quickly using Sigil's Generate/Create TOC facility after running the new plugin below. Anyway, I've modified the original plugin to also rename the html tags of all selected chapter headings to h1 in the new version of the plugin(see below), which I think will be more useful to you.

Changing names in the dialog is very useful if you have only one or two small changes to make, but I don't want to have to re-do every single title that way. My goal is still to be able to preserve the EXISTING correctly presented titles from the existing toc, when they are different to the titles displayed in the html files. It is much more complicated to do this when your starting point for acquiring the text is the html file; modifications and changes will be necessary on a title-by-title basis, which could be avoided if it were possible to simply copy the correct text from the toc back into the html file.

slowsmile · 07-03-2020, 08:14 AM

Here's another plugin that you might like to try out. This plugin does the following:

Gathers all the toc item heading strings from the epub TOC page and puts them into a list.
Finds all headings and subheadings in the epub html using the TOC item heading list.
Adds an associated toc item heading value(from the TOC page) to a new title attribute per heading or per sub-heading found, which will always be in title case. You were not very clear whether you just wanted only main chapter headings tagged with the title attribute or all chapter headings and sub-headings tagged. So I went with all chapter headings and sub-headings tagged with the title attribute.
The new plugin no longer changes any chapter heading tags or part heading tags in the epub -- it leaves them alone.
The new plugin automatically iterates over all the files in your epub - so there's no need to select any files before running the new plugin.
For the plugin to work without problems you must also make sure that the TOC page in the epub has either "Table of Contents" or "Contents" as the TOC page heading.

This will also be my last attempt at trying to create your plugin. See below for the new plugin.

[Plugin has been removed] .

Mister L · 07-03-2020, 07:24 PM

Quote:

Originally Posted by slowsmile

Here's another plugin that you might like to try out. This plugin does the following:

Gathers all the toc item heading strings from the epub TOC page and puts them into a list.

Yes! Brilliant! This is exactly what I was hoping for.

Quote:

Originally Posted by slowsmile

Finds all headings and subheadings in the epub html using the TOC item heading list.

Perfect.

Quote:

Originally Posted by slowsmile

Adds an associated toc item heading value(from the TOC page) to a new title attribute per heading or per sub-heading found, which will always be in title case. You were not very clear whether you just wanted only main chapter headings tagged with the title attribute or all chapter headings and sub-headings tagged. So I went with all chapter headings and sub-headings tagged with the title attribute.

That sounds good. But I don't quite understand "which will always be in title case". If the title is simply being copied from the TOC, then the case should be immaterial because the plugin does not need to modify the case or the text at all.

Quote:

Originally Posted by slowsmile

The new plugin no longer changes any chapter heading tags or part heading tags in the epub -- it leaves them alone.

Perfect.

Quote:

Originally Posted by slowsmile

The new plugin automatically iterates over all the files in your epub - so there's no need to select any files before running the new plugin.

Perfect.

Quote:

Originally Posted by slowsmile

For the plugin to work without problems you must also make sure that the TOC page in the epub has either "Table of Contents" or "Contents" as the TOC page heading.

No problem.

Quote:

Originally Posted by slowsmile

This will also be my last attempt at trying to create your plugin. See below for the new plugin.

Understood. Thank you very much for your time and hard work. I do have a question though as I tried the plugin and it didn't do what I was expecting; maybe I did something wrong.

I took a file which I recently had to split into 2 books. It's an epub2 file with only a toc.ncx so I generated a toc.html with the title <h1 class="sgc-toc-title">Table of Contents</h1> (Sigil default title). Then I ran the plugin.

Chapter 1 appears in the toc.html as:

Code:

<div class="sgc-toc-level-1">
  <a href="9782820516909-5.xhtml#toc_marker-6">1. Le lion sur la colline</a>
</div>

(in blue, the text that should be copied back into the title attribute)

and in the html file as:

Code:

<h1 id="toc_marker-6">1</h1>

<h2><span class="Cap">L</span><span class="SmallCap">E LION SUR LA COLLINE</span></h2>

After running the plugin the result I expected was:

Code:

<h1 id="toc_marker-6" title="1. Le lion sur la colline">1</h1>

<h2><span class="Cap">L</span><span class="SmallCap">E LION SUR LA COLLINE</span></h2>

But instead the result is this:

Code:

<h1 id="toc_marker-6" title="1">1</h1>
<h2><span class="Cap" title="L">L</span><span class="SmallCap">E LION SUR LA COLLINE</span></h2>

The full title from the toc.html doesn't get copied back; only the first character of the text of each h* tag, and one on each h* tag, even though there is only one toc id (on the h1).

I understand if you don't want to spend any more time on this, but I want to check if there is something I did wrong that I can easily fix. If this is something that needs to be tweaked in the plugin itself, if you can give me a hint about which file to modify in the plugin I will see if I can figure out how to fix it myself starting with what you have made.

Thanks again very much for taking a crack at it, I appreciate it. (I tried to give you some karma for that but it says I must "spread it around".)
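To illustrate the expected behaviour described in this post, here is a minimal sketch of the lookup side: reading each link on the generated toc.html and splitting it into file, id and full label, so that the label can be matched by id rather than by heading text. It is not the plugin's code; `toc_page_entries` is a made-up name and the page is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET


def toc_page_entries(toc_xhtml):
    """Return (file, fragment, label) for every <a href> on the TOC page."""
    root = ET.fromstring(toc_xhtml)
    entries = []
    for el in root.iter():
        # Compare the local name so that namespaced XHTML also works.
        if el.tag.rsplit("}", 1)[-1] != "a" or not el.get("href"):
            continue
        file_name, _, fragment = el.get("href").partition("#")
        # Keep the label exactly as written, apart from collapsing whitespace.
        label = " ".join("".join(el.itertext()).split())
        entries.append((file_name, fragment, label))
    return entries


toc = (
    '<body><h1 class="sgc-toc-title">Table of Contents</h1>'
    '<div class="sgc-toc-level-1">'
    '<a href="9782820516909-5.xhtml#toc_marker-6">1. Le lion sur la colline</a>'
    "</div></body>"
)
entries = toc_page_entries(toc)
```

With entries keyed this way, the single id `toc_marker-6` receives the whole label unchanged, and headings without a TOC id (the `h2` in this example) are left alone.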

Hitch · 07-03-2020, 11:01 PM

Quote:

Originally Posted by Mister L

(But yes obviously I would never consider those title formats "correct" in the html files. I think we agree on that question. It's astonishing the terrible state of some files made by so-called "professionals" who have charged for their services. These are books made for publishers and on sale in bookstores.)

I do wish to jump in here--if by "professionals" you mean formatters, formatters are absolutely not responsible for title casing. We're not editors or proofreaders and we are not paid to do that work. In fact, if we "forget our place" and do make corrections, we're typically told off for it. I once had to listen to an ass-chewing by a rather jumped-up self-published author, who informed me that if she wanted her book ruined by "a bunch of self-important clerks," she'd hire some.

(Tex here does do that, for one of his clients in particular, but that's a unique situation.)

So, if you happened to have meant formatters, please know that formatters do NOT make those choices. Believe me, at least once a week I get a manuscript with "forward" in it and punctuation outside of quotation marks where it oughtn't be, incorrect emdash use, and on and on and on, but formatters don't earn remotely enough money to also proofread and correct what we see. And trade publishers? They hire out Indian firms, so...fuhgeddaboudit.

Hitch

slowsmile · 07-03-2020, 11:42 PM

Quote by Mister L:

"I took a file which I recently had to split into 2 books. It's an epub2 file with only a toc.ncx so I generated a toc.html with the title <h1 class="sgc-toc-title">Table of Contents</h1> (Sigil default title). Then I ran the plugin.

Chapter 1 appears in the toc.html as:

Code:

<div class="sgc-toc-level-1"> <a href="9782820516909-5.xhtml#toc_marker-6">1. Le lion sur la colline</a> </div>

(in blue, the text that should be copied back into the title attribute)

and in the html file as:

Code:

<h1 id="toc_marker-6">1</h1>
<h2><span class="Cap">L</span><span class="SmallCap">E LION SUR LA COLLINE</span></h2>

After running the plugin the result I expected was:

Code:

<h1 id="toc_marker-6" title="1. Le lion sur la colline">1</h1>
<h2><span class="Cap">L</span><span class="SmallCap">E LION SUR LA COLLINE</span></h2>

But instead the result is this:

Code:

<h1 id="toc_marker-6" title="1">1</h1>
<h2><span class="Cap" title="L">L</span><span class="SmallCap">E LION SUR LA COLLINE</span></h2>

The full title from the toc.html doesn't get copied back; only the first character of the text of each h* tag, and one on each h* tag, even though there is only one toc id (on the h1).

I understand if you don't want to spend any more time on this, but I want to check if there is something I did wrong that I can easily fix. If this is something that needs to be tweaked in the plugin itself, if you can give me a hint about which file to modify in the plugin I will see if I can figure out how to fix it myself starting with what you have made.

Thanks again very much for taking a crack at it, I appreciate it. (I tried to give you some karma for that but it says I must "spread it around".)"

I think in my original post to you I mentioned that you will not get good results with this plugin if your epub is using fake titlecase or fake smallcaps in your headings. In the code above, you're using fake titlecase: one span class capitalizes the first letter and another span class renders all the text after it as small caps. There are too many span tags in the heading, and that's the reason why I can't fix or resolve that formatting problem with my plugin.

If you want to run the plugin without problems then, for each chapter heading in the epub, just remove all that crappy span code inside the heading tags and type in the heading text that you want to see. You might be able to do that more quickly using Sigil's Search and Replace. If you do that the plugin should run without any problems at all.
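For anyone who would rather script that cleanup than do it by hand, here is a minimal sketch of the idea in plain Python. It is not part of the plugin, and the function name `unwrap_heading_spans` is made up for illustration. It assumes the headings are simple h1-h6 elements with no nested headings, and it throws away every span wrapper inside them, so any styling those spans carried is lost.

```python
import re

# Matches one complete h1-h6 element: opening tag, inner markup, closing tag.
HEADING = re.compile(r'(<h([1-6])\b[^>]*>)(.*?)(</h\2>)', re.S)
# Matches an opening or closing <span> tag, with or without attributes.
SPAN_TAG = re.compile(r'</?span\b[^>]*>')


def unwrap_heading_spans(xhtml):
    """Drop <span> wrappers inside h1-h6, keeping the text they contain."""
    def repl(match):
        inner = SPAN_TAG.sub('', match.group(3))
        return match.group(1) + inner + match.group(4)
    return HEADING.sub(repl, xhtml)


sample = ('<h2><span class="Cap">L</span>'
          '<span class="SmallCap">E LION SUR LA COLLINE</span></h2>')
cleaned = unwrap_heading_spans(sample)
```

The text itself is left exactly as typed, so a heading written in capitals stays in capitals; recasing it would be a separate step.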

DiapDealer · 06-25-2020, 09:04 AM

I'm not interested. There's just too many finicky bits to overcome when going ncx/nav to html (as opposed to going html to ncx/nav). Mainly the situation where the h tag doesn't exist and needs to be created.

What level does it get created as (3,4,5,6)? Where in the html will it go (outside/inside an adjacent div/span)? Can't put it at the top because not all chapters begin with a brand-new file. Does it need a class name for styling purposes (display: none)? If so, will I have to parse all css classnames (after determining which external css files are linked) to determine I'm not reusing an existing class? I'd also need to check any css that might be included in the header of the html file to be sure. Otherwise, I'd have to resort to using uuids for class names (after making sure they start with a valid character). And in my opinion, that gets uglier than the issue said plugin is trying to fix--only very occasionally.

Breakage is not an option if I'm to create a plugin. And unfortunately, there's too much that can go wrong when attempting to reverse the automatic toc generation process. Especially if doing so requires generating html that doesn't exist. There's waaay too much to account for (properly) for all the more it would be used.

Someone's free to use all of the problem areas I mentioned to work on it if they like, though.

Last edited by DiapDealer; 06-25-2020 at 04:10 PM.
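To make the class-name concern concrete, here is a rough sketch of the two options mentioned above: scan the stylesheets for names already in use, or fall back to a uuid. Everything here is hypothetical (the function names, the `toc-hidden` base name, the `c-` prefix). The regex is a crude over-approximation rather than a CSS parser, so it will also pick up things like the `png` in `url(cover.png)`, which only makes the collision check more cautious.

```python
import re
import uuid

# Crude class-selector matcher: a dot followed by something shaped like
# a CSS identifier. Over-matches on purpose; it is not a real CSS parser.
CLASS_SELECTOR = re.compile(r'\.(-?[_a-zA-Z][\w-]*)')


def used_class_names(css_texts):
    """Collect every class-like name found in the given CSS strings."""
    names = set()
    for css in css_texts:
        names.update(CLASS_SELECTOR.findall(css))
    return names


def fresh_class_name(used, base="toc-hidden"):
    """Return base, or base-2, base-3, ... whichever is not taken yet."""
    if base not in used:
        return base
    n = 2
    while "%s-%d" % (base, n) in used:
        n += 1
    return "%s-%d" % (base, n)


def uuid_class_name():
    """Fallback: a uuid with a letter prefix, since a raw uuid may start with a digit."""
    return "c-" + uuid.uuid4().hex


used = used_class_names([".a{} p.toc-hidden{display:none}"])
candidate = fresh_class_name(used)
```

Even this small piece has to be told which CSS to read (linked files plus any style block in the html head), which is the kind of bookkeeping the post is objecting to.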

slowsmile · 06-30-2020, 09:07 AM

I've also had a go at trying to create a plugin for your problem. I doubt whether it will satisfy all your requirements but it might be useful to you. For this plugin I've tried to imagine worst case scenarios where you either have bad formatting or you have to split an ebook or you have to add some files in Sigil.

This plugin only deals with the main chapter headings for each file that you select in Sigil's Book Browser. It does not deal with h2, h3, h4 etc. That's really up to you. It also doesn't matter what tag is used on the chapter heading in the file because the plugin will always attempt to find the topmost line containing text in every file (which, fingers crossed, should be the main heading). Thus the topmost line with text in any selected file is assumed to be the chapter heading for that file.

With regard to 'proper' title case, I've also included a JSON file which has already been pre-populated to allow certain words to be in lower case, like 'of', 'from', 'by', 'to' etc., in the title attribute. The JSON file only contains one item, called 'ignore-titlecase', which contains a space delimited string. You can also add your own words to this string to avoid title case and keep lower case so that you can more easily control title case outcomes in the 'title' attributes for the headings.

To use the plugin you must first click and select all the files in the Book Browser that you want processed by the plugin. You can also use Shift-Click to select a group of files. Then just run the plugin and the title attribute with the current heading name will be added to the topmost html tag for each selected file. Also note that this plugin will not work well if fake smallcaps are used in the headings.

See link below:

[Original Plugin has been removed] -- see updated plugin in my post below

Last edited by slowsmile; 07-01-2020 at 07:53 AM.
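The title-case rule described above can be sketched roughly as follows. This is not the plugin's code, which is no longer posted. The 'ignore-titlecase' key and the space delimited string come from the post, while the word list, the function name `smart_title`, and the choice to always capitalize the first word are assumptions made for illustration.

```python
import json

# Stand-in for the plugin's JSON file: one item holding a space delimited
# string of words that should stay lower case (word list is an assumption).
CONFIG = json.loads('{"ignore-titlecase": "of from by to a an the and in on"}')
IGNORE = set(CONFIG["ignore-titlecase"].split())


def smart_title(text, ignore=IGNORE):
    """Title-case text, leaving ignored words lower case except in first position."""
    result = []
    for position, word in enumerate(text.split()):
        lower = word.lower()
        if position > 0 and lower in ignore:
            result.append(lower)
        else:
            result.append(lower[:1].upper() + lower[1:])
    return " ".join(result)


title = smart_title("THE LION OF THE HILL")
```

Adding a word to the string in the JSON item is all it takes to keep that word lower case in the generated titles.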

slowsmile · 07-03-2020, 08:14 AM

Here's another plugin that you might like to try out. This plugin does the following:

- Gathers all the toc item heading strings from the epub TOC page and puts them into a list.
- Finds all headings and subheadings in the epub html using the TOC item heading list.
- Adds an associated toc item heading value (from the TOC page) to a new title attribute per heading or per sub-heading found, which will always be in title case.

You were not very clear whether you wanted only main chapter headings tagged with the title attribute or all chapter headings and sub-headings tagged. So I went with all chapter headings and sub-headings tagged with the title attribute.

The new plugin no longer changes any chapter heading tags or part heading tags in the epub -- it leaves them alone. The new plugin automatically iterates over all the files in your epub, so there's no need to select any files before running the new plugin. For the plugin to work without problems you must also make sure that the TOC page in the epub has either "Table of Contents" or "Contents" as the TOC page heading.

This will also be my last attempt at trying to create your plugin. See below for the new plugin.

[Plugin has been removed]

Last edited by slowsmile; 07-04-2020 at 03:12 AM.
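For comparison, here is a minimal sketch of a different way to pair TOC entries with headings: by the id in each TOC link's href rather than by heading text. This is not how the plugin above works, and the function names are hypothetical. It assumes a Sigil-style toc.html with double-quoted href attributes that contain a `#fragment`, and it only adds a title to an element that already carries the matching id.

```python
import re
from html import escape

# <a href="file.xhtml#fragment">link text</a>
TOC_LINK = re.compile(r'<a\s+href="([^"#]*)#([^"]+)"[^>]*>(.*?)</a>', re.S)
ANY_TAG = re.compile(r'<[^>]+>')


def toc_entries(toc_html):
    """Return {(filename, fragment id): link text} for every TOC link."""
    entries = {}
    for filename, fragment, inner in TOC_LINK.findall(toc_html):
        text = " ".join(ANY_TAG.sub("", inner).split())
        entries[(filename, fragment)] = text
    return entries


def add_title(xhtml, fragment, title):
    """Add title="..." to the first opening tag whose id equals fragment."""
    pattern = re.compile(
        r'(<[A-Za-z][^>]*\sid="%s"[^>]*?)(\s*/?>)' % re.escape(fragment))

    def repl(match):
        if re.search(r'\stitle="', match.group(1)):
            return match.group(0)  # leave an existing title alone
        return '%s title="%s"%s' % (
            match.group(1), escape(title, quote=True), match.group(2))

    return pattern.sub(repl, xhtml, count=1)


toc = ('<div class="sgc-toc-level-1"> '
       '<a href="9782820516909-5.xhtml#toc_marker-6">'
       '1. Le lion sur la colline</a> </div>')
chapter = '<h1 id="toc_marker-6">1</h1>'
for (filename, fragment), text in toc_entries(toc).items():
    chapter = add_title(chapter, fragment, text)
```

Keying on the id sidesteps the fake smallcaps problem, because the heading text is never read. It does nothing for TOC entries that point at a whole file with no fragment, and it cannot create a heading that is missing.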



DiapDealer · 06-25-2020, 04:15 PM

You misunderstood me. I wasn't really asking for human answers to most of those questions. I was asking what kind of spaghetti logic would be required within a plugin to get the plugin to always make the "right" decisions on its own? A human can look at the code and easily (sometimes) see where the new tag should be created. A script has to parse all of the html (and make informed guesses about where it makes sense to put it semantically speaking) to avoid creating malformed or improperly nested tags.
