06-20-2020, 11:42 AM | #1 |
Groupie
Posts: 159
Karma: 91148
Join Date: Jun 2010
Device: Sony 350
|
Add title="" to h* based on existing TOC -- suggestion for new feature (or plugin?)
How easy / possible would it be to reverse engineer an existing TOC and add the existing titles as they appear, to a title="" in an h* tag at the appropriate point in the book? If the file is really badly made and the chapter titles are in some random tag like p or div it might be necessary to add a blank h* with a display:none to it.
I'm not sure whether this is better suited to a feature in the Tools > Table of Contents menu or as a plugin (super subtle WINK to any plugin coders who are bored and looking for a new challenge...) but it would be an amazing tool to have. Use cases: 1. you need to combine several epubs into a "collected works" file, or 2. you need to separate a "collected works" file into its individual books and make independent epubs of each, and the original epubs you are given have chapter headings in two (or more) parts, and/or with extraneous code in them which will make regenerating the TOC complicated, for instance : Code:
<h1 epub:type="title" class="part_n"><span>4</span></h1>
<h1 epub:type="title" class="part_tit"><span>The&#160;Whale speaks of&#160;what&#160;she has&#160;learned about&#160;humans</span></h1>
Existing (desired) TOC entry: 4. The Whale speaks of what she has learned about humans
Or (even worse...) Code:
<h1 id="toc_marker-26">21</h1>
<h2><span class="Cap">E</span><span class="SmallCap">N CHEMIN POUR</span> <span class="Cap">S</span><span class="SmallCap">HADAR</span> <span class="Cap">L</span><span class="SmallCap">OGOTH</span></h2>
Existing (desired) TOC entry: 21. En chemin pour Shadar Logoth
(Note, just to be PERFECTLY CLEAR, I had absolutely nothing to do with making these monstrosities originally, or I wouldn't have this problem.)
Those examples are taken straight from actual books I'm working on. Last week I had to deal with case 2, and this week I've got to tackle case 1 (14 epubs, not a single one of which has chapter titles that will facilitate re-generating the TOC once I've pulled them all into the collection). That's a lot of fiddly regex-ing and/or hand-coding one by one to copy the TOC entries into title="" (that's what I did last week because I couldn't think of a better solution, and it was pretty damned annoying just on one book, let alone 14). It's not the first time and certainly won't be the last, so I'm hoping that by the next time I have to deal with this there will be a better way.
If there already is a better way and I just don't know about it (I did go through the plugin index just in case...), by all means PLEASE tell me.
06-20-2020, 03:29 PM | #2 |
A Hairy Wizard
Posts: 3,099
Karma: 18727053
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Unfortunately there are so many different examples of how people do them badly...it would be very difficult to encompass all cases. I usually just use regex.
|
06-20-2020, 03:34 PM | #3 |
Grand Sorcerer
Posts: 27,552
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
The problem is that Sigil already does the exact reverse of what you want. The title attribute of an h tag (if present) is used by Sigil to generate the text of the ToC. That's what allows users to generate ToCs that have different text than what's between the h tags.
Last edited by DiapDealer; 06-20-2020 at 09:40 PM. |
06-21-2020, 08:53 AM | #4 | |
Groupie
Posts: 159
Karma: 91148
Join Date: Jun 2010
Device: Sony 350
|
Quote:
When the TOC is made, Sigil already knows what each TOC entry should say and what part of the document it is linked to (whether it harvested the info from title attributes or otherwise). It can assemble this info into a new file (nav.xhtml and toc.ncx). It puts it together using appropriate tags.
From there, it should be possible to ask it to redistribute the same elements in the opposite direction: the nav is the source of the information rather than the destination and each title is copied back to its destination. Sigil will either have a toc id there, or an h* (with or without a title=""), or nothing if the link just goes to the file. If there is already a title="", overwriting it could be useful if you've made some changes directly in the nav. If there is no title, it can be added.
If you are worried about potential conflict with existing code, rather than asking it to add this to a title="" it can be added as an html comment or some other code that seems appropriate to you; maybe something like <section title="Text of title" /> or <a title="Text of title" /> or anything else. From there it would be fairly trivial to regex the text into a title="" and be able to easily regenerate the TOC as needed.
Obviously this would be a separate feature to generating the toc even if it's closely related, just like there is a separate "epub3 tools" menu to generate the ncx from the nav, and it wouldn't be necessary for every book, but it would be useful in a lot of cases, and when it's useful it's REALLY useful.
I frequently have requests to modify files made by someone else, for example the cases I mentioned above or things like adding a preview of the next book at the end of a book that's already published or a new introduction or something like that. This almost always requires some intervention in the TOC and I have never once seen a book that wasn't my own that made it easy to modify the TOC.
Ah, so you also have dealt with this mess.
Yes, I use regex too, but it can be really time consuming because of all the variations and ultimately it's always necessary to do some of it by hand. Wouldn't it be nice if you could just grab all the correct titles from the toc and turn them into references and then just click "Generate TOC", instead of messing about with regex? Last edited by Mister L; 06-21-2020 at 09:00 AM. |
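To make the "nav is the source" idea above a bit more concrete, here is a minimal Python sketch of the first half: reading the existing labels and link targets back out of a nav.xhtml. It is only an illustration, not an existing Sigil feature or plugin. The function name `read_nav_entries` is made up for this example, and it assumes the nav is well-formed XHTML with no named entities such as &nbsp; (the standard-library parser rejects those).

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"


def read_nav_entries(nav_xhtml: str) -> list[tuple[str, str, str]]:
    """Return (file, fragment, label) for every link in the toc nav.

    Hypothetical helper: fragment is '' when the entry links to the
    top of a file rather than to an id inside it.
    """
    root = ET.fromstring(nav_xhtml)
    entries = []
    for nav in root.iter(f"{XHTML}nav"):
        if nav.get(EPUB_TYPE) != "toc":
            continue  # skip landmarks, page-list and other navs
        for a in nav.iter(f"{XHTML}a"):
            file_name, _, fragment = a.get("href", "").partition("#")
            # Flatten nested spans and collapse runs of whitespace.
            label = " ".join("".join(a.itertext()).split())
            entries.append((file_name, fragment, label))
    return entries
```

Each tuple then says which file to open, which id to look for (if any), and what text would go into title="".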
|
06-21-2020, 10:07 AM | #5 |
Sigil Developer
Posts: 7,651
Karma: 5433388
Join Date: Nov 2009
Device: many
|
A plugin might be best for this case.
Just so that everyone is on the same page ... It would take an existing nav or ncx, follow the links back to the target file and element, and add a title attribute to it (remembering to html escape any text) based on the current TOC. If an existing link is to the top of a file, it would inject a new h1 tag with nodisplay set on it with that title. The idea is that after running this plugin, you should be able to regenerate the TOC from h tags in Sigil and get something very very close to the original TOC back. Is that correct?
KevinH
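The per-entry step described above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the plugin itself: `apply_toc_title` is a hypothetical name, no Sigil plugin API is used, the hidden heading uses an inline display:none as a stand-in for "nodisplay", and the input must be well-formed XHTML without named entities.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialize without an "html:" prefix


def apply_toc_title(xhtml: str, fragment: str, title: str) -> str:
    """Copy one TOC label onto the element that the TOC entry links to.

    fragment is the part of the href after '#', or '' when the entry
    links to the top of the file.
    """
    root = ET.fromstring(xhtml)
    if fragment:
        # Find the link target by its id attribute.
        target = next((el for el in root.iter() if el.get("id") == fragment), None)
        if target is None:
            raise ValueError(f"no element with id {fragment!r}")
        target.set("title", title)
    else:
        # Link goes to the top of the file: inject a hidden heading.
        body = root.find(f"{{{XHTML_NS}}}body")
        hidden = ET.Element(
            f"{{{XHTML_NS}}}h1", {"title": title, "style": "display:none"}
        )
        body.insert(0, hidden)
    # ElementTree escapes &, <, > and double quotes in attribute values.
    return ET.tostring(root, encoding="unicode")
```

Note that this sets title="" on whatever element carries the id. If that element is a p or div rather than an h*, a real tool would still need to convert it or add a hidden heading for Sigil's heading-based TOC generation to pick it up.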
06-21-2020, 10:19 AM | #6 | |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
BTW, both problems can be easily fixed with the right regular expressions. For example, you could use the following expressions to merge the two <h1> tags:
Find:
<h1 epub:type="title" class="part_n"><span>(\d+)</span></h1>\s+<h1 epub:type="title" class="part_tit"><span>(.*?)</span></h1>
Replace:
<h1 epub:type="title" class="part_n" title="\1: \2"><span>\1</span><br /><span class="part_tit">\2</span></h1>
If you process the first heading format with it and then generate the TOC, Sigil will add the following entry:
4: The Whale speaks of what she has learned about humans
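For anyone who wants to try the Find/Replace pair above outside Sigil, here is the same pair wrapped in a small Python sketch. Python's re module is swapped in here for Sigil's own regex engine, so treat it as an approximation; `merge_headings` is a name invented for the example.

```python
import re

# Same Find expression as above, split across two lines for readability.
FIND = re.compile(
    r'<h1 epub:type="title" class="part_n"><span>(\d+)</span></h1>\s+'
    r'<h1 epub:type="title" class="part_tit"><span>(.*?)</span></h1>'
)
# Same Replace expression; \1 is the chapter number, \2 the chapter title.
REPLACE = (
    r'<h1 epub:type="title" class="part_n" title="\1: \2">'
    r'<span>\1</span><br /><span class="part_tit">\2</span></h1>'
)


def merge_headings(xhtml: str) -> str:
    """Merge each number/title pair of h1 tags into one h1 with a title attribute."""
    return FIND.sub(REPLACE, xhtml)
```

Because the pattern requires \s+ between the two tags, a pair with no whitespace at all between them is left untouched.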
|
06-21-2020, 01:43 PM | #7 | ||
Groupie
Posts: 159
Karma: 91148
Join Date: Jun 2010
Device: Sony 350
|
Quote:
Quote:
|
||
06-21-2020, 02:02 PM | #8 | |
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
|
|
06-22-2020, 07:09 AM | #9 | |
Groupie
Posts: 159
Karma: 91148
Join Date: Jun 2010
Device: Sony 350
|
Quote:
But to be clear, that question is independent of my real question here, because I really do believe there is a better way to handle this specific problem than regex. |
|
06-23-2020, 03:12 PM | #10 |
Groupie
Posts: 159
Karma: 91148
Join Date: Jun 2010
Device: Sony 350
|
Just curious, should I give up on this or does anyone with the skills to make a plugin think it's a good idea? (One day I want to learn to code plugins myself but I do not currently have those skills and not the time to learn them right now).
In case anyone is half-convinced of the usefulness of this hypothetical plugin: I'm working on the giant "collected works" book right now and preparing the headings. Yet another example of why regex is not the answer when it comes to redoing the TOC is all the parts of the books that are in the TOC but have no title in the page at all (I put a nodisplay h1 in those cases), or that have a different title in the TOC from the one displayed in the page: the portrait of the author, the copyright page, the cover, the title page (which is called "Title page" in the TOC but obviously not in the page), the "By the same author" / bibliography page, the publisher catalogue page... No choice for those cases but to do it all by hand.
06-23-2020, 03:40 PM | #11 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I would strongly lean towards No. Sometimes that's what you have to do. Especially if you get some hideous code that's inconsistent spaghetti gobbledygook like you brought up in this thread. I'm going to pull a JSWolf and say clean the code up and make it consistent first, then your life will be much easier with the Regex going forward.
* * *
On your Title Casing problem. There are a few solutions, but I've found almost all of them to be suboptimal, with their own issues on edge cases. Back in 2014, I used this Regex:
https://www.mobileread.com/forums/sh...53#post2930153
https://www.mobileread.com/forums/sh...d.php?t=233018
(I still use similar nowadays.)
Calibre introduced a "Function Mode" and even has an entire section dedicated to it in the manual, "Automatically fixing the case of headings in the document". But most of the solutions I've come across over the years don't take into account the nuances needed for proper Title Casing (different Style Guides require different rules).
This is the site I use: https://capitalizemytitle.com/
It handles title casing better than many of the other tools I've run across over the years... and it does handle edge cases like caps after : or EM DASH. But you always get stuff like: DNA, RNA, mRNA, First/Last names (DeSanto, McDonald), etc.
Last edited by Tex2002ans; 06-23-2020 at 03:42 PM.
|
06-23-2020, 04:03 PM | #12 | |||
Groupie
Posts: 159
Karma: 91148
Join Date: Jun 2010
Device: Sony 350
|
Quote:
Either way, to be honest, at this point I would prefer for the real question of this thread to be whether or not there is any hope of seeing a plugin just as we've described. Everything else, at this point, is sort of extraneous to the discussion. Quote:
Quote:
Thanks for those suggestions though, and if I get stuck on something in future I will take a look. The small-caps titles were last week so I don't need that at the moment; I'm working on a different project now. Either way, I don't really want to fiddle around with a different site to fix the cases of 2 words every 3 chapters because frankly at that point it's just faster to do it by hand. Plus it looks like that site is in English, and most of the books I work on are in French. Like I said, I do know how to do this the hard way; I am trying to find a better way.
|||
06-24-2020, 02:48 AM | #13 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Oh jeeze, I completely misread Mister L's and the other posts. I thought Title Casing methods were being discussed like:
Code:
<h2>TEXT</h2> -> <h2 title="Text">TEXT</h2>
<h2 title="Text">T<small>EXT</small></h2> -> <h2>Text</h2>
After rereading the entire thread, I see Mister L meant the EPUB's TOC (nav/NCX) already had the chapters capitalized the way he wanted. I'll give very minor answers here, then toss the enormous tangent in the Workshop in a few days. Quote:
But now I see what you mean by "correct in the file". Quote:
I definitely don't know any title casing tool that handles French exceptions. I've only seen American English ones. (More details and edge cases will be in a forthcoming topic.) Last edited by Tex2002ans; 06-24-2020 at 02:56 AM.
||
06-24-2020, 08:32 AM | #14 | |||
Groupie
Posts: 159
Karma: 91148
Join Date: Jun 2010
Device: Sony 350
|
Quote:
Quote:
In fact it's not limited to questions of case, it can also be the presentation of chapter number + title (with a line break or in 2 separate tags in the html, but separated with a point or a dash in the TOC...) or something else. Either way the point is they have already been correctly formatted for the TOC but it's impossible to retrieve that information easily so if you have to modify the TOC you have to re-do all the work which has already been done (in addition to all the work of fixing someone else's terrible code). And as my examples show there is no "one size fits all" solution when you start with the xhtml files so there aren't even any shortcuts, it's really inefficient and frustrating. Quote:
(But yes obviously I would never consider those title formats "correct" in the html files. I think we agree on that question. It's astonishing the terrible state of some files made by so-called "professionals" who have charged for their services. These are books made for publishers and on sale in bookstores.) French I think is simpler than English. The first word of the title is capitalised, and if that word is "The" then the second word generally is as well, but the rest is lower case, except for proper nouns, just like in a sentence. Obviously there can be other exceptions which further complicate the question for regex purposes (roman numerals, acronyms...). Last edited by Mister L; 06-24-2020 at 08:47 AM. |
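The French rule as stated above is simple enough to sketch in a few lines of Python, and the sketch also shows where it breaks down. Everything here is an assumption made for illustration: the function name, and the choice of "le", "la" and "les" as the French stand-ins for "The". It knows nothing about proper nouns, roman numerals or acronyms, which is exactly the part that still needs a human (or the existing TOC).

```python
# Assumption: these three articles stand in for "The" in the rule above.
FRENCH_DEFINITE_ARTICLES = {"le", "la", "les"}


def french_title_case(title: str) -> str:
    """Capitalise the first word, plus the second if the first is an article."""
    words = title.lower().split()
    if not words:
        return ""
    words[0] = words[0].capitalize()
    if words[0].lower() in FRENCH_DEFINITE_ARTICLES and len(words) > 1:
        words[1] = words[1].capitalize()
    # Everything else stays lower case, proper nouns included.
    return " ".join(words)
```

Run on "EN CHEMIN POUR SHADAR LOGOTH" this gives "En chemin pour shadar logoth", with the place name lower-cased, whereas the existing TOC entry already has the correct "Shadar Logoth".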
|||
06-24-2020, 03:32 PM | #15 | ||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
For the most part, the NCX is messed up and I actually want to overwrite with my clean, beautiful code! * * * Another case which might also be helpful is: Original TOC: Code:
“Article Title” by Author Last
Code:
<h2>Article Title</h2>
<p class="author">Author Last</p>
Code:
<h2 title="“Article Title” by Author Last">Article Title</h2>
<p class="author">Author Last</p>
Quote:
And French with their "XIVth Century" stuff, or their little superscript e. Side Note: One of my favorite games, Europa Universalis IV, takes place during the ~1450s-1850s, and has fans from around the world who are super into history. When discussing history on forums, since most are ESL (English as Second Language), they bring in all these quirky language styles from around the world. Last edited by Tex2002ans; 06-24-2020 at 03:39 PM. |
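The article-plus-author case earlier in this post can be handled mechanically when the markup is as regular as in that example. Below is a minimal Python sketch under that assumption: the heading and author contain plain text only (no nested tags), the author paragraph directly follows the h2, and the names `ARTICLE` and `add_title_with_author` are invented for the illustration.

```python
import re

# Plain-text h2 followed by a plain-text author paragraph.
ARTICLE = re.compile(r'<h2>([^<]*)</h2>(\s*)<p class="author">([^<]*)</p>')


def add_title_with_author(xhtml: str) -> str:
    """Add title="“Heading” by Author" to each h2 that has an author line."""

    def build(match: re.Match) -> str:
        heading, gap, author = match.groups()
        # The captured text is already XHTML-escaped; only a literal double
        # quote could still break the attribute value.
        label = f"“{heading}” by {author}".replace('"', "&quot;")
        return (
            f'<h2 title="{label}">{heading}</h2>{gap}'
            f'<p class="author">{author}</p>'
        )

    return ARTICLE.sub(build, xhtml)
```

Headings that contain spans or other inline tags are deliberately not matched, so they would still need separate handling.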
||
|