Bug or feature of the TOC generator?

Artha · 11-22-2011, 01:38 PM

I have Sigil 0.4.2 and am trying to convert a book from PDF to ePub by hand. I have reduced the file to barebones HTML and will attach a CSS file later. Now things should be nice and clean, with only the HTML tags and nothing more.

Yet when I hit „Generate TOC from headings” an id="heading_id_2" or id="heading_id_3" is attached to the headings. Why is that?

And can it be disabled?

Serpentine · 11-22-2011, 02:05 PM

The ToC needs to tell the viewer where the element is, and it does this through the id attribute. Removing it makes no sense and would break the epub in any case. If the elements already have ids, Sigil will not generate new ones.
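
To make that concrete, here is a minimal Python sketch of how a ToC target splits into a file and an id. The file name `chapter1.xhtml` is made up for illustration; only the `heading_id_2` id comes from this thread.

```python
from urllib.parse import urldefrag

# Hypothetical ToC target: the file name is invented for this example,
# the id is the kind Sigil attaches to headings.
target = "chapter1.xhtml#heading_id_2"

# urldefrag splits the reference into the file part and the fragment (the id).
file_part, fragment = urldefrag(target)

print(file_part)  # the file the reader opens
print(fragment)   # the id it jumps to inside that file
```

If the id is removed from the heading but the fragment stays in the ToC entry, the entry points at something that no longer exists.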

Artha · 11-22-2011, 02:26 PM

So, in the end there is a TOC file in the final ePub?

theducks · 11-22-2011, 02:36 PM

Quote:

Originally Posted by Artha

So, in the end there is a TOC file in the final ePub?

Yes, it is part of the NCX file. EPUB does not need the inline TOC that mobi needs.

Artha · 11-22-2011, 03:12 PM

Oh! That makes sense. Thanks.

Serpentine · 11-22-2011, 05:52 PM

If you want to strip attributes from most tags, you can try something like:

Code:

find:
<(/)?\b(h\d|[uod]l|[pisbuq]|hr|br|abbr|acronym|address|area|base|basefont|bdo|big|blockquote|body|caption|center|cite|code|col|colgroup|dd|del|dfn|dir|div|dt|em|font|frame|frameset|hr|ins|kbd|label|legend|li|map|menu|noframes|noscript|object|param|pre|samp|select|small|span|strike|strong|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|var)\b([^<>]+[^/])(/)?>
replace:
<\1\2\4>

You will, however, need to make _sure_ that you are not removing font formatting which might be important. For example, calibre often uses a span class to mark italic (etc.) text instead of <i> tags; read the CSS and replace the tags correctly. You will also need to regenerate the TOC if you leave in <h> tags. Remove tags as needed and, well, be careful.
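
For anyone who wants to try the pattern outside Sigil first, here is a rough Python sketch. It assumes Python's `re` module (3.5 or later, where unmatched groups in the replacement become empty strings) treats the pattern the same way Sigil's engine does, which is not verified here. The sample markup and the `strip_attributes` helper name are invented for illustration.

```python
import re

# The find pattern from the post above, unchanged, split across lines for readability.
TAG_ATTRS = re.compile(
    r"<(/)?\b(h\d|[uod]l|[pisbuq]|hr|br|abbr|acronym|address|area|base|basefont"
    r"|bdo|big|blockquote|body|caption|center|cite|code|col|colgroup|dd|del|dfn"
    r"|dir|div|dt|em|font|frame|frameset|hr|ins|kbd|label|legend|li|map|menu"
    r"|noframes|noscript|object|param|pre|samp|select|small|span|strike|strong"
    r"|sub|sup|table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|var)"
    r"\b([^<>]+[^/])(/)?>"
)


def strip_attributes(html: str) -> str:
    """Drop all attributes from the listed tags, keeping the tag name and any '/'."""
    # \1 = optional leading slash, \2 = tag name, \4 = optional self-closing slash.
    return TAG_ATTRS.sub(r"<\1\2\4>", html)


# Hypothetical input: a heading with a generated id and a calibre-style italic span.
sample = (
    '<h2 id="heading_id_2">Chapter 1</h2>'
    '<p class="calibre1">Some <span class="italic">text</span>.</p>'
)
cleaned = strip_attributes(sample)
print(cleaned)
```

Note that the span loses its class here, so the italic marking is gone. That is exactly the case to check in the CSS before running this on a whole book.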

Artha · 11-23-2011, 03:24 AM

Weird. Why would Calibre use span, or <i>, when there's <em> for that?

JSWolf · 11-23-2011, 03:27 AM

Quote:

Originally Posted by Artha

I have Sigil 0.4.2 and am trying to convert a book from PDF to ePub by hand. I have reduced the file to barebones HTML and will attach a CSS file later. Now things should be nice and clean, with only the HTML tags and nothing more.

Yet when I hit „Generate TOC from headings” an id="heading_id_2" or id="heading_id_3" is attached to the headings. Why is that?

And can it be disabled?

You don't need the id="heading_id_2" if each chapter is a separate file. All you do in the NCX is call the file you want for each chapter entry without needing the # anchor.

What I do is use regex to strip it. I would search for id="heading_id_[0-9]*" and replace with nothing. This works in Notepad++. I've not tried it in Sigil, so I do not know if that regex would work there. Someone may be able to fix it if it's incorrect.
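
If the ids are stripped this way, any NCX entry that still carries a # anchor needs to point at the bare file instead. A rough Python sketch of that step follows. It assumes each chapter is its own file, as described above. The sample NCX, the file name and the `drop_fragments` helper are invented for illustration and are not something Sigil provides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"

# Hypothetical one-entry NCX; real files carry more metadata.
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="navPoint-1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="chapter1.xhtml#heading_id_2"/>
    </navPoint>
  </navMap>
</ncx>"""


def drop_fragments(ncx_xml: str) -> str:
    """Rewrite every <content src="file#id"> so it points at the file only."""
    # Keep the NCX namespace as the default one when serializing.
    ET.register_namespace("", NCX_NS)
    root = ET.fromstring(ncx_xml)
    for content in root.iter(f"{{{NCX_NS}}}content"):
        src = content.get("src", "")
        # urldefrag(...).url is the reference without its '#fragment' part.
        content.set("src", urldefrag(src).url)
    return ET.tostring(root, encoding="unicode")


result = drop_fragments(SAMPLE_NCX)
print(result)
```

This only makes sense when the heading sits at the top of its own file. With several headings per file the anchors are still needed.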

Quote:

Originally Posted by Artha

Weird. Why would Calibre use span, or <i>, when there's <em> for that?

Because that's what is in the HTML generated from the PDF.

I've seen code from some conversions where there was something like text of the book in every line, and it got worse with italics. I was able to regex remove most of it and then manually remove it for every line that had italics.

With Calibre, a lot of the oddities are in the source fed to it.

theducks · 11-23-2011, 09:25 AM

@JSWolf
I would use (in Sigil)
search for:

Code:

\s+id="heading_id_\d+"

This matches numeric ids with any number of digits.

Replace with nothing.
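
The same search and replace can be tried in Python before touching the book. This is only a sketch using Python's `re` module rather than Sigil's engine, and the sample markup and `strip_heading_ids` helper name are invented for illustration.

```python
import re

# The search pattern from this post: whitespace, then a generated heading id.
HEADING_ID = re.compile(r'\s+id="heading_id_\d+"')


def strip_heading_ids(html: str) -> str:
    """Remove Sigil-style generated heading ids, leaving other ids alone."""
    return HEADING_ID.sub("", html)


# Hypothetical input: two generated ids and one hand-written id.
sample = (
    '<h2 id="heading_id_2">One</h2>\n'
    '<h3 id="heading_id_13">Two</h3>\n'
    '<p id="note1">Kept</p>'
)
stripped = strip_heading_ids(sample)
print(stripped)
```

Including the leading `\s+` in the match means no stray space is left behind inside the tag.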

JSWolf · 11-25-2011, 05:59 PM

What type of regex does Sigil use in case I need to look online for help?

opitzs · 11-26-2011, 11:03 PM

It uses Qt regex at the moment, but with 0.5 this will be changed to PCRE. I can't wait...
