[Plugin] IDErrorCheck

slowsmile · 10-18-2017, 07:41 PM

Checks, repairs and reports all id errors in the epub

Requirements
Plugin Type: Edit
MIT Licence(OSI)
Minimum Sigil requirement: v0.9.3 or higher
Python Requirements: Python 3.4+ (Bundled or External)
OS Requirements: Windows, Linux or OSX
*** Tested on Windows 7, 8 & 10 only ***
Current Version: "0.2.2"

Installation
* Select Manage Plugins from the Plugins menu. In the dialog box, select either the Bundled Python or the External Python (Python 3.4+ must be installed on your computer to run this plugin externally).

* Click Add Plugin and select IDErrorCheck_vXXX.zip. This will load and install the plugin into Sigil, which you can then run by selecting Plugins > Edit > IDErrorCheck

Description
This plugin was originally written with the sole intention of properly reporting and, if possible, fixing Epubcheck's infamous "colon" id error problems. This plugin now also does the following:

* Converts all "name" attributes to "id" attributes in the html files.

* Checks and repairs invalid id attribute values in the epub's html files: illegal spaces, illegal first-digit-start errors and other illegal non-alphanumerics that commonly occur within id attribute values. (v0.1.5)

* Also checks and repairs all internal links that contain bad bookmarks associated with the above html id problems.(v0.1.5)

* Checks and repairs all book uuid values in the toc.ncx and content.opf. If an illegal book uuid value is found then another unique uuid will be automatically generated to replace it.(v0.1.5)

* Now checks and repairs all navPoint id values in the toc.ncx.(v0.1.5)

* Checks and logs all id errors occurring in the content.opf manifest or spine without fixing them.

* Will properly check, flag and identify Epubcheck's "colon" id errors and fix these errors.

* At the end of the plugin run, an error dialog will display a simple error list showing all relevant information about each id error including associated file, line number, reason and bad id.
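The checks listed above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the names `id_problem` and `scan_ids` are made up for this example, and the pattern is a simplified ASCII approximation of the "XML name without colons" rule that Epubcheck enforces for epub2.

```python
import re

# Simplified ASCII approximation of an XML name without colons.
# Real XML names also allow many non-ASCII letters; this sketch ignores them.
VALID_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def id_problem(value):
    """Return a short reason string if the id is invalid, else None."""
    if value == "":
        return "empty id"
    if any(ch.isspace() for ch in value):
        return "contains whitespace"
    if value[0].isdigit():
        return "starts with a digit"
    if not VALID_ID.fullmatch(value):
        return "contains an illegal character"
    return None

def scan_ids(text):
    """Yield (line_number, bad_id, reason) for each invalid id in the markup."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r'\sid="([^"]*)"', line):
            reason = id_problem(match.group(1))
            if reason:
                yield line_number, match.group(1), reason
```

Each tuple carries the line number, the bad id and the reason, which is the kind of information the error list is described as showing.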

Caveat
Don't use the "Mend and prettify..." Sigil feature directly after using this plugin. Doing so will change and increase the number of lines in the html files so that any reported error line numbers generated by the plugin automatically become inaccurate and void.

Plugin Run
First load your epub into Sigil and then just run the plugin. If you only want to know which errors have not been fixed, run the plugin twice. On the first run the display log shows all errors, both fixed and not fixed. On the second run it shows only the errors that have not been fixed.

Update: This plugin can now process epubs that contain svg images without giving svg errors in Epubcheck.


AlanHK · 10-19-2017, 04:54 AM

Is this plugin's functionality now all included in your CustomCleanerPlus plugin?

A note: you seem to change IDs beginning with a digit by replacing that digit with an x.
Which will probably be fine, but could create duplicate IDs, e.g.:

id="1" id="2"
both become id="x"

I manually corrected IDs by prepending X. There must be a limit to the length of an ID string, so I guess you should check if adding a character would push it over that if you were really being careful.
Or just forget the original ID and regen them all.

slowsmile · 10-19-2017, 07:26 AM

@AlanHK...

Quote:

Is this plugin's functionality now all included in your CustomCleanerPlus plugin?

No, this code hasn't been added to the CustomCleanerPlus plugin (CCP). The reason is that CCP is a cleaner for html files and epubs, which has nothing really to do with checking or fixing ids.

The just-released IDErrorCheck does swap in an 'x' char for first char digit errors only. It also substitutes an underscore in all id values that have illegal spaces. It also regens both book ids in the toc.ncx and content.opf files if they are bad. That's all it fixes. All other illegal id values -- such as those containing illegal non-alphanumeric chars -- are just reported. ID attribute errors in the content.opf are also not fixed -- just reported -- because of the complex rules and myriad dependencies between ids and hrefs within the content.opf and toc.ncx.
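As a rough illustration of the book id regeneration mentioned above, here is a minimal sketch. It assumes the book id is stored in the common `urn:uuid:` form; the helper name `ensure_book_uuid` is hypothetical and the plugin's real validity test may differ.

```python
import uuid

PREFIX = "urn:uuid:"

def ensure_book_uuid(value):
    """Return value unchanged if it holds a well-formed UUID, else a new one."""
    candidate = value[len(PREFIX):] if value.startswith(PREFIX) else value
    try:
        uuid.UUID(candidate)  # raises ValueError for malformed strings
        return value
    except ValueError:
        # Assumption: a random (version 4) UUID is an acceptable replacement.
        return PREFIX + str(uuid.uuid4())
```

Whatever replacement is generated, the same value would have to be written to both the toc.ncx and the content.opf so that the two book ids still match.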

DiapDealer · 10-19-2017, 08:19 AM

I think what he's saying is that replacing any first-digits in an id with an 'x' could possibly result in identical ids in the same html file. Prepending the 'x' (instead of swapping) would at least guarantee that already unique ids would stay that way.
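The difference between the two approaches is easy to see in a few lines of Python. The function names here are invented for the example and are not taken from the plugin.

```python
def substitute_x(value):
    """Replace a leading digit with 'x' (the behaviour being questioned)."""
    return "x" + value[1:] if value[:1].isdigit() else value

def prepend_x(value):
    """Put 'x' in front of a leading digit (the suggested behaviour)."""
    return "x" + value if value[:1].isdigit() else value

ids = ["1", "2", "1a", "2a"]
substituted = [substitute_x(v) for v in ids]  # ['x', 'x', 'xa', 'xa']
prepended = [prepend_x(v) for v in ids]       # ['x1', 'x2', 'x1a', 'x2a']
```

Substituting collapses four distinct ids into two, while prepending keeps all four distinct. Prepending can still collide in one case: when the file already contains an id equal to the new value, for example an existing `x1` next to `1`.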

slowsmile · 10-19-2017, 07:07 PM

@DiapDealer...I'll try and put in the suggested change. This change will only apply to fixing the first char digit errors in the epub.

slowsmile · 10-19-2017, 08:36 PM

Plugin Update: The plugin has been updated (v0.1.2):

* Changed handling of illegal first char digit id errors. These errors are now fixed by prepending (not substituting) an 'x' char into the id value string. Thanks to AlanHK & DiapDealer.

slowsmile · 10-25-2017, 07:07 PM

Could someone please add this new plugin to the Sigil Plugin Index? Thanks in advance.

KevinH · 10-27-2017, 01:18 PM

Just added it.

BeckyEbook · 03-13-2018, 09:26 AM

For illegal first-digit-start errors, the plugin replaces the id after the hash in links, but the incorrect IDs themselves are not fixed.

Sample illegal ID:

Code:

<h1 id="123abc">Chapter 1</h1>

Sample link to illegal ID:

Code:

<a href="../Text/start.xhtml#123abc">Chapter 1</a>

First sample is not corrected.

Second is corrected to:

Code:

<a href="../Text/start.xhtml#x123abc">Chapter 1</a>
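For the link and its target to stay in sync, the id and every fragment that points at it need the same change. A minimal sketch of such a pass, using plain regular expressions rather than a real XHTML parser (`prepend_x_everywhere` is a hypothetical name, not a plugin function):

```python
import re

# id values that start with a digit, and href fragments that start with a digit.
LEADING_DIGIT_ID = re.compile(r'(\sid=")(\d)')
LEADING_DIGIT_FRAGMENT = re.compile(r'(\shref="[^"#]*#)(\d)')

def prepend_x_everywhere(markup):
    """Prefix 'x' to digit-leading ids and to link fragments that start with a digit."""
    markup = LEADING_DIGIT_ID.sub(r"\1x\2", markup)
    return LEADING_DIGIT_FRAGMENT.sub(r"\1x\2", markup)
```

Run over both sample lines above, this turns `id="123abc"` into `id="x123abc"` and `#123abc` into `#x123abc`. One limitation of the sketch: it also rewrites fragments in links to external pages, which a real fix would need to skip.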

slowsmile · 03-14-2018, 06:10 AM

@Becky...It's certainly true what you say. But here's what it says in the release notes:

Quote:

* Checks and, if possible, repairs all invalid id attribute values in the epub's html files.

* Also checks and, if possible, repairs internal links that contain bad bookmarks associated with the above html id problems.

* Checks and, if possible, repairs all navPoint id values in the toc.ncx.

The above means that it will not fix every single id problem. I saw no point in fixing all id problems because giving you the line number and the reason for the id fail should really be enough for you to fix the id problem. And the main reason that I wrote this app was because Epubcheck did not describe id problems very well. This plugin was really just an attempt to give proper reasons for any id failure as well as point the user accurately to the problem line in the epub.

If you want to see the problem that Epubcheck has with describing bad ids then you could try running your test epub(with bad ids) through Epubcheck. Then you will see the problem with Epubcheck's strange error messaging, which always seems to involve phantom colons that aren't there.

BeckyEbook · 03-14-2018, 06:55 AM

Thanks for the clarification.
I also understand the "phantom colon", because in most cases it is caused by an id that starts with a number.

However ... Where do I get the "proper reasons for any id failure"?
The IDErrorCheck log only contains records of the changes made (in the example epub file these are in the toc.xhtml file).

Why does the log have no records about the start.xhtml file and its incorrect IDs?

Information about the changes made is valuable, but the file still remains with incorrect identifiers.

EpubCheck gives even more results, because not only does it provide:

Code:

Error while parsing file 'value of attribute "id" is invalid; must be an XML name without colons'.

Additionally, it reports a mismatch between the links, which now point to the id with "x", and the original id (without "x"):

Code:

Fragment identifier is not defined.
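This kind of mismatch can be found with a simple cross-check of link fragments against the ids actually defined in the target file. The sketch below is only an assumption of how such a check might look: `undefined_fragments` is a made-up name, the input is a plain dict of file name to markup, and it assumes file names are unique across folders.

```python
import re

def undefined_fragments(files):
    """files maps a file name to its markup.

    Return (file_name, href) pairs whose #fragment has no matching id
    in the file the link points to.
    """
    ids = {name: set(re.findall(r'\sid="([^"]*)"', text))
           for name, text in files.items()}
    problems = []
    for name, text in files.items():
        for href in re.findall(r'\shref="([^"]*#[^"]*)"', text):
            path, _, fragment = href.partition("#")
            # An empty path means the link points into the current file.
            target = path.rsplit("/", 1)[-1] or name
            if target in ids and fragment not in ids[target]:
                problems.append((name, href))
    return problems
```

With the sample files from this thread, where toc.xhtml links to `#x123abc` but start.xhtml still defines `id="123abc"`, the function reports the link in toc.xhtml.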

Doitsu · 03-14-2018, 07:29 AM

@BeckyEbook: You can avoid this whole issue, if you create epub3 books, because the HTML5 standard allows ids that don't start with a letter.

If that is not an option for you, you can easily identify broken links using the built-in Sigil reports tool (Tools > Reports > Links).

BeckyEbook · 03-14-2018, 07:59 AM

@Doitsu: This is good information about epub3, but most of the files that go through my hands are still epub2.

The report is not perfect in this situation, because I see the same after validation in epubcheck.

It's just a simple replacement, which I can add to Saved Searches:

Code:

id="(\d)

to:

Code:

id="x\1

slowsmile · 03-14-2018, 08:07 AM

I'm not quite sure what you mean by "start.xhtml". Can you clarify what that file is - i.e. is it the cover file, toc file or a text file?

At the end of its run, the IDErrorCheck plugin should display all the results from the id error check in a final dialog. You also have the option of saving these results to a file if you want. Are you getting this dialog at the end of the plugin run? (see thumbnail below)

BeckyEbook · 03-14-2018, 08:26 AM

Quote:

Originally Posted by slowsmile

I'm not quite sure what you mean by "start.xhtml". Can you clarify what that file is - i.e. is it the cover file, toc file or a text file?

Start.xhtml is a text file from the sample epub file attached to my first post.

The log only contains the replacements made in the toc.xhtml file (after the hashes).

