Work on an unzipped EPUB xhtml file

roger64 · 01-15-2016, 07:15 PM

Hi

I use Linux. If I unzip an EPUB, I can run a Python script from a terminal on the .xhtml files, and this way I can perform some tasks that I am unable to do directly on the EPUB.

However, things do not appear to be as easy as that, especially when it comes to saving the output. Are there any recommendations to follow to safely modify these .xhtml files?

The goal is to modify one of these files and import it back into the EPUB. This is what the script looks like.

Spoiler:

Am I missing something obvious? Any practical recommendation would be appreciated.

Doitsu · 01-16-2016, 02:30 AM

AFAIK, a zipped epub archive needs to be packed in a certain sequence: the mimetype file needs to be added first and stored uncompressed.
The Sigil Plugin runner routines contain this Python 3 code that worked fine for me:

Code:

epub_mimetype = b'application/epub+zip'


def unzip_epub_to_dir(path_to_epub, destdir):
    f = open(pathof(path_to_epub), 'rb')
    sz = ZipFile(f)
    for name in sz.namelist():
        data = sz.read(name)
        name = name.replace("/", os.sep)
        filepath = os.path.join(destdir,name)
        basedir = os.path.dirname(filepath)
        if not os.path.isdir(basedir):
            os.makedirs(basedir)
        with open(filepath,'wb') as fp:
            fp.write(data)
    f.close()



def epub_zip_up_book_contents(ebook_path, epub_filepath):
    outzip = zipfile.ZipFile(pathof(epub_filepath), 'w')
    files = unipath.walk(ebook_path)
    if 'mimetype' in files:
        outzip.write(pathof(os.path.join(ebook_path, 'mimetype')), pathof('mimetype'), zipfile.ZIP_STORED)
    else:
        raise Exception('mimetype file is missing')
    files.remove('mimetype')
    for file in files:
        filepath = os.path.join(ebook_path, file)
        outzip.write(pathof(filepath),pathof(file),zipfile.ZIP_DEFLATED)
    outzip.close()

You can find the latest version (with all required imports, e.g. zipfile, os) on Github.
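For readers who only want the re-packing step outside Sigil, here is a minimal sketch that uses nothing but the Python standard library. It is not the Sigil code above: the function name `zip_epub` and its arguments are made up for illustration, and it assumes the unzipped book sits in its own directory with the `mimetype` file at the top level.

```python
import os
import zipfile


def zip_epub(book_dir, epub_path):
    """Pack an unzipped EPUB directory (hypothetical helper, not Sigil's)."""
    mimetype_path = os.path.join(book_dir, "mimetype")
    if not os.path.isfile(mimetype_path):
        raise FileNotFoundError("mimetype file is missing")
    with zipfile.ZipFile(epub_path, "w") as outzip:
        # The mimetype entry goes in first and is stored uncompressed.
        outzip.write(mimetype_path, "mimetype", zipfile.ZIP_STORED)
        for root, _dirs, names in os.walk(book_dir):
            for name in sorted(names):
                full = os.path.join(root, name)
                # Zip entry names always use forward slashes.
                arcname = os.path.relpath(full, book_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already added above
                outzip.write(full, arcname, zipfile.ZIP_DEFLATED)
```

Write the output file somewhere outside `book_dir`, otherwise the walk would try to pack the half-written archive into itself.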

Since you're a Linux user, you could also use a shell script.

Alternatively, you could also run your Python code in Calibre Editor as a function or write a Sigil plugin.

This way all the packing and unpacking is handled by the hosting app.
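As a rough illustration of the plugin route, the sketch below shows the general shape of a Sigil edit plugin's `plugin.py`. It assumes the book container that Sigil passes to `run()` offers `text_iter()`, `readfile()` and `writefile()`; `fix_text` is a made-up placeholder for whatever the script should change, and a real plugin also needs a `plugin.xml` next to it.

```python
import re


def fix_text(html):
    """Hypothetical edit: strip whitespace before closing </p> tags."""
    return re.sub(r"\s+</p>", "</p>", html)


def run(bk):
    # Sigil calls run() with a book container; text_iter() yields
    # (manifest id, href) pairs for the XHTML files in the book.
    for file_id, _href in bk.text_iter():
        html = bk.readfile(file_id)
        new_html = fix_text(html)
        if new_html != html:
            # Only write back files that actually changed.
            bk.writefile(file_id, new_html)
    return 0  # 0 signals success to Sigil
```

The script never touches the zip archive here; the host application unpacks the book before calling `run()` and repacks it afterwards.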

roger64 · 01-16-2016, 02:55 AM

@Doitsu

Thanks for sharing this code.

I believed that I could import any .xhtml file directly from the Calibre editor...

Doitsu · 01-16-2016, 03:21 AM

@roger64: Since your Python code seems to have something to do with footnotes, also check out my AddIDs plugin.
If you have the same number of footnote references and footnotes (and both are in the same order), you might be able to use it to assign the proper ids to footnote references and footnotes. (You'd run it twice: once for the footnote references and once for the footnote definitions.)

roger64 · 01-16-2016, 07:50 AM

Quote:

Originally Posted by Doitsu

@roger64: Since your Python code seems to have something to do with footnotes, also check out my AddIDs plugin.
If you have the same number of footnote references and footnotes (and both are in the same order), you might be able to use it to assign the proper ids to footnote references and footnotes. (You'd run it twice: once for the footnote references and once for the footnote definitions.)

This script goes just beyond that step. If the links are broken (even if only the return links), I use two regexes, as you do, to recreate them: first the return links, with bad chapter numbers, then the body links, which point to the return file. In the return links, the body chapter number is still missing or wrong. The script then retrieves the missing body chapter numbers and writes them to a new xhtml file.

If you are interested, I can PM you a test file using this script. It's quite efficient and quick except for this defect... Maybe it could be integrated into your plugin.

I had first thought I could make a Calibre function out of this, but no support seems to be available and I don't know how to proceed...
https://www.mobileread.com/forums/sho...41&postcount=1

Doitsu · 01-16-2016, 08:09 AM

Quote:

Originally Posted by roger64

This script goes just beyond that step. If the links are broken (even if only the return links), I use two regexes to recreate them: first the return links, with bad chapter numbers, then the body links, which point to the return file. In the return links, the body chapter number is still missing or wrong. The script then retrieves the missing body chapter numbers and writes them to a new xhtml file.

If your book contains, for example, 10 footnote references and 10 footnote definitions in a separate file, and all of them are tagged with unique classes, you could simply first override all existing footnote reference backlink ids with id="fnbl1..10" and then all footnote definitions with id="fn1..10". You'd then only need one or two regex searches to add the required links/backlinks.
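The renumbering idea can be sketched in plain Python as follows. This is only an illustration, not the AddIDs plugin: the function name `renumber_ids` and the class names used in the usage note are hypothetical, and the regex assumes each tag has exactly the given `class="..."` attribute value.

```python
import itertools
import re


def renumber_ids(html, css_class, prefix):
    """Give every tag with class=css_class the id prefix1, prefix2, ..."""
    counter = itertools.count(1)
    # Assumption: the class attribute holds exactly one class name.
    tag_re = re.compile(
        r'<(\w+)([^>]*\bclass="%s"[^>]*)>' % re.escape(css_class))
    id_re = re.compile(r'\s+id="[^"]*"')

    def repl(match):
        attrs = id_re.sub("", match.group(2))  # drop any existing id
        return '<%s id="%s%d"%s>' % (
            match.group(1), prefix, next(counter), attrs)

    return tag_re.sub(repl, html)
```

Calling `renumber_ids(text, "fnref", "fnbl")` on the body text and `renumber_ids(notes, "fn", "fn")` on the notes file would give the two matching id series, provided references and footnotes appear in the same order.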

Quote:

Originally Posted by roger64

If you are interested, I can PM you a test file using this script. It's quite efficient and quick except for this defect... Maybe it could be integrated into your plugin.

I'm mostly interested in developing plugins that can be repeatedly used; developing plugins just to fix a one-off special problem simply doesn't make sense.

BTW, also check out the Sigil footnote plugin.

roger64 · 01-16-2016, 09:16 AM

I think we are not talking about the same thing. Your plugin checks the ids on both sides. This script checks the chapter numbers on the return side (on the body side, they usually all point to the same chapter containing the notes, so it's easy to check).

Even if the ids are correct, a wrong chapter number is enough to break the return link. So a safety check of the chapter numbers can confirm that the links work in both directions. Ids and chapter numbers are the two variable elements of any link.
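A chapter-number check of this kind could look roughly like the sketch below. It is not the script discussed in this thread: the function name and the dictionary input are invented for illustration, and it assumes links are written as `href="file#ftnN" id="bodyftnN"` in the body and `href="file#bodyftnN" id="ftnN"` in the notes file, with `href` directly before `id`.

```python
import re

# Footnote reference in the body: href="...#ftnN" id="bodyftnN"
BODY_LINK = re.compile(r'href="[^"#]*#ftn(\d+)"\s+id="bodyftn\1"')
# Return link in the notes file: href="file#bodyftnN" id="ftnN"
RETURN_LINK = re.compile(r'href="([^"#]*)#bodyftn(\d+)"\s+id="ftn\2"')


def find_bad_return_links(chapters, notes_name):
    """Map note id -> (file named in the return link, file it should name).

    chapters maps file names to XHTML text; notes_name is the notes file.
    """
    # Record which file each footnote reference actually lives in.
    home = {}
    for name, text in chapters.items():
        if name == notes_name:
            continue
        for m in BODY_LINK.finditer(text):
            home[m.group(1)] = name
    # Compare every return link against that record.
    bad = {}
    for m in RETURN_LINK.finditer(chapters[notes_name]):
        target, note_id = m.group(1), m.group(2)
        if note_id in home and home[note_id] != target:
            bad[note_id] = (target, home[note_id])
    return bad
```

An empty result means every return link names the chapter that really holds its reference; the ids themselves still have to be checked separately.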

I asked a friend to write this script because I had to deal with some books with broken links. To put back the missing (or wrong) chapter numbers, I had to do it manually, jumping from one to another or...

I'll show you.

roger64 · 01-17-2016, 12:06 AM

Interested people can now follow this thread here:

https://www.mobileread.com/forums/sho...66&postcount=1

@Doitsu
Thanks for your expert help in debugging the script.

Script from roger64's first post (the Spoiler in post #1):

#!/usr/bin/python3.5
import re, os, sys, glob

pref,suff='chapter','.xhtml'

# Find the file with the highest number
fichiers=glob.glob('%s[0-9]%s'%(pref,suff))
def num_fichier(fic):
    k=re.search(r'%s(\d+)%s'%(pref,suff),fic)
    if k:
        return int(k.group(1))
fichiers.sort(key=num_fichier)
der=num_fichier(fichiers[-1])
print("der=%d"%(der))

# Check that the output file does not already exist
out='%s%smodif%s'%(pref,der,suff)
if os.path.lexists(out):
    sys.stderr.write("\nWarning: the file %s already exists\n\n"%out)
    exit(1)

# Search for: href="file#ftnx" id="bodyftnx"
rec_lien=re.compile(r'(href="%s(?P<fil>\d+)%s#ftn(?P<id>\d+)"\s+id="bodyftn(?P=id)")'%(pref,suff))
# Search for: href="last_file#ftnx" id="bodyftnx"
rec_lien99=re.compile(r'(href="%s%s%s#ftn(?P<id>\d+)"\s+id="bodyftn(?P=id)")'%(pref,der,suff))
# Search for: href="last_file#bodyftnx" id="ftnx"
lien99=r'(href="%s)%s(%s#bodyftn%%s"\s+id="ftn%%s)"'%(pref,der,suff)

# List of links in all files except the last one
liste_liens=[]
for num_fic in range(1,der):
    try:
        with open('%s%s%s'%(pref,num_fic,suff),'r') as fic:
            liens=rec_lien.findall(fic.read())
    except FileNotFoundError:
        continue
    for lien in liens:
        num_fil=lien[1]
        num_id=lien[2]
        if num_fil=='%s'%der:
            liste_liens.append((num_fic,num_fil,num_id,lien[0]))

# fichier[id]: number of the file that contains the link with number id
fichier={}
for num_fic,num_fil,num_id,lien in liste_liens:
    # print("%s%-2s%s %2s %2s => %s"%(pref,num_fic,suff,num_fil,num_id,lien))
    fichier[num_id]=str(num_fic)

# Rewrite the links in the last file
with open('%s%s%s'%(pref,der,suff),'r') as fic:
    f99=fic.read()
f99bis=f99
for id in fichier:
    k=re.search(lien99%(id,id),f99bis)
    if k:
        f99bis=f99bis[:k.start(0)]+k.group(1)+fichier[id]+k.group(2)+f99bis[k.end(0):]

# Write the result
with open(out,'w') as fic:
    fic.write(f99bis)
