01-15-2016, 07:15 PM   #1
roger64
Work on an unzipped EPUB xhtml file

Hi

I use Linux. If I unzip an EPUB, I can run a Python script from a terminal on the .xhtml files, and in this way perform some tasks that I am unable to do directly on the EPUB.
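
To make that workflow concrete, here is a minimal sketch of the unzip and re-zip round trip using only the standard zipfile module. The function names unpack_epub and repack_epub are hypothetical and not part of any EPUB tool. The one EPUB-specific detail it relies on is that the mimetype entry has to be the first member of the archive and must be stored without compression.

```python
import os
import zipfile


def unpack_epub(epub_path, work_dir):
    """Extract every member of the EPUB (a ZIP archive) into work_dir."""
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(work_dir)


def repack_epub(work_dir, epub_path):
    """Zip work_dir back into an EPUB, with 'mimetype' first and uncompressed.

    epub_path should live outside work_dir, otherwise the archive
    would try to include itself.
    """
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype entry goes in first and is stored, not deflated.
        zf.write(os.path.join(work_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for root, dirs, files in os.walk(work_dir):
            dirs.sort()  # deterministic member order
            for name in sorted(files):
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, work_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written above
                zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Usage would be along the lines of unpack_epub("book.epub", "work"), then running the script inside the folder that holds the chapter files, then repack_epub("work", "book-modified.epub"); those file names are only placeholders.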

However, things do not appear to be as easy as that, especially when it comes to saving the output. Are there any recommendations to follow for modifying these .xhtml files safely?
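
One precaution I can think of for the saving step is sketched below. It is only a sketch: it assumes the files are UTF-8 (usual for EPUB content, but worth checking in the XML declaration), and the helper names read_text and save_text_atomically are made up for illustration. The idea is to read and write with an explicit encoding, leave line endings untouched, and write to a temporary file that is renamed over the target only once the write has succeeded.

```python
import os
import shutil
import tempfile


def read_text(path):
    """Read a file as UTF-8 (an assumption) without translating line endings."""
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read()


def save_text_atomically(path, text):
    """Write text to a temporary file next to path, then rename it into place."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        if os.path.exists(path):
            shutil.copymode(path, tmp)  # keep the original file permissions
        os.replace(tmp, path)  # atomic when both paths are on the same filesystem
    except BaseException:
        # Something failed: remove the temporary file and leave path untouched.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

With this approach an interrupted run leaves the original chapter file intact instead of a half-written one.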

The goal is to modify one of these files and import it back into the EPUB. This is what the script looks like so far.

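#!/usr/bin/python3.5

import re, os, sys, glob

pref,suff='chapter','.xhtml'

# Find the file with the highest number
def num_fichier(fic):
    k=re.search(r'%s(\d+)%s'%(re.escape(pref),re.escape(suff)),fic)
    if k: return int(k.group(1))
# Keep only names of the form chapter<number>.xhtml, so the sort key is never None
fichiers=[f for f in glob.glob('%s*[0-9]%s'%(pref,suff)) if num_fichier(f) is not None]
if not fichiers:
    sys.exit("No %s<number>%s file found in the current directory"%(pref,suff))
fichiers.sort(key=num_fichier)
der=num_fichier(fichiers[-1])
print("der=%d"%(der))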

# Check that the output file does not already exist
out='%s%smodif%s'%(pref,der,suff)
if os.path.lexists(out):
    sys.stderr.write("\nWarning: file %s already exists\n\n"%out)
    sys.exit(1)

# Search for: href="file#ftnx" id="bodyftnx"
rec_lien=re.compile(r'(href="%s(?P<fil>\d+)%s#ftn(?P<id>\d+)"\s+id="bodyftn(?P=id)")'%(re.escape(pref),re.escape(suff)))
# Search for: href="last_file#ftnx" id="bodyftnx" (not used further down)
rec_lien99=re.compile(r'(href="%s%s%s#ftn(?P<id>\d+)"\s+id="bodyftn(?P=id)")'%(re.escape(pref),der,re.escape(suff)))
# Search for: href="last_file#bodyftnx" id="ftnx" (the two %%s are filled in later with the id)
lien99=r'(href="%s)%s(%s#bodyftn%%s"\s+id="ftn%%s)"'%(re.escape(pref),der,re.escape(suff))

# List the links found in every file except the last one
liste_liens=[]
for num_fic in range(1,der):
    try:
        # EPUB XHTML is normally UTF-8, so state the encoding explicitly
        with open('%s%s%s'%(pref,num_fic,suff),'r',encoding='utf-8') as fic:
            liens=rec_lien.findall(fic.read())
    except FileNotFoundError:
        continue
    for lien in liens:
        num_fil=lien[1]
        num_id=lien[2]
        if num_fil=='%s'%der:
            liste_liens.append((num_fic,num_fil,num_id,lien[0]))

# fichier[id]: number of the file that contains the link with that id
fichier={}
for num_fic,num_fil,num_id,lien in liste_liens:
    # print("%s%-2s%s %2s %2s => %s"%(pref,num_fic,suff,num_fil,num_id,lien))
    fichier[num_id]=str(num_fic)

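# Rewrite the links in the last file
with open('%s%s%s'%(pref,der,suff),'r',encoding='utf-8') as fic: f99=fic.read()
f99bis=f99
for num_id in fichier:
    k=re.search(lien99%(num_id,num_id),f99bis)
    if k:
        # Resume at k.end(2), not k.end(0): the closing quote matched after
        # group 2 is not part of any group and would otherwise be dropped
        f99bis=f99bis[:k.start(0)]+k.group(1)+fichier[num_id]+k.group(2)+f99bis[k.end(2):]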

# Write the result
with open(out,'w',encoding='utf-8') as fic: fic.write(f99bis)
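
As a cross-check on the last step, the same rewrite can be tried on a small string with a single re.sub call and a callback. This is a sketch under assumptions, not a replacement for the script: the function name retarget_backlinks and the sample markup are invented, and the mapping plays the role of the fichier dictionary built above.

```python
import re


def retarget_backlinks(text, pref, der, suff, fichier):
    """Point each footnote backlink of the last file at the chapter holding its call.

    fichier maps a footnote id (as a string) to a chapter number (as a string).
    """
    pattern = re.compile(
        r'(href="%s)%s(%s#bodyftn(\d+)"\s+id="ftn\3")'
        % (re.escape(pref), der, re.escape(suff))
    )

    def swap(m):
        target = fichier.get(m.group(3))
        if target is None:
            return m.group(0)  # no known caller for this id: leave the link alone
        # Keep everything except the chapter number between the two groups.
        return m.group(1) + target + m.group(2)

    return pattern.sub(swap, text)


sample = '<a href="chapter12.xhtml#bodyftn3" id="ftn3">3</a>'
print(retarget_backlinks(sample, 'chapter', 12, '.xhtml', {'3': '5'}))
```

Because the closing quote sits inside the second group here, it survives the substitution, and an id such as 30 is not confused with 3.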

Am I missing something obvious? Any practical recommendations would be appreciated.

Last edited by roger64; 01-15-2016 at 07:28 PM.