Hi,
In the Comments field, I have a lot of information formatted in HTML (scraped via the noosfere plugin; noosfere is a French SF database),
for example: original title, series and issue number, translator, cover artist and more...
I would like to extract some of this data into specific custom fields.
So, as a newbie, I've tried to learn regex and to use Python in Calibre, but I'm now lost ;(
Example of a Comments field:
Code:
<div>
<p>Référence: <a href="https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238">https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238 </a></p>
<p>Couverture: <a href="https://images.noosfere.org/couv/b/bragelonne857-2015.jpg">https://images.noosfere.org/couv/b/bragelonne857-2015.jpg </a></p>
<p>
Titre original : <em>Judgment of Tears / Dracula Cha Cha Cha, 1998 </em>
Cycle : <a href="https://www.noosfere.org/livres/serie.asp?numserie=1507">Anno Dracula </a><a href="https://www.noosfere.org/livres/editionslivre.asp?numitem=4312"><<== </a>vol. 3
Traduction de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=1588">Thierry ARSON </a>& <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190331">Leslie DAMANT-JEANDEL </a>
Illustration de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190818">Noëmie CHEVALIER </a>
<a href="https://www.noosfere.org/livres/editeur.asp?numediteur=-24371077">BRAGELONNE </a>(Paris, France), coll. <a href="https://www.noosfere.org/livres/collection.asp?NumCollection=1975550487&numediteur=-24371077">L'Ombre </a>
</p>
With a first call to the re(val, pattern, replacement) function, I can strip the HTML tags:
Code:
re(field('comments'), '<.+?>', '')
The result is:
Code:
Référence: https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238
Couverture: https://images.noosfere.org/couv/b/bragelonne857-2015.jpg
Titre original : Judgment of Tears / Dracula Cha Cha Cha, 1998
Cycle : Anno Dracula <<== vol. 3
Traduction de Thierry ARSON & Leslie DAMANT-JEANDEL
Illustration de Noëmie CHEVALIER
BRAGELONNE (Paris, France), coll. L'Ombre
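For comparison, the same tag-stripping step can be sketched in plain Python with re.sub. This is only an illustration that runs outside Calibre: the function name strip_tags and the shortened sample string are assumptions, not part of the noosfere plugin or of the template language.

```python
import re


def strip_tags(html: str) -> str:
    """Remove anything that looks like an HTML tag, like re(val, '<.+?>', '')."""
    return re.sub(r"<.+?>", "", html)


# Shortened sample taken from the comment shown above.
sample = (
    "<p>Illustration de "
    '<a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190818">'
    "Noëmie CHEVALIER </a></p>"
)

# strip() drops the trailing space left behind by the removed </a> tag.
print(strip_tags(sample).strip())
```

The lazy quantifier .+? stops at the first closing >, so each tag is removed on its own and the text between tags is kept.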
I now have a list of items separated by line breaks
(\n in regex, if I'm correct).
So far, so good.
To extract, for example, the cover artist,
I first tried to use
sublist(previous result,7,8,\n) to pull 'Illustration de Noëmie CHEVALIER' out of the previous result, and then to use
re(val, pattern, replacement) again to delete 'Illustration de '.
The problem is that this item is not always at the 7th position (for example, a French stand-alone book will have no original title, no series and no translator).
Is there a function to look up the item in the list that begins with 'Illustration de ' (pattern ^Illustration de, or a different prefix for the other data I want to retrieve)?
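In plain Python, the kind of lookup being asked for might look like the sketch below: search the stripped text for the line that starts with a given label, wherever that line happens to be. The helper name find_item is hypothetical, and this is not Calibre template code.

```python
import re
from typing import Optional


def find_item(text: str, label: str) -> Optional[str]:
    """Return what follows `label` on the first line starting with it, or None."""
    # re.MULTILINE makes ^ and $ match at the start and end of every line.
    pattern = r"^" + re.escape(label) + r"[ \t]*(.*?)[ \t]*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


# Part of the stripped comment shown above.
text = """Titre original : Judgment of Tears / Dracula Cha Cha Cha, 1998
Traduction de Thierry ARSON & Leslie DAMANT-JEANDEL
Illustration de Noëmie CHEVALIER
BRAGELONNE (Paris, France), coll. L'Ombre"""

print(find_item(text, "Illustration de"))
print(find_item(text, "Cycle :"))  # label absent from this sample, so None
```

Because the match is anchored on the label rather than on a line number, a book with no original title, series or translator is handled the same way.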
I have thought of using
switch(val, [pattern, value,]+ else_value) or
lookup(val, [pattern, field,]+ else_field), but I don't understand very well how to use these functions.
Perhaps I have to use a for loop, something like this?
Code:
for x in range(9):
    if contains(sublist(previous result, x, x+1, \n), '^Illustration de ', true, false)
        sublist(previous result, x, x+1, \n)
    fi
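The loop idea can be sketched in plain Python without any fixed position: walk over the lines and test each one against a table of label prefixes. The LABELS mapping and the field names (original_title, translator, cover_artist) are hypothetical and only stand in for whatever custom fields are wanted; this is not Calibre template code.

```python
from typing import Dict

# Hypothetical mapping: label prefix in the comment -> custom field name.
LABELS = {
    "Titre original :": "original_title",
    "Traduction de": "translator",
    "Illustration de": "cover_artist",
}


def extract_fields(text: str) -> Dict[str, str]:
    """Scan every line and keep the part that follows a known label."""
    found: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        for label, field_name in LABELS.items():
            if line.startswith(label):
                found[field_name] = line[len(label):].strip()
                break  # a line can match at most one label
    return found


# Illustrative stand-alone book: no original title, series or translator.
standalone = (
    "Illustration de Noëmie CHEVALIER\n"
    "BRAGELONNE (Paris, France), coll. L'Ombre"
)
print(extract_fields(standalone))
```

Labels that are missing from a comment are simply absent from the returned dictionary, so each custom field can be filled or left empty independently.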
Thanks for your help and advice!
Regards