MobileRead Forums - View Single Post - Lookup into a list, an item beginning by...

Marco24 · 09-30-2021, 08:44 AM

Hi,

In the Comments field, I have a lot of information formatted in HTML (scraped via the noosfere plugin; noosfere is a French SF database),
for example: original title, series and issue number, translator, cover artist and more.

I would like to extract some of this data into specific custom fields.

So, as a newbie, I've tried to learn regex and to use Python in Calibre, but I'm now lost ;(

Example of a Comments field:

Code:

<div> 
<p>Référence: <a href="https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238">https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238 </a></p> 
<p>Couverture: <a href="https://images.noosfere.org/couv/b/bragelonne857-2015.jpg">https://images.noosfere.org/couv/b/bragelonne857-2015.jpg </a></p> 
<p>
Titre original : <em>Judgment of Tears / Dracula Cha Cha Cha, 1998 </em>
Cycle : <a href="https://www.noosfere.org/livres/serie.asp?numserie=1507">Anno Dracula </a><a href="https://www.noosfere.org/livres/editionslivre.asp?numitem=4312">&lt;&lt;== </a>vol. 3 
Traduction de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=1588">Thierry ARSON </a>&amp; <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190331">Leslie DAMANT-JEANDEL </a>
Illustration de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190818">Noëmie CHEVALIER </a>
<a href="https://www.noosfere.org/livres/editeur.asp?numediteur=-24371077">BRAGELONNE </a>(Paris, France), coll. <a href="https://www.noosfere.org/livres/collection.asp?NumCollection=1975550487&amp;numediteur=-24371077">L'Ombre </a>
</p>

With a first re(val, pattern, replacement) call I can delete the HTML tags:

Code:

re(field('comments'), '<.+?>', '')

the result is

Code:

Référence: https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238  
Couverture: https://images.noosfere.org/couv/b/bragelonne857-2015.jpg  

Titre original : Judgment of Tears / Dracula Cha Cha Cha, 1998 
Cycle : Anno Dracula &lt;&lt;== vol. 3 
Traduction de Thierry ARSON &amp; Leslie DAMANT-JEANDEL 
Illustration de Noëmie CHEVALIER 
BRAGELONNE (Paris, France), coll. L'Ombre
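
To check my understanding outside Calibre, here is the same step as a minimal plain-Python sketch. It is not Calibre template code, and `strip_tags` is just a name I made up. It also decodes the leftover entities (&amp;amp;, &amp;lt;) after the tags are gone, which my re() call above does not do:

```python
import html
import re

def strip_tags(comment: str) -> str:
    """Remove HTML tags, then decode entities such as &amp; and &lt;."""
    # Same non-greedy pattern as in the re() call above.
    without_tags = re.sub(r"<.+?>", "", comment)
    # Decode entities only after the tags are gone, so that a decoded
    # "<<==" cannot be mistaken for a tag.
    return html.unescape(without_tags)

sample = (
    'Traduction de <a href="x">Thierry ARSON </a>&amp; '
    '<a href="y">Leslie DAMANT-JEANDEL </a>'
)
print(strip_tags(sample))
```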

I now have a list of items separated by line breaks (\n in regex, if I'm correct).

so far, so good

For example, to extract the cover artist:

First, I tried sublist(previous result,7,8,\n) to pull 'Illustration de Noëmie CHEVALIER' out of the previous result, and then re(val, pattern, replacement) again to delete 'Illustration de '.

The problem is that this item is not always at position 7 (for example, a French stand-alone book will have no original title, no series and no translator).
Is there a function that looks up the item in the list beginning with 'Illustration de ' (that is, matching ^Illustration de ), or with another prefix for the other data I want to retrieve?
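
To make the goal concrete, this is roughly what I am after, written as a plain-Python sketch rather than Calibre template code. `find_item` and `stripped` are names I invented for illustration; I am not claiming Calibre has a function with this name:

```python
from typing import Optional

def find_item(text: str, prefix: str) -> Optional[str]:
    """Return what follows `prefix` on the first line starting with it, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None  # no line starts with this prefix (e.g. a book without illustrator)

# The tag-stripped comment from above, shortened.
stripped = """Titre original : Judgment of Tears / Dracula Cha Cha Cha, 1998
Traduction de Thierry ARSON & Leslie DAMANT-JEANDEL
Illustration de Noëmie CHEVALIER
BRAGELONNE (Paris, France), coll. L'Ombre"""

print(find_item(stripped, "Illustration de "))
```

The position of the line no longer matters here, only its prefix, which is the behaviour I would like to reproduce in a Calibre template.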

I have thought about using switch(val, [pattern, value,]+ else_value) or lookup(val, [pattern, field,]+ else_field), but I don't understand very well how to use these functions.

Perhaps I have to use a for loop, something like this (pseudocode)?

Code:

for x in range(9):
  if contains(sublist(previous result,x,x+1,\n), '^Illustration de ', true, false)
    sublist(previous result,x,x+1,\n)
  fi

Thanks for your help and advice!

Regards
