Looking up the item in a list that begins with...

Marco24 · 09-30-2021, 08:44 AM

Hi,

In the Comments field, I have a lot of information formatted in HTML (scraped via the noosfere plugin; noosfere is a French SF database),
for example: original title, series and issue number, translator, cover artist and more.

I would like to extract some of this data into specific custom fields.

So, as a newbie, I've tried to learn regex and to use Python in Calibre, but I'm now lost ;(

Example of a Comments field:

Code:

<div> 
<p>Référence: <a href="https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238">https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238 </a></p> 
<p>Couverture: <a href="https://images.noosfere.org/couv/b/bragelonne857-2015.jpg">https://images.noosfere.org/couv/b/bragelonne857-2015.jpg </a></p> 
<p>
Titre original : <em>Judgment of Tears / Dracula Cha Cha Cha, 1998 </em>
Cycle : <a href="https://www.noosfere.org/livres/serie.asp?numserie=1507">Anno Dracula </a><a href="https://www.noosfere.org/livres/editionslivre.asp?numitem=4312">&lt;&lt;== </a>vol. 3 
Traduction de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=1588">Thierry ARSON </a>&amp; <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190331">Leslie DAMANT-JEANDEL </a>
Illustration de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190818">Noëmie CHEVALIER </a>
<a href="https://www.noosfere.org/livres/editeur.asp?numediteur=-24371077">BRAGELONNE </a>(Paris, France), coll. <a href="https://www.noosfere.org/livres/collection.asp?NumCollection=1975550487&amp;numediteur=-24371077">L'Ombre </a>
</p>

With a first re(val, pattern, replacement) call I can delete the HTML tags:

Code:

re(field('comments'), '<.+?>', '')

The result is:

Code:

Référence: https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238  
Couverture: https://images.noosfere.org/couv/b/bragelonne857-2015.jpg  

Titre original : Judgment of Tears / Dracula Cha Cha Cha, 1998 
Cycle : Anno Dracula &lt;&lt;== vol. 3 
Traduction de Thierry ARSON &amp; Leslie DAMANT-JEANDEL 
Illustration de Noëmie CHEVALIER 
BRAGELONNE (Paris, France), coll. L'Ombre

I now have a list of items separated by end-of-line characters (\n in regex, if I'm correct).

So far, so good.

To extract, for example, the cover artist:

First, I tried sublist(previous result,7,8,\n) to extract 'Illustration de Noëmie CHEVALIER' from the previous result, and then re(val, pattern, replacement) again to delete 'Illustration de '.

The problem is that this item is not always in the 7th position (for example, a French stand-alone book will have no original title, no series and no translator).
Is there a function to look up the item in the list that begins with 'Illustration de ' (or with something else, for the other data I want to retrieve)?

I have thought of using switch(val, [pattern, value,]+ else_value) or lookup(val, [pattern, field,]+ else_field), but I don't understand very well how to use these functions.

Perhaps I have to use a for loop, something like this?

Code:

for x in range(9):
  if contains(sublist(previous result, x, x+1, \n), '^Illustration de ', true, false)
    sublist(previous result, x, x+1, \n)
  fi

Thanks for your help and advice!

Regards

chaley · 09-30-2021, 11:27 AM

One way to do it is with a loop. Something like

Code:

program:
	for i in '0,1,2,3,4,5,6,7,8,9':
		v = list_item($comments, i, '\n');
		if substr(v, 0, 16) == 'Illustration de ' then
			# compute the result
			result = 'whatever';
			break
		fi
	rof

Another way is to use re() to strip out the information of interest. Something like this:

Code:

	re($comments, '(?ms)(?:^|.*\n)<p>bar(.*?)(\n|$)', '\1')

The re() function doesn't set the option that lets . match newlines, which is why the flags (?ms) were added at the beginning of the pattern: (?s) lets . match newlines, and (?m) lets ^ and $ match at line boundaries.

The first method will be much slower but gives you more control over what happens if there isn't a match. The speed doesn't matter if you are doing the operation in search/replace or in the Action Chains plugin.
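
For readers who want to try the regex idea outside the template language, here is a minimal Python sketch. It is a variant, not the template above: it uses a search with a capture group instead of a substitution, and the helper name `capture_after` and the sample text are assumptions made for illustration.

```python
import re

# Hypothetical sample: tag-stripped comment text, one item per line.
text = (
    "Titre original : Ausgebrannt, 2007\n"
    "Traduction de Frédéric WEINMANN\n"
    "Illustration de Matthias KULKA\n"
    "L'ATALANTE"
)


def capture_after(prefix, text):
    """Return the rest of the first line that starts with prefix, or ''."""
    # (?m) lets ^ and $ match at line boundaries. The capture stops at the
    # end of the line because '.' does not match a newline without (?s).
    match = re.search(r"(?m)^" + re.escape(prefix) + r"(.*)$", text)
    return match.group(1).strip() if match else ""


print(capture_after("Illustration de ", text))
```

Returning an empty string when nothing matches keeps the "no such item" case explicit, which is the control the loop version offers.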

Marco24 · 10-01-2021, 05:56 AM

Hi chaley,

Thanks a lot for your help!

But the list_item function doesn't work.

Here is what I tried:

Code:

program:
list_item(re(field('comments'),'<.+?>','') , 0, '\n')

The following data contains \n, but Calibre doesn't recognize it as the separator passed to the list_item function.

I've tried r'\n', ''\n'', r''\n'' and r"[\n]", but none of them work.

Code:

Référence: https://www.noosfere.org/livres/niourf.asp?numlivre=2146572761 \nCouverture: https://images.noosfere.org/couv/a/atalante420-2009.jpg \nTitre original : Ausgebrannt, 2007 Première parution : Bastei-Lübbe, 2007 Traduction de Frédéric WEINMANN Illustration de Matthias KULKA

chaley · 10-01-2021, 06:38 AM

Sorry, I forgot that the template language parser doesn't handle escaped characters. Use the function character('newline') instead of '\n'.

You might be happier/more successful implementing this in Python as a user-defined template function.
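
A rough idea of what the Python side could look like, as a plain function only. The name `find_item` and its parameters are hypothetical, and the wiring into Calibre's user-defined template function dialog is not shown; the sample lines are taken from the comment in the first post.

```python
import html
import re


def find_item(comments_html, prefix):
    """Return the text that follows prefix on the first line starting with it."""
    # Strip tags the same way as the template call re(..., '<.+?>', '').
    text = re.sub(r"<.+?>", "", comments_html)
    # Decode entities such as &amp; that remain after the tags are removed.
    text = html.unescape(text)
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return ""  # no matching item, e.g. a book with no cover artist listed


sample = (
    "<p>\n"
    'Traduction de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=1588">Thierry ARSON </a>'
    '&amp; <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190331">Leslie DAMANT-JEANDEL </a>\n'
    'Illustration de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190818">Noëmie CHEVALIER </a>\n'
    "</p>"
)

print(find_item(sample, "Illustration de "))
print(find_item(sample, "Traduction de "))
```

Because the lookup goes by prefix rather than by position, it does not matter whether the item is the 7th line or the 2nd, and a missing item simply gives an empty result.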

Marco24 · 10-01-2021, 09:26 AM

It works: thanks a lot!
