Old 09-30-2021, 08:44 AM   #1
Marco24
Junior Member
Marco24 began at the beginning.
 
Posts: 9
Karma: 10
Join Date: Mar 2019
Location: Paris, France
Device: Kobo Aura Edition 2
Look up an item in a list that begins with...

Hi,

In the Comments field, I have a lot of information formatted in HTML (scraped via the noosfere plugin; noosfere is a French SF database),
for example: original title, series and issue number, translator, cover artist and more...

I would like to extract some of this data into specific custom fields.

So, as a newbie, I've tried to learn regex and to use Python in Calibre, but I'm now lost ;(

Example of a Comments field:
Code:
<div> 
<p>Référence: <a href="https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238">https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238 </a></p> 
<p>Couverture: <a href="https://images.noosfere.org/couv/b/bragelonne857-2015.jpg">https://images.noosfere.org/couv/b/bragelonne857-2015.jpg </a></p> 
<p>
Titre original : <em>Judgment of Tears / Dracula Cha Cha Cha, 1998 </em>
Cycle : <a href="https://www.noosfere.org/livres/serie.asp?numserie=1507">Anno Dracula </a><a href="https://www.noosfere.org/livres/editionslivre.asp?numitem=4312">&lt;&lt;== </a>vol. 3 
Traduction de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=1588">Thierry ARSON </a>&amp; <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190331">Leslie DAMANT-JEANDEL </a>
Illustration de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190818">Noëmie CHEVALIER </a>
<a href="https://www.noosfere.org/livres/editeur.asp?numediteur=-24371077">BRAGELONNE </a>(Paris, France), coll. <a href="https://www.noosfere.org/livres/collection.asp?NumCollection=1975550487&amp;numediteur=-24371077">L'Ombre </a>
</p>

With a first re(val, pattern, replacement) call I can strip the HTML tags:

Code:
re(field('comments'), '<.+?>', '')
The result is:

Code:
Référence: https://www.noosfere.org/livres/niourf.asp?numlivre=2146591238  
Couverture: https://images.noosfere.org/couv/b/bragelonne857-2015.jpg  

Titre original : Judgment of Tears / Dracula Cha Cha Cha, 1998 
Cycle : Anno Dracula &lt;&lt;== vol. 3 
Traduction de Thierry ARSON &amp; Leslie DAMANT-JEANDEL 
Illustration de Noëmie CHEVALIER 
BRAGELONNE (Paris, France), coll. L'Ombre

I now have a list of items separated by end-of-line characters (\n in regex, if I'm correct).

so far, so good

To extract, for example, the cover artist:

First, I tried sublist(previous result,7,8,\n) to extract 'Illustration de Noëmie CHEVALIER' from the previous result, and then re(val, pattern, replacement) again to delete 'Illustration de '.

The problem is that this item is not always at the 7th position (for example, a French stand-alone book will have no original title, no series and no translator).
Is there any function to look up the item in the list that begins with 'Illustration de ' (regex ^Illustration de ), or with some other prefix for the other data I want to retrieve?

I have thought of using switch(val, [pattern, value,]+ else_value) or lookup(val, [pattern, field,]+ else_field), but I don't understand very well how to use these functions.

Perhaps I have to use a for loop, something like this?
Code:
for x in range(9):
  if contains( sublist (previous result,x,x+1,\n)) , ^'Illustration de ',true, false)
	sublist(previous result,x,x+1,\n)
	fi
Thanks for your help and advice!

regards
Old 09-30-2021, 11:27 AM   #2
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 12,444
Karma: 8012886
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
One way to do it is with a loop. Something like
Code:
program:
	for i in '0,1,2,3,4,5,6,7,8,9':
		v = list_item($comments, i, '\n');
		if substr(v, 0, 16) == 'Illustration de ' then
# compute the result
			result = 'whatever';
			break
		fi
	rof
Another way is to use re() to strip out the information of interest. Something like this:
Code:
	re($comments, '(?ms)(?:^|.*\n)<p>bar(.*?)(\n|$)', '\1')
The re() function doesn't set the option to let .* match newlines which is why the flags (?ms) were added to the beginning.

The first method will be much slower but gives you more control over what happens if there isn't a match. The speed doesn't matter if you are doing the operation in search/replace or in the Action Chains plugin.
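To illustrate the point about the (?ms) flags, here is a minimal sketch using Python's re module. It assumes that Calibre's re() behaves like Python's re.sub, which is not stated above. The sample text is a shortened version of the comment from the first post, the prefix 'Illustration de ' stands in for the placeholder in the pattern, and a trailing .* is added (an addition of this sketch) so that the rest of the text is dropped as well.

```python
import re

# Shortened sample, based on the stripped comment text from the first post.
text = "\n".join([
    "Titre original : Judgment of Tears",
    "Traduction de Thierry ARSON",
    "Illustration de Noëmie CHEVALIER",
    "BRAGELONNE (Paris, France)",
])

# Skip everything up to the wanted line, capture the rest of that line,
# then swallow whatever follows so that only the capture survives.
pattern = r"(?:^|.*\n)Illustration de (.*?)(\n|$).*"

# Without (?s), '.' stops at each newline, so the match cannot reach back
# over several lines and the unmatched first line stays in the result.
without_flags = re.sub(pattern, r"\1", text)

# With (?ms), '.' also matches newlines and '^'/'$' match at line boundaries,
# so the whole text is matched and replaced by the captured name.
with_flags = re.sub("(?ms)" + pattern, r"\1", text)

print(repr(without_flags))
print(repr(with_flags))  # → 'Noëmie CHEVALIER'
```

If no line begins with the prefix, re.sub leaves the text unchanged, which is the "less control when there is no match" trade-off mentioned above.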
Old 10-01-2021, 05:56 AM   #3
Marco24
Hi chaley

Thanks a lot for your help!

But the list_item function doesn't work.

I tried this:

Code:
program:
list_item(re(field('comments'),'<.+?>','') , 0, '\n')
In the following data there are \n characters, but Calibre doesn't recognize them as the separator passed to the list_item function.

I've tried r'\n', ''\n'', r''\n'' and r"[\n]", but none of them work.

Code:
Référence: https://www.noosfere.org/livres/niourf.asp?numlivre=2146572761 \nCouverture: https://images.noosfere.org/couv/a/atalante420-2009.jpg \nTitre original : Ausgebrannt, 2007 Première parution : Bastei-Lübbe, 2007 Traduction de Frédéric WEINMANN Illustration de Matthias KULKA
Old 10-01-2021, 06:38 AM   #4
chaley
Sorry, I forgot that the template language parser doesn't handle escaped characters. Use the function character('newline') instead of '\n'.

You might be happier/more successful implementing this in python as a user defined template function.
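
As a rough sketch of what the core of such a Python function could look like: strip the tags, split on real newline characters, and return whatever follows the wanted prefix. The name find_item and its parameters are hypothetical, the sample is cut down from the comment in the first post, and wiring this into Calibre as a user-defined template function is not shown here.

```python
import html
import re


def find_item(comments, prefix):
    """Return the text after `prefix` on the first line starting with it.

    Returns '' when no line matches. Hypothetical helper, not part of Calibre.
    """
    # Strip HTML tags, like the re(..., '<.+?>', '') call in the first post.
    text = re.sub(r"<.+?>", "", comments)
    # Decode entities such as &amp; only after the tags are gone, so that a
    # decoded '<' cannot be mistaken for the start of a tag.
    text = html.unescape(text)
    # splitlines() splits on real newline characters, so no escaped
    # separator string is needed.
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return ""


# Sample cut down from the comment shown in the first post.
sample = (
    "<p>\n"
    "Titre original : <em>Judgment of Tears / Dracula Cha Cha Cha, 1998 </em>\n"
    'Traduction de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=1588">Thierry ARSON </a>'
    '&amp; <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190331">Leslie DAMANT-JEANDEL </a>\n'
    'Illustration de <a href="https://www.noosfere.org/livres/auteur.asp?NumAuteur=2147190818">Noëmie CHEVALIER </a>\n'
    "</p>"
)

artist = find_item(sample, "Illustration de ")
translators = find_item(sample, "Traduction de ")
missing = find_item(sample, "Cycle : ")  # no such line in this sample
print(artist, "|", translators, "|", repr(missing))
```

Because the lookup goes by prefix rather than by position, it does not matter whether the item is on the 7th line or elsewhere, and a missing item simply gives an empty string.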
Old 10-01-2021, 09:26 AM   #5
Marco24
It works: thanks a lot!