Old 08-14-2022, 11:51 PM   #1
lizzie1170
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
Extract text from selected books, convert them to tags, and add them to metadata.

I wrote some Python code in VS Code that searches the text of an EPUB book using regular expressions. The regular expressions come from patterns I formulated for the tags in my library. I have already built over 400 tags this way and keep them in a custom column, with an @ symbol at the beginning to distinguish them from tags downloaded from other sources. I have 3000+ books and I want to run all 400+ regular expressions against each of them.

I need help because my code only handles a single book. What I want to set up is:
** Run the code on the selected books in my library (books_ids).
** Add the tags found to the metadata.
** Add a verification tag confirming that the book was processed.

The truth is that my knowledge of Python is very poor; I only learned about regular expressions thanks to Calibre.

Code:
import re
import ast
from epub_conversion.utils import open_book, convert_epub_to_lines
import colorama
colorama.init()

book = open_book("Cthulhu Mythos.epub")
lines = convert_epub_to_lines(book)
with open("test_dict.txt", "r") as data:
    tags_dict = ast.literal_eval(data.read())

# Pass 1: print the text each regex matched, the first time that regex hits a line.
print(colorama.Back.YELLOW + 'Matches(regex - book text):', colorama.Style.RESET_ALL)
temp = []     # regexes that have already matched
res = dict()  # tag name -> regex, for every tag found
for line in lines:
    for key, value in tags_dict.items():
        if re.search(value, line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                for m in re.finditer(value, line):
                    print(colorama.Fore.MAGENTA + key, ":", colorama.Style.RESET_ALL + m.group())

# Pass 2: print the name of every tag whose regex matched somewhere in the book.
print('\n', colorama.Back.YELLOW + 'Found tags:', colorama.Style.RESET_ALL)
temp = []
res = dict()
for line in lines:
    for key, value in tags_dict.items():
        if re.search(value, line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                print(colorama.Fore.GREEN + key, end=", ")

print('\n\n' + colorama.Back.YELLOW + "N° found tags:", colorama.Style.RESET_ALL, len(temp))
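
For reference, the matching described above can probably be done in one pass that returns the tags instead of printing them, which is the shape the rest of the workflow needs. This is only a sketch: the names find_tags and prefix, and the sample dictionary, are made up for illustration, and it assumes the dictionary maps a tag name to a regular expression as in test_dict.txt.

```python
import re


def find_tags(lines, tags_dict, prefix="@"):
    """Return the sorted tag names whose regex matches at least one line.

    `prefix` is a hypothetical parameter; pass "" if the dictionary keys
    already start with "@".
    """
    # Compile each pattern once instead of once per line.
    compiled = {name: re.compile(pattern) for name, pattern in tags_dict.items()}
    found = set()
    for line in lines:
        for name, regex in compiled.items():
            # Skip tags that were already found on an earlier line.
            if name not in found and regex.search(line):
                found.add(name)
    return sorted(prefix + name for name in found)


# Made-up sample data, standing in for the book text and test_dict.txt.
sample_dict = {"cthulhu": r"\bCthulhu\b", "dagon": r"\bDagon\b"}
sample_lines = ["Ph'nglui mglw'nafh Cthulhu R'lyeh", "nothing to see here"]
tags = find_tags(sample_lines, sample_dict)
print(tags)
```

Returning a list keeps the search separate from whatever later writes the tags into the library.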

The attached images show what I need to run.

I'd appreciate any help with the code. Thank you very much.
Attached Thumbnails: print_code.PNG, Execute.png
Attached Files: test_dict.txt (383 Bytes)
Old 08-15-2022, 03:11 AM   #2
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
There are a lot of plugins that act on multiple books in the library, read them somehow and then update the metadata. I think a good example of this would be the Count Pages plugin. It actually reads the books, extracts the text, processes it (calculates various statistics) and then updates the metadata with those values. That fits fairly well with what you are doing. I would suggest looking at that plugin and then asking questions about anything you don't understand or need to do.

Alternatively, it might be possible to do this using the Action Chains plugin. It might be able to run the commands, and it has ways to update the metadata for a book. I haven't explored it, so I am not sure if it would work.

Another way is to do this externally using various calibre commands, including calibredb. With that, you could use calibredb to search for the books to be changed, convert them to text outside calibre, process the text and then use calibredb to update the metadata.
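
A rough sketch of the external calibredb route, driven from ordinary Python. The helper names, the library path, the search expression and the "#mytags" column are all assumptions for illustration; the calibredb subcommands used (search, set_metadata with --field) are the standard ones, but check them against the calibre version you have installed.

```python
import subprocess


def search_cmd(query, library):
    """Build the calibredb command that lists the ids of matching books."""
    return ["calibredb", "search", query, "--with-library", library]


def set_field_cmd(book_id, field, values, library):
    """Build the calibredb command that writes `values` into one field."""
    return ["calibredb", "set_metadata", str(book_id),
            "--field", field + ":" + ",".join(values),
            "--with-library", library]


def parse_ids(output):
    """calibredb search prints the matching ids as a comma-separated list."""
    return [int(part) for part in output.strip().split(",") if part.strip()]


def run(cmd):
    """Run a command and return what it printed (needs calibre installed)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Hypothetical values, shown without actually calling calibredb.
library = r"C:\Calibre Portable\Calibre Library"
print(search_cmd("formats:EPUB", library))
print(set_field_cmd(12, "#mytags", ["@cthulhu", "@processed"], library))
```

As far as I know, set_metadata replaces the whole value of the field, so existing tags would have to be read first and included in the new list.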
Old 08-15-2022, 11:06 AM   #3
capink
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
Quote:
Originally Posted by davidfor View Post
Alternatively, it might be possible to do this using the Action Chains plugin. It might be able to run the commands, and it has ways to update the metadata for a book. I haven't explored it, so I am not sure if it would work.
As David points out, you can run your code through the Action Chains plugin to process multiple books. There is an action called "Run Python Code" that allows you to do just that:
  • Create a chain that contains a single "Run Python Code" action.
  • Click the settings button next to the action. A window will pop up where you can paste the following code:
    Code:
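    def tags_from_epub(path_to_epub):
        # The code from your post should go here. The run()
        # function below passes in path_to_epub, so modify
        # your code to use it instead of the hardcoded epub path.
        pass  # placeholder so the function is valid until you fill it in
    
    def run(gui, settings, chain):
        db = gui.current_db
        # Loop over the books the chain is acting on.
        for book_id in chain.scope().get_book_ids():
            # db.formats() gives a comma-separated string such as 'EPUB,MOBI';
            # the "or ''" guards against a book that has no formats at all.
            fmts = [fmt.strip() for fmt in (db.formats(book_id, index_is_id=True) or '').split(',')]
            if 'EPUB' in fmts:
                path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
                tags_from_epub(path_to_epub)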
    As you can see, you have to put your code in the first function and modify it to use path_to_epub.
  • Now, you should find a menu entry with the name of your chain in the Action Chains menu. You can even bind it to a keyboard shortcut if you want.

Edit: The chain you created will only act on books selected in the list view.
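
The two remaining goals from the first post (writing the found tags to the metadata and adding a verification tag) could be sketched along these lines. Treat it as an untested outline: the column name "#mytags", the marker "@processed" and the merge_tags helper are invented for the example, tags_from_epub is a stub assumed to return a list of tag names, and the run() part only works inside the "Run Python Code" action.

```python
def merge_tags(existing, found, marker="@processed"):
    """Combine the current column values with new tags plus a marker tag."""
    merged = list(existing or ())
    for tag in list(found) + [marker]:
        if tag not in merged:
            merged.append(tag)
    return merged


def tags_from_epub(path_to_epub):
    # Stand-in for the regex search; assumed to return a list of tag names.
    return []


def run(gui, settings, chain):
    # Only meaningful inside calibre; nothing here runs outside the plugin.
    db = gui.current_db.new_api
    column = "#mytags"  # hypothetical lookup name of the custom column
    for book_id in chain.scope().get_book_ids():
        if "EPUB" not in db.formats(book_id):
            continue
        path_to_epub = db.format_abspath(book_id, "EPUB")
        found = tags_from_epub(path_to_epub)
        existing = db.field_for(column, book_id)
        # set_field takes a mapping of book id to the new value.
        db.set_field(column, {book_id: merge_tags(existing, found)})


# The merging step can be tried on its own, outside calibre.
example = merge_tags(("@horror",), ["@cthulhu", "@horror"])
print(example)
```

Keeping the merge in a small function of its own means it can be checked without a library open, and the marker tag makes it easy to search for books that still need processing.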

Last edited by capink; 08-15-2022 at 11:09 AM.
Old 08-15-2022, 06:59 PM   #4
lizzie1170
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
Thank you both for the answers, senseis, but I have a small problem: I use Calibre Portable on Windows 7, where 3.48 is the latest version available. Action Chains requires version 5.25.0 or later. Do you know of an earlier version of Action Chains that is compatible with Windows 7?
Old 08-17-2022, 08:06 AM   #5
capink
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
No. Action Chains only works with calibre 5+.
Old 08-17-2022, 08:41 AM   #6
JSWolf
Resident Curmudgeon
Posts: 79,740
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Given that you'll miss out as more and more plugins are moving on, I highly suggest upgrading to Windows 10 if you can.
Old 08-17-2022, 09:35 AM   #7
Quoth
Still reading
Posts: 14,010
Karma: 105092227
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
Quote:
Originally Posted by JSWolf View Post
Given that you'll miss out as more and more plugins are moving on, I highly suggest upgrading to Windows 10 if you can.
Or Windows 11, Mac M2 with Mac OS or Linux 64 bit LTS versions.
Old 08-17-2022, 08:49 PM   #8
lizzie1170
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
I followed the recommendations but I'm having trouble reading the content of epubs.
Attached Thumbnails: Error_module.PNG
Old 08-17-2022, 09:30 PM   #9
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by lizzie1170 View Post
I followed the recommendations but I'm having trouble reading the content of epubs.
You need to have a Python module called "epub_conversion.utils" available to you. You have that in your original script. Where is it coming from? It does not look like a calibre module and I cannot find "convert_epub_to_lines" in the calibre source. You will need to either add this module so that calibre can see it when running, or change the code to use calibre functions. I know the Count Pages plugin does this (extracts the text from an epub), so you can look at that to see how to do it.
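
If neither of those is convenient, one more possibility is to read the epub with nothing but the Python standard library, since an epub is a zip archive of (X)HTML files. The sketch below is an assumption-laden stand-in, not what Count Pages or epub_conversion does: epub_to_lines is a made-up name, it reads the files in alphabetical order instead of spine order, and it keeps any text found in title or style elements.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def epub_to_lines(epub_file):
    """Return the non-empty text lines of every HTML file in the epub."""
    lines = []
    with zipfile.ZipFile(epub_file) as zf:
        # Alphabetical order is only an approximation of the reading order.
        for name in sorted(zf.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                collector = _TextCollector()
                collector.feed(zf.read(name).decode("utf-8", errors="replace"))
                text = "".join(collector.parts)
                lines.extend(l.strip() for l in text.splitlines() if l.strip())
    return lines


# Build a tiny fake epub in memory so the function can be tried anywhere.
fake_epub = io.BytesIO()
with zipfile.ZipFile(fake_epub, "w") as zf:
    zf.writestr("OEBPS/ch1.xhtml",
                "<html><body><p>Cthulhu sleeps</p>\n<p>Dagon</p></body></html>")
fake_epub.seek(0)
lines = epub_to_lines(fake_epub)
print(lines)
```

The function accepts a path as well as a file object, so the path_to_epub that run() passes in could be handed to it directly.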
Old 08-17-2022, 10:18 PM   #10
lizzie1170
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
I wrote the code initially just for an epub book in Visual Studio and it worked for me. Applying code with Calibre functions is another multiverse, that's why I need help putting it in the context of Calibre.
Old 08-19-2022, 02:21 AM   #11
lizzie1170
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
I tried to imitate the "Count Pages" code to read the content of EPUBs. When I run the code it tells me: "TypeError: 'function' object is not iterable"

Code:
import re
from calibre_plugins.action_chains.actions.base import ChainAction
RE_HTML_BODY = re.compile(u'<body[^>]*>(.*)</body>', re.UNICODE | re.DOTALL | re.IGNORECASE)
with open("test_dict.txt", "r") as f:
    tags_dict = f.read()

def _extract_body_text(data):
    '''Get the body text of this html content wit any html tags stripped'''
    body = RE_HTML_BODY.findall(data)

def tags_from_epub(path_to_epub):
    temp = []
    res = dict()
    for line in _extract_body_text:
        for key,value in tags_dict.items():
         if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value                
                regex = re.compile(value) 
                match_array = regex.finditer(line) 
                match_list = list(match_array)
                for m in match_list:
                    print(key, ":",m.group())
    
def run(gui, settings, chain):
    db = gui.current_db
    for book_id in chain.scope().get_book_ids():
        fmts = [ fmt.strip() for fmt in db.formats(book_id, index_is_id=True).split(',') ]
        if 'EPUB' in fmts:
            path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
            tags_from_epub(path_to_epub)
Old 08-19-2022, 06:31 AM   #12
capink
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
Quote:
Originally Posted by lizzie1170 View Post
I wrote the code initially just for an epub book in Visual Studio and it worked for me. Applying code with Calibre functions is another multiverse, that's why I need help putting it in the context of Calibre.
Calibre comes with its own copy of Python. As such, you cannot (directly) import modules from your system Python. You have two options:
  • Follow davidfor's suggestion and develop your solution using code from the calibre libraries.
  • Modify sys.path to include the path(s) from which you want to import the original module. You can search the forum for examples. I am not sure what adverse effects this might have, as I have never done it before, so you will have to research this yourself.
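
For the second option, the general shape of a sys.path change looks something like this. The folder shown is a hypothetical example of where a system Python keeps its packages, and add_import_path is a made-up helper; whether a module imported this way actually works inside calibre is not something this sketch can promise, especially if it was installed for a different Python version.

```python
import sys


def add_import_path(path):
    """Put `path` on sys.path once, so modules stored there can be imported."""
    if path not in sys.path:
        sys.path.append(path)
    return path in sys.path


# Hypothetical location of the system Python's installed packages.
extra = r"C:\Python310\Lib\site-packages"
add_import_path(extra)
add_import_path(extra)  # calling it again does not add a duplicate entry

# After this, an import such as
#     from epub_conversion.utils import open_book
# would also be looked up in that folder.
print(sys.path.count(extra))
```

Appending keeps calibre's own modules first in the search order, so the extra folder is only consulted for names calibre does not already provide.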
Old 08-19-2022, 08:20 AM   #13
davidfor
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
Quote:
Originally Posted by lizzie1170 View Post
I tried to imitate the "Count Pages" code to read the content of EPUBs. When I run the code it tells me: "TypeError: 'function' object is not iterable"

Well, you get that error because you didn't actually call the function. "_extract_body_text" appears to be a function that takes a string of some sort, but when you used it, you treated it as something you could loop over.

And that doesn't look anything like what Count Pages does. It opens the epub as an iterator, then iterates through the files in the spine, extracts the text from each of them and combines them into one long chunk of text. Then it processes that. You have passed "path_to_epub" into your function but never actually used it. From the Count Pages plugin, you need to look at statistic.py and follow the flow starting with "get_word_count".
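
To make the difference concrete, here is a small stand-alone illustration of the error and of the call that avoids it. It is a simplified sketch, not the Count Pages code: extract_body_text is rewritten here so that it returns the text, and the tag-stripping regex and the sample HTML are invented for the example.

```python
import re

RE_HTML_BODY = re.compile(r'<body[^>]*>(.*)</body>', re.DOTALL | re.IGNORECASE)
RE_TAG = re.compile(r'<[^>]+>')


def extract_body_text(data):
    """Return the body text of an HTML string with the tags stripped."""
    bodies = RE_HTML_BODY.findall(data)
    if not bodies:
        return ''
    # The function has to return the text; without a return it gives None.
    return RE_TAG.sub('', bodies[0])


html = "<html><body><p>one</p>\n<p>two</p></body></html>"

# Looping over the function itself reproduces the reported error.
try:
    for line in extract_body_text:
        pass
except TypeError as err:
    error_message = str(err)

# Calling it with the HTML gives a string that can be split into lines.
lines = extract_body_text(html).splitlines()
print(error_message)
print(lines)
```

In the posted code there is a second thing to watch for: tags_dict is filled with f.read(), which is a plain string, so tags_dict.items() would fail next. The first script used ast.literal_eval to turn that text into a dictionary.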
Old 08-19-2022, 02:13 PM   #14
JSWolf
Resident Curmudgeon
Posts: 79,740
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by Quoth View Post
Or Windows 11, Mac M2 with Mac OS or Linux 64 bit LTS versions.
The reason I didn't mention Windows 11 is that a lot of computers that ran Windows 7 back when it was new would probably not be able to upgrade to Windows 11 due to its CPU security requirements.
Old 08-19-2022, 06:53 PM   #15
BetterRed
null operator (he/him)
Posts: 21,717
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
I can't run Windows 11 on my Dell XPS 8920 (i7-7700 @ 3.6 GHz 16 GB RAM) that came with Windows 10 in 2017.