Extract text from selected books, convert them to tags, and add them to metadata.

lizzie1170 · 08-14-2022, 11:51 PM

I wrote some Python code in VS Code that searches the text of an EPUB book by matching it against regular expressions. The regular expressions come from patterns I formulated for the tags in my library. I have already managed to get over 400 tags this way and I keep them in a custom column; I add the @ symbol at the beginning to tell them apart from tags downloaded from other sources. I have 3000+ books and I want each of them to be checked against these 400+ regular expressions.

I need help because my code only searches a single book, and what I want to set up is:
** Run the code on selected books from my library (books_ids).
** Found tags are added to the metadata.
** Add a verification tag confirming that the book was processed.

The truth is that my knowledge of Python is very poor; I only learned about regular expressions thanks to Calibre.

Code:

import re
import ast
from epub_conversion.utils import open_book, convert_epub_to_lines
import colorama
colorama.init()

book = open_book("Cthulhu Mythos.epub")
lines = convert_epub_to_lines(book)
with open("test_dict.txt", "r") as data:
    tags_dict = ast.literal_eval(data.read())

print(colorama.Back.YELLOW + 'Matches(regex - book text):', colorama.Style.RESET_ALL)
temp = []
res = dict()
for line in lines:
    for key, value in tags_dict.items():
        if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                regex = re.compile(value)
                match_array = regex.finditer(line)
                match_list = list(match_array)
                for m in match_list:
                    print(colorama.Fore.MAGENTA + key, ":", colorama.Style.RESET_ALL + m.group())

print('\n', colorama.Back.YELLOW + 'Found tags:', colorama.Style.RESET_ALL)
temp = []
res = dict()
for line in lines:
    for key, value in tags_dict.items():
        if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value
                print(colorama.Fore.GREEN + key, end=", ")

print('\n\n' + colorama.Back.YELLOW + "N° found tags:", colorama.Style.RESET_ALL, len(temp))

I show you in images, what I need to execute.

I'd appreciate any help with the code, thank you very much.
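
To run the same matching on many books, it may help to wrap the matching step in a function that takes the book's lines and the tag dictionary and returns the tag names that matched. The following is only a minimal sketch: `find_tags` and the sample data are hypothetical names, and it assumes the dictionary maps tag name to regex pattern, as `test_dict.txt` appears to.

```python
import re

def find_tags(lines, tags_dict):
    """Return the sorted tag names whose pattern matches at least one line."""
    # Compile each pattern once instead of once per line.
    compiled = {tag: re.compile(pattern) for tag, pattern in tags_dict.items()}
    found = set()
    for line in lines:
        for tag, regex in compiled.items():
            # Skip patterns that already matched an earlier line.
            if tag not in found and regex.search(line):
                found.add(tag)
    return sorted(found)

# Hypothetical sample data, only to show the call.
sample_dict = {"@cthulhu": r"\bCthulhu\b", "@vampires": r"\bvampires?\b"}
sample_lines = ["The call of Cthulhu echoed.", "Nothing else here."]
print(find_tags(sample_lines, sample_dict))  # prints ['@cthulhu']
```

Returning a list instead of printing makes it easier to pass the result on to whatever writes the metadata.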

davidfor · 08-15-2022, 03:11 AM

There are a lot of plugins that act on multiple books in the library, read them somehow and then update the metadata. I think a good example of this would be the Count Pages plugin. That actually reads the books, extracts the text and processes it (calculates various statistics) and then updates the metadata with those values. That fits fairly well with what you are doing. I would suggest looking at that plugin and then asking questions about what you don't understand and need to do.

Alternatively, it might be possible to do this using the Action Chains plugin. It might be able to run the commands and it has ways to update the metadata for a book. I haven't explored it, so I am not sure if it would work.

Another way is to do this externally using various calibre commands including calibredb. With that, you could use calibredb to search for the books to be changed, convert them to text externally to calibre, process the text and then use calibredb to update the metadata.
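
The external calibredb route could look roughly like the sketch below. It is untested against any particular calibre version and rests on assumptions: that `calibredb list --for-machine` prints JSON with the book id and format paths, and that `calibredb set_custom` takes the column label without the leading `#` and accepts `--append`. The function names, library path and column label are hypothetical.

```python
import json
import subprocess

def list_epub_books(library_path, search="formats:EPUB"):
    """Ask calibredb for the ids and format paths of the matching books."""
    cmd = ["calibredb", "list", "--with-library", library_path,
           "--fields", "formats", "--search", search, "--for-machine"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)  # assumed: a list of dicts with "id" and "formats"

def build_set_custom_cmd(library_path, column, book_id, tags):
    """Build the calibredb command that appends tags to a custom column."""
    # Assumption: `column` is the label without the leading '#'.
    return ["calibredb", "set_custom", "--with-library", library_path,
            "--append", column, str(book_id), ",".join(tags)]

# Usage (not executed here, since it needs calibre installed):
#   for book in list_epub_books("/path/to/library"):
#       cmd = build_set_custom_cmd("/path/to/library", "mytags", book["id"], ["@example"])
#       subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems with tag names that contain spaces.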

capink · 08-15-2022, 11:06 AM

Quote:

Originally Posted by davidfor

Alternatively, it might be possible to do this using the Action Chains plugin. It might be able to run the commands and it has ways to update the metadata for a book. I haven't explored it, so I am not sure if it would work.

As David points out, you can run your code through the Action Chains plugin to process multiple books. There is an action called "Run Python Code" that allows you to do just that:

Create a chain that contains a single "Run Python Code" action.

Click the settings button next to the action, a window will pop up where you can copy the following code:

Code:

def tags_from_epub(path_to_epub):
    # The code you posted in your post should go
    # here. The run() function below will pass
    # path_to_epub, so modify it to use this instead
    # of the hardcoded epub path.
    pass  # placeholder so the function is valid until your code is added

def run(gui, settings, chain):
    db = gui.current_db
    for book_id in chain.scope().get_book_ids():
        # formats() returns a comma-separated string, or None if the book has no formats
        fmts = [ fmt.strip() for fmt in (db.formats(book_id, index_is_id=True) or '').split(',') ]
        if 'EPUB' in fmts:
            path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
            tags_from_epub(path_to_epub)

As you can see, you have to include your code in the first function and modify it to use path_to_epub.

Now, you should find a menu entry with the name of your chain in the Action Chains menu. You can even bind it to a keyboard shortcut if you want.

Edit: The chain you created will only act on books selected in the list view.
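
For the two remaining goals (writing the found tags to the metadata and marking the book as processed), the `run()` function could be extended along these lines. This is a sketch under assumptions, not tested inside Action Chains: the custom column lookup name `#mytags` and the verification tag `@processed` are hypothetical, `tags_from_epub` is a stub standing in for your own matching code, and it uses `new_api` calls (`format_abspath`, `field_for`, `set_field`) instead of the legacy `db` methods shown above.

```python
TAGS_COLUMN = "#mytags"        # hypothetical custom column lookup name
PROCESSED_TAG = "@processed"   # hypothetical verification tag

def tags_from_epub(path_to_epub):
    """Stub: replace with code that returns the list of tags found in the EPUB."""
    return []

def run(gui, settings, chain):
    db = gui.current_db.new_api
    updates = {}
    for book_id in chain.scope().get_book_ids():
        path_to_epub = db.format_abspath(book_id, "EPUB")
        if not path_to_epub:
            continue  # the book has no EPUB format, skip it
        # Keep the tags already in the column and add the new ones.
        current = db.field_for(TAGS_COLUMN, book_id, default_value=()) or ()
        found = tags_from_epub(path_to_epub)
        updates[book_id] = sorted(set(current) | set(found) | {PROCESSED_TAG})
    if updates:
        # One write for all selected books.
        db.set_field(TAGS_COLUMN, updates)
```

Collecting the changes in a dictionary and writing them once keeps the number of database writes low when many books are selected.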

lizzie1170 · 08-15-2022, 06:59 PM


Thanks to both senseis for the answers, but I have a little problem.

I have Calibre Portable on Windows 7, where version 3.48 is the latest available. Action Chains requires version 5.25.0 or later. Do you know of any earlier version of it that is compatible with Windows 7?

capink · 08-17-2022, 08:06 AM

No. Action Chains only works with calibre 5+.

JSWolf · 08-17-2022, 08:41 AM

Given that you'll miss out as more and more plugins are moving on, I highly suggest upgrading to Windows 10 if you can.

Quoth · 08-17-2022, 09:35 AM

Quote:

Originally Posted by JSWolf

Given that you'll miss out as more and more plugins are moving on, I highly suggest upgrading to Windows 10 if you can.

Or Windows 11, Mac M2 with Mac OS or Linux 64 bit LTS versions.

lizzie1170 · 08-17-2022, 08:49 PM


I followed the recommendations but I'm having trouble reading the content of epubs.

davidfor · 08-17-2022, 09:30 PM

Quote:

Originally Posted by lizzie1170

I followed the recommendations but I'm having trouble reading the content of epubs.

You need to have a Python module called "epub_conversion.utils" available to you. You have that in your original script. Where is it coming from? It does not look like a calibre module and I cannot find "convert_epub_to_lines" in the calibre source. You will need to either add this module so that it is visible when running in calibre, or change the code to use calibre functions. I know the Count Pages plugin does this (extracts the text from an epub), so you can look at that for how to do it.

lizzie1170 · 08-17-2022, 10:18 PM

Quote:

Originally Posted by davidfor

You need to have a Python module called "epub_conversion.utils" available to you. You have that in your original script. Where is it coming from? It does not look like a calibre module and I cannot find "convert_epub_to_lines" in the calibre source. You will need to either add this module so that it is visible when running in calibre, or change the code to use calibre functions. I know the Count Pages plugin does this (extracts the text from an epub), so you can look at that for how to do it.

I wrote the code initially just for an epub book in Visual Studio and it worked for me. Applying code with Calibre functions is another multiverse, that's why I need help putting it in the context of Calibre.

lizzie1170 · 08-19-2022, 02:21 AM

Quote:

Originally Posted by davidfor

You need to have a Python module called "epub_conversion.utils" available to you. You have that in your original script. Where is it coming from? It does not look like a calibre module and I cannot find "convert_epub_to_lines" in the calibre source. You will need to either add this module so that it is visible when running in calibre, or change the code to use calibre functions. I know the Count Pages plugin does this (extracts the text from an epub), so you can look at that for how to do it.

I tried to imitate the "Count Pages" code to read the content of EPUBs, but when I run the code it tells me: "TypeError: 'function' object is not iterable"

Code:

import re
from calibre_plugins.action_chains.actions.base import ChainAction
RE_HTML_BODY = re.compile(u'<body[^>]*>(.*)</body>', re.UNICODE | re.DOTALL | re.IGNORECASE)
with open("test_dict.txt", "r") as f:
    tags_dict = f.read()

def _extract_body_text(data):
    '''Get the body text of this html content wit any html tags stripped'''
    body = RE_HTML_BODY.findall(data)

def tags_from_epub(path_to_epub):
    temp = []
    res = dict()
    for line in _extract_body_text:
        for key,value in tags_dict.items():
         if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value                
                regex = re.compile(value) 
                match_array = regex.finditer(line) 
                match_list = list(match_array)
                for m in match_list:
                    print(key, ":",m.group())
    
def run(gui, settings, chain):
    db = gui.current_db
    for book_id in chain.scope().get_book_ids():
        fmts = [ fmt.strip() for fmt in db.formats(book_id, index_is_id=True).split(',') ]
        if 'EPUB' in fmts:
            path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
            tags_from_epub(path_to_epub)

capink · 08-19-2022, 06:31 AM

Quote:

Originally Posted by lizzie1170

I wrote the code initially just for an epub book in Visual Studio and it worked for me. Applying code with Calibre functions is another multiverse, that's why I need help putting it in the context of Calibre.

Calibre comes with its own copy of Python. As such, you cannot (directly) import modules from your system Python. You have two options:

Follow davidfor's suggestion and develop your solution using code from calibre libraries.
Modify sys.path to include the path(s) from which you want to import the original module. You can search the forum for examples. I am not sure what adverse effects this might lead to, as I have never done it before, so you will have to research this yourself.
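
The second option would look something like the sketch below. The directory shown is a hypothetical example of where a system Python keeps its packages, and `add_import_path` is an illustrative helper, not a calibre function. Note that this is only likely to work for pure-Python modules; a package with compiled extensions built for a different Python version may fail to import.

```python
import sys

def add_import_path(path):
    """Append a directory to sys.path once so modules in it can be imported."""
    if path not in sys.path:
        sys.path.append(path)

# Hypothetical location of the system Python's packages; adjust to your setup.
add_import_path(r"C:\Python38\Lib\site-packages")

# After this, an import such as the one in the original script could be attempted:
#   from epub_conversion.utils import open_book, convert_epub_to_lines
```

Appending (instead of inserting at the front) keeps calibre's own modules first in the search order, which should reduce the chance of shadowing them.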

davidfor · 08-19-2022, 08:20 AM

Quote:

Originally Posted by lizzie1170

I tried to imitate the "Count Pages" code to read the content of EPUBs, but when I run the code it tells me: "TypeError: 'function' object is not iterable"


Well, you get that error because you didn't actually call the method. "_extract_body_text" appears to be a method that takes a string of some sort. But, when you used it, you treated it as something else.

And that doesn't look anything like what Count Pages does. It opens the epub as an iterator, then iterates through the files in the spine, extracts the text from each of them and combines them into one long chunk of text. Then it processes that. You have passed "path_to_epub" into your method but never actually used it. From the Count Pages plugin, you need to look at statistic.py and follow the flow starting with "get_word_count".
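
To illustrate the flow described above (open the book, extract the text of each file, join it, then process it), here is a minimal sketch that uses only the Python standard library. It is not the Count Pages code and does not use calibre's own iterator: it treats the EPUB as a zip archive and reads its HTML files in archive order, not spine order, which should not matter for tag matching. The names `extract_body_text` and `epub_to_text` are illustrative. Note that the extraction function returns a string and is actually called with the HTML, which is what was missing in the code that raised the TypeError.

```python
import re
import zipfile

RE_HTML_BODY = re.compile(r"<body[^>]*>(.*)</body>", re.DOTALL | re.IGNORECASE)
RE_TAG = re.compile(r"<[^>]+>")

def extract_body_text(html):
    """Return the text inside <body> with the HTML tags replaced by spaces."""
    match = RE_HTML_BODY.search(html)
    body = match.group(1) if match else html
    return RE_TAG.sub(" ", body)

def epub_to_text(path_to_epub):
    """Join the text of every HTML/XHTML file found in the EPUB archive."""
    parts = []
    with zipfile.ZipFile(path_to_epub) as epub:
        for name in epub.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                html = epub.read(name).decode("utf-8", errors="replace")
                parts.append(extract_body_text(html))  # call the function on the data
    return "\n".join(parts)
```

The resulting string can be split into lines or searched as a whole with the tag regular expressions. HTML entities are left undecoded in this sketch.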

JSWolf · 08-19-2022, 02:13 PM

Quote:

Originally Posted by Quoth

Or Windows 11, Mac M2 with Mac OS or Linux 64 bit LTS versions.

The reason I didn't mention Windows 11 is that a lot of computers that ran Windows 7 back when it was new would probably not be able to upgrade to Windows 11 due to its CPU security requirements.

BetterRed · 08-19-2022, 06:53 PM

I can't run Windows 11 on my Dell XPS 8920 (i7-7700 @ 3.6 GHz 16 GB RAM) that came with Windows 10 in 2017.
