Extract text from selected books, convert them to tags, and add them to metadata. - Page 2

lizzie1170 · 08-21-2022, 02:25 AM

Quote:

Originally Posted by davidfor

Well, you get that error because you didn't actually call the method. "_extract_body_text" appears to be a method that takes a string of some sort. But, when you used it, you treated it as something else.

And that doesn't look anything like what Count Pages does. It opens the epub as an iterator, then iterates through the files in the spine, extracts the text from each of them and combines them into one long chunk of text. Then it processes that. You have passed "path_to_epub" into your method but never actually used it. From the Count Pages plugin, you need to look at statistic.py and follow the flow starting with "get_word_count".

I added the get_word_count definition, but it depends on other definitions. Running the code results in: TypeError: TagsFromEpub.run() takes 3 positional arguments but 4 were given.

Code:

from calibre.ebooks.oeb.iterator import EbookIterator
from calibre_plugins.action_chains.actions.base import ChainAction

with open("test_dict.txt", "r") as f:
    tags_dict = f.read()

class TagsFromEpub(ChainAction):
    name = 'Tags_F_Epub'
    support_scopes = True

    def get_word_count(iterator, book_path, icu_wordcount):
        '''Given an iterator for the epub (if already opened/converted), estimate a word count'''
        from calibre.utils.localization import get_lang
        if iterator is None:
            iterator = _open_epub_file(book_path)
            lang = iterator.opf.language
            lang = get_lang() if not lang else lang
            DEFAULT_STORE_VALUES = {}
            KEY_USE_ICU_WORDCOUNT = 'useIcuWordcount'
            icu_wordcount = c.get(cfg.KEY_USE_ICU_WORDCOUNT, cfg.DEFAULT_STORE_VALUES[cfg.KEY_USE_ICU_WORDCOUNT])
            count = _get_epub_standard_word_count(iterator, lang, icu_wordcount)
            print('\tWord count:', count)
            return iterator, count

    def _open_epub_file(book_path, strip_html=False):
        '''Given a path to an EPUB file, read the contents into a giant block of text'''
        iterator = EbookIterator(book_path)
        iterator.__enter__(only_input_plugin=True, run_char_count=True, read_anchor_map=False)
        return iterator
    
    def _get_epub_standard_word_count(iterator, lang='en', icu_wordcount=False):
        '''This algorithm counts individual words instead of pages'''
        book_text = _read_epub_contents(iterator, strip_html=True)
        wordcount = None
        if icu_wordcount:
            try:
                from calibre.spell.break_iterator import count_words
                print('\tWord count using icu_wordcount - trying to count_words')
                wordcount = count_words(book_text, lang)
                print('\tWord count - used count_words:', wordcount)
            except:
                try: # The above method is new and no-one will have it as of 08/01/2016.
                    print('\tWord count using icu_wordcount - trying to import split_into_words_and_positions')
                    from calibre.spell.break_iterator import split_into_words_and_positions
                    print('\tWord count - trying split_into_words_and_positions:')
                    wordcount = len(split_into_words_and_positions(book_text, lang))
                    print('\tWord count - used split_into_words_and_positions:', wordcount)
                except:
                    pass
        if not wordcount: # If not using icu wordcount, or it failed, use the old method.
            from calibre.utils.wordcount import get_wordcount_obj
            print('\tWord count using older method - trying get_wordcount_obj')
            wordcount = get_wordcount_obj(book_text)
            wordcount = wordcount.words
        return wordcount 
    
    def tags_from_epub(path_to_epub):
        temp = []
        res = dict()
        for line in wordcount:
            for key,value in tags_dict.items():
                if re.search(rf'{value}', line):
                    if value not in temp:
                        temp.append(value)
                        res[key] = value                
                        regex = re.compile(value) 
                        match_array = regex.finditer(line) 
                        match_list = list(match_array)
                        for m in match_list:
                            print(key, ":",m.group())
    
    def run(gui, settings, chain):
        db = gui.current_db
        for book_id in chain.scope().get_book_ids():
            fmts = [ fmt.strip() for fmt in db.formats(book_id, index_is_id=True).split(',') ]
            if 'EPUB' in fmts:
                path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
                tags_from_epub(path_to_epub)
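
For reference, the flow davidfor describes (open the EPUB, walk the files in spine order, pull the text out of each, join it all together) can be sketched with the standard library only. This is not the Count Pages code and it does not use calibre's EbookIterator; the names epub_to_text and _TextOnly are made up for illustration, and it assumes an unencrypted EPUB with UTF-8 content files and plain (not URL-encoded) hrefs.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

CONTAINER_NS = {'c': 'urn:oasis:names:tc:opendocument:xmlns:container'}
OPF_NS = {'opf': 'http://www.idpf.org/2007/opf'}


class _TextOnly(HTMLParser):
    """Collect text nodes and ignore the tags (a crude HTML stripper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def epub_to_text(path_to_epub):
    """Return the text of every spine item, joined into one string."""
    with zipfile.ZipFile(path_to_epub) as zf:
        # META-INF/container.xml names the OPF package file.
        container = ET.fromstring(zf.read('META-INF/container.xml'))
        opf_path = container.find('.//c:rootfile', CONTAINER_NS).get('full-path')
        opf = ET.fromstring(zf.read(opf_path))
        opf_dir = posixpath.dirname(opf_path)

        # Map manifest ids to file paths, then follow the spine order.
        manifest = {item.get('id'): item.get('href')
                    for item in opf.findall('.//opf:manifest/opf:item', OPF_NS)}
        parts = []
        for itemref in opf.findall('.//opf:spine/opf:itemref', OPF_NS):
            href = manifest[itemref.get('idref')]
            parser = _TextOnly()
            parser.feed(zf.read(posixpath.join(opf_dir, href)).decode('utf-8'))
            parts.append(''.join(parser.chunks))
    return '\n'.join(parts)
```

The point is only that path_to_epub has to be opened and read somewhere; the returned string is what the tag search would then run against.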

capink · 08-21-2022, 06:53 AM

Quote:

Originally Posted by lizzie1170

I added the get_word_count definition, but it depends on other definitions. Running the code results in: TypeError: TagsFromEpub.run() takes 3 positional arguments but 4 were given.

Why are you subclassing ChainAction?! It is intended for a completely different thing: creating custom actions in the module manager, not "Run Python Code".

For "Run Python Code", run() should be a standalone function, not a method of any class, as I previously told you in this post (note that there is NO mention of subclassing ChainAction). The other methods should be standalone functions as well.
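
The TypeError quoted above most likely comes from plain Python behaviour rather than anything calibre-specific: a method defined without a self parameter still receives the instance as its first argument, so a call with three arguments arrives as four. A minimal sketch (the class name Broken and the helper call_both are made up for the demonstration):

```python
class Broken:
    # No `self` parameter: Python still passes the instance as the first
    # positional argument, so a three-argument call arrives as four.
    def run(gui, settings, chain):
        return 'ran'


def run(gui, settings, chain):
    # Module-level function: receives exactly the three arguments passed.
    return 'ran'


def call_both():
    """Call both variants with three arguments and record what happens."""
    results = {}
    try:
        Broken().run('gui', {}, 'chain')
    except TypeError as err:
        results['method'] = str(err)
    results['function'] = run('gui', {}, 'chain')
    return results


if __name__ == '__main__':
    for name, outcome in call_both().items():
        print(name, '->', outcome)
```

The standalone function is the form "Run Python Code" expects, so the error goes away once the class wrapper is dropped.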

I do not understand what you are trying to do with your code, and I do not have the time to debug it. If you can get a working function that returns whatever tags you want, I can help from there. However, here are a couple of points regarding your code:

get_word_count() is defined but never called anywhere in the code.
In tags_from_epub() you reference a variable called wordcount, which is never assigned anywhere in the code.

P.S. If your main problem is converting the epub to text, the easiest way is using calibre's conversion as follows:

Code:

def convert_to_text(path_to_epub):
    import os, subprocess
    from calibre.ptempfile import PersistentTemporaryDirectory
    tdir = PersistentTemporaryDirectory('_temp_convert')
    output_file = os.path.join(tdir, 'temp.txt')
    cmd = 'ebook-convert "{}" "{}"'.format(path_to_epub, output_file)
    # shell takes a boolean; the string 'true' only worked because it is truthy
    subprocess.call(cmd, shell=True)
    return output_file

path_to_txt = convert_to_text(path_to_epub)

JSWolf · 08-23-2022, 04:21 PM

Quote:

Originally Posted by BetterRed

I can't run Windows 11 on my Dell XPS 8920 (i7-7700 @ 3.6 GHz 16 GB RAM) that came with Windows 10 in 2017.

I cannot run Windows 11 on my Surface Pro 2. But then, I'm fine with Windows 10 (for now).

lizzie1170 · 08-29-2022, 07:27 PM

Quote:

Originally Posted by capink

I do not understand what you are trying to do with your code, and I do not have the time to debug it. If you can get a working function that returns whatever tags you want, I can help from there.

I am very sorry for taking your time. In my original code I was able to get my tags, and I added an additional label called PROCESSED, but only with the print function. I don't really know how to do it with calibre functions. This is the code I have now.

Code:

import re
import ast
import os

with open(r"D:\User\Calibre Portable\Python_tareas\docs_pys\test_dict.txt") as f:
    tags_dict = f.read()

def convert_to_text(path_to_epub):
    import os, subprocess
    from calibre.ptempfile import PersistentTemporaryDirectory
    tdir = PersistentTemporaryDirectory('_temp_convert')
    output_file = os.path.join(tdir, 'temp.txt')
    cmd = 'ebook-convert "{}" "{}"'.format(path_to_epub, output_file)
    # shell takes a boolean; the string 'true' only worked because it is truthy
    subprocess.call(cmd, shell=True)
    return output_file

def tags_from_epub(path_to_epub):
    path_to_txt = convert_to_text(path_to_epub)
    temp = []
    res = dict()
    for line in path_to_txt:
        for key,value in tags_dict.items():
            if re.search(rf'{value}', line):
                if value not in temp:
                    temp.append(value)
                    res[key] = value
                    regex = re.compile(value)
                    match_array = regex.finditer(line)
                    match_list = list(match_array)
                    for m in match_list:
                        print(key)
    print("processed ")

def run(gui, settings, chain):
    db = gui.current_db
    for book_id in chain.scope().get_book_ids():
        fmts = [ fmt.strip() for fmt in db.formats(book_id, index_is_id=True).split(',') ]
        if 'EPUB' in fmts:
            path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
            tags_from_epub(path_to_epub)
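
Two things in the code above would stop it before any tag is printed: tags_dict is the raw string returned by f.read(), so it has no .items(), and "for line in path_to_txt" loops over the characters of the path instead of the lines of the file. A rough sketch of the matching step with both addressed follows. It assumes test_dict.txt holds a Python dict literal mapping a tag name to a regex pattern (the unused "import ast" hints at that, but it is a guess), and the function names load_tags_dict and find_tags are made up.

```python
import ast
import re


def load_tags_dict(path_to_dict):
    """Parse a file that holds a Python dict literal of tag name -> regex."""
    with open(path_to_dict, encoding='utf-8') as f:
        # literal_eval only accepts literals, so the file cannot run code.
        return ast.literal_eval(f.read())


def find_tags(path_to_txt, tags_dict):
    """Return the sorted tag names whose pattern matches any line of the file."""
    compiled = {tag: re.compile(pattern) for tag, pattern in tags_dict.items()}
    found = set()
    # Open the file and iterate over its lines; iterating over the path
    # string itself would yield single characters.
    with open(path_to_txt, encoding='utf-8', errors='replace') as f:
        for line in f:
            for tag, regex in compiled.items():
                if tag not in found and regex.search(line):
                    found.add(tag)
    return sorted(found)
```

Returning the tags instead of printing them is what makes it possible to hand them to calibre afterwards.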

capink · 08-30-2022, 04:13 PM

Try the chain attached to this post. To import it: click Action Chains > Add/Modify chains > right click chain dialog > import chain.

Edit: Try it first on a test book to see whether you want to modify it further to suit you.
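
The attached chain is not reproduced in the thread, so the following is not its content. It is only a rough sketch of the step asked about earlier, writing tags back with calibre functions instead of print. It assumes db is gui.current_db.new_api and relies on that object's field_for() and set_field() methods; the helper name add_tags_to_book is made up.

```python
def add_tags_to_book(db, book_id, new_tags):
    """Merge new_tags into the book's existing tags and return the merged list.

    db is assumed to be gui.current_db.new_api inside calibre.
    """
    existing = list(db.field_for('tags', book_id) or ())
    merged = list(existing)
    for tag in new_tags:
        # Keep existing tags and skip duplicates.
        if tag not in merged:
            merged.append(tag)
    if merged != existing:
        # set_field takes a mapping of book id -> new value.
        db.set_field('tags', {book_id: merged})
    return merged
```

Inside run() this could be called as add_tags_to_book(gui.current_db.new_api, book_id, tags + ['PROCESSED']) once a list of tags has been found for the book.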

lizzie1170 · 09-03-2022, 05:28 AM

Quote:

Originally Posted by capink

Try the chain attached to this post. To import it: click Action Chains > Add/Modify chains > right click chain dialog > import chain.

Edit: Try it first on a test book to see whether you want to modify it further to suit you.

Thank you very much but I get an error message.

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\AppData\\Local\\Temp\\calibre_ko1yesmi\\u0n9u8sm_temp_convert\\temp.txt'

calibre 6.3* Portable embedded-python: True
Windows-10-10.0.19041-SP0 Windows ('64bit', 'WindowsPE')
('Windows', '10', '10.0.19041')
Python 3.10.1

Traceback (most recent call last):
File "calibre_plugins.action_chains.action", line 449, in run_chain
File "calibre_plugins.action_chains.chains", line 390, in run
File "calibre_plugins.action_chains.chains", line 205, in _run_loop
File "calibre_plugins.action_chains.chains", line 182, in _run_loop
File "calibre_plugins.action_chains.actions.code", line 130, in run
File "module", line 36, in run
File "module", line 16, in tags_from_epub

Rellwood · 09-03-2022, 07:13 PM

If this plugin works, then I will be very happy. I have been trying to update my tags by using the ENF plugin but it comes back with the most nouns and I have to weed through those to find "Dragon" or "Vampire" or "Biker" or "Wizard". I could use "Powersearch" but that still requires me to create a tag.

capink · 09-04-2022, 11:17 AM

Quote:

Originally Posted by lizzie1170

Thank you very much but I get an error message.

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\AppData\\Local\\Temp\\calibre_ko1yesmi\\u0n9u8sm_temp_convert\\temp.txt'

It is working for me without producing this error. I am currently on Linux and do not have access to a Windows machine. Maybe it has something to do with the OS. Try the one attached below and see whether it makes a difference. Beyond that, I'm afraid I cannot help.
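
Nothing in the thread confirms the cause of the error. One possibility worth ruling out is that the conversion command itself failed: subprocess.call() only returns the exit status, so a failed ebook-convert leaves no temp.txt behind and the error only shows up later when the file is opened. A sketch of a check that reports the failure at the source (run_and_check is a made-up helper name):

```python
import os
import subprocess


def run_and_check(cmd, output_file):
    """Run cmd (a list of arguments) and confirm it produced output_file."""
    # With a list and no shell, a missing executable raises FileNotFoundError
    # right here instead of failing silently.
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError('command failed ({}): {}'.format(
            result.returncode, result.stderr.strip()))
    if not os.path.isfile(output_file):
        raise FileNotFoundError(
            'command finished but did not create ' + output_file)
    return output_file
```

For the conversion it would be called as run_and_check(['ebook-convert', path_to_epub, output_file], output_file), which assumes ebook-convert can be found on the PATH of the calibre process.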

BetterRed · 09-04-2022, 05:22 PM

Quote:

Originally Posted by lizzie1170

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\???????\\AppData\\Local\\Temp\\calibre_ko1yesmi\\u0n9u8sm_temp_convert\\temp.txt'

Quote:

Originally Posted by capink

It is working for me . . .

@lizzie1170, @capink

Which user?

BR

capink · 09-04-2022, 06:18 PM

Quote:

Originally Posted by BetterRed

@lizzie1170, @capink

Which user?

BR

I see what you are getting at. The problem is that this is not a hardcoded path, but a temporary path calculated by a calibre function. Why is it coming out this way? I don't know, and I cannot test on Windows. So I replaced the calibre function with a standard Python function, hoping that it might solve the problem.
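
The post does not say which standard function was swapped in. As an illustration only, tempfile.mkdtemp() from the standard library is one candidate that fills the same role as PersistentTemporaryDirectory (make_convert_dir is a made-up name):

```python
import os
import tempfile


def make_convert_dir(base_dir=None):
    """Create and return a fresh temporary directory for conversion output."""
    # base_dir=None lets tempfile pick the platform default location;
    # passing a folder forces the new directory to be created inside it.
    return tempfile.mkdtemp(suffix='_temp_convert', dir=base_dir)


if __name__ == '__main__':
    tdir = make_convert_dir()
    print(os.path.join(tdir, 'temp.txt'))
```

The directory name is still randomized, which is harmless here because the code keeps the returned path and never has to guess it.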

BetterRed · 09-04-2022, 06:49 PM

I suggest @lizzie1170 tries making use of the CALIBRE_TEMP_DIR Environment variable.

BR

ownedbycats · 09-04-2022, 08:16 PM

Quote:

Originally Posted by BetterRed

I suggest @lizzie1170 tries making use of the CALIBRE_TEMP_DIR Environment variable.

BR

Unfortunately, even with that you'll still get the randomized names.

BetterRed · 09-04-2022, 11:12 PM

Quote:

Originally Posted by ownedbycats

Unfortunately, even with that you'll still get the randomized names.

Sigh - which has nowt to do with the issue I raised in post #24 - the lack of a user name after C:\\Users\\.

lizzie1170 · 09-05-2022, 02:49 AM

Quote:

Originally Posted by BetterRed

I suggest @lizzie1170 tries making use of the CALIBRE_TEMP_DIR Environment variable.

BR

One question: can I set any path for "CALIBRE_TEMP_DIR", or are there criteria for choosing this path? The code creates an EMPTY temporary folder without the text file.

BetterRed · 09-05-2022, 03:28 AM

Quote:

Originally Posted by lizzie1170

One question: can I set any path for "CALIBRE_TEMP_DIR", or are there criteria for choosing this path? The code creates an EMPTY temporary folder without the text file.

AFAIK, anywhere will do, mine is at C:\_AppData\Calibre\Temp

I don't know what should be in the text file, I don't use the Action Chains plugin. My suggestion was aimed at getting the calibre temp folder out of the specifics of the Windows ecosystem.

BR
