MobileRead Forums - View Single Post - Extract text from selected books, convert them to tags, and add them to metadata.

lizzie1170 · 08-19-2022, 03:21 AM

Quote:

Originally Posted by davidfor

You need to have a Python module called "epub_conversion.utils" available to you. You have that in your original script. Where is it coming from? It does not look like a calibre module and I cannot find "convert_epub_to_lines" in the calibre source. You will either need to add this module so that you can see it when running in calibre. Or change the code to use calibre functions. I know the Count Pages plugin does this (extract the text from an epub), so you can look at that for how to do it.

I tried to simulate the "Count Pages" code to read the content of Epubs, when I run the code it tells me: "typeerror 'function' object is not iterable"

Code:

import re
from calibre_plugins.action_chains.actions.base import ChainAction
RE_HTML_BODY = re.compile(u'<body[^>]*>(.*)</body>', re.UNICODE | re.DOTALL | re.IGNORECASE)
with open("test_dict.txt", "r") as f:
    tags_dict = f.read()

def _extract_body_text(data):
    '''Get the body text of this html content wit any html tags stripped'''
    body = RE_HTML_BODY.findall(data)

def tags_from_epub(path_to_epub):
    temp = []
    res = dict()
    for line in _extract_body_text:
        for key,value in tags_dict.items():
         if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value                
                regex = re.compile(value) 
                match_array = regex.finditer(line) 
                match_list = list(match_array)
                for m in match_list:
                    print(key, ":",m.group())
    
def run(gui, settings, chain):
    db = gui.current_db
    for book_id in chain.scope().get_book_ids():
        fmts = [ fmt.strip() for fmt in db.formats(book_id, index_is_id=True).split(',') ]
        if 'EPUB' in fmts:
            path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
            tags_from_epub(path_to_epub)