MobileRead Forums - View Single Post - Extract text from selected books, convert them to tags, and add them to metadata.

davidfor · 08-19-2022, 09:20 AM

Quote:

Originally Posted by lizzie1170

I tried to simulate the "Count Pages" code to read the content of Epubs, when I run the code it tells me: "typeerror 'function' object is not iterable"

Code:

import re
from calibre_plugins.action_chains.actions.base import ChainAction
RE_HTML_BODY = re.compile(u'<body[^>]*>(.*)</body>', re.UNICODE | re.DOTALL | re.IGNORECASE)
with open("test_dict.txt", "r") as f:
    tags_dict = f.read()

def _extract_body_text(data):
    '''Get the body text of this html content wit any html tags stripped'''
    body = RE_HTML_BODY.findall(data)

def tags_from_epub(path_to_epub):
    temp = []
    res = dict()
    for line in _extract_body_text:
        for key,value in tags_dict.items():
         if re.search(rf'{value}', line):
            if value not in temp:
                temp.append(value)
                res[key] = value                
                regex = re.compile(value) 
                match_array = regex.finditer(line) 
                match_list = list(match_array)
                for m in match_list:
                    print(key, ":",m.group())
    
def run(gui, settings, chain):
    db = gui.current_db
    for book_id in chain.scope().get_book_ids():
        fmts = [ fmt.strip() for fmt in db.formats(book_id, index_is_id=True).split(',') ]
        if 'EPUB' in fmts:
            path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
            tags_from_epub(path_to_epub)

Well, you get that error because you never actually called the function. "_extract_body_text" is a function that takes a string of some sort. But in your "for" loop you used the function object itself, as if it were the thing to iterate over.
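As a minimal sketch of the difference (not the poster's full code): the regex is the one from the quoted code, while the "return" statement, the sample HTML string and the name "extract_body_text" are assumptions added for illustration.

```python
import re

RE_HTML_BODY = re.compile(r'<body[^>]*>(.*)</body>', re.DOTALL | re.IGNORECASE)

def extract_body_text(data):
    # Assumption: a 'return' is added here so the caller gets the matches back.
    return RE_HTML_BODY.findall(data)

html = '<html><body><p>Hello</p></body></html>'  # hypothetical sample input

# Looping over the function object itself reproduces the reported error.
try:
    for line in extract_body_text:
        pass
except TypeError as err:
    error_message = str(err)

# Calling the function with a string returns a list, which can be iterated.
bodies = extract_body_text(html)
for body in bodies:
    print(body)
```

The first loop never runs its body, because Python raises the TypeError as soon as it tries to iterate over the function.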

And that doesn't look anything like what Count Pages does. It opens the epub as an iterator, then iterates through the files in the spine, extracts the text from each of them and combines them into one long chunk of text. Then it processes that. You have passed "path_to_epub" into your function, but never actually used it. In the Count Pages plugin, you need to look at statistic.py and follow the flow starting with "get_word_count".
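To illustrate the general flow (open the epub, walk the spine in order, pull the body text out of each file, join it all up), here is a rough sketch using only the Python standard library. It is not the Count Pages code, which goes through calibre's own classes. The function names "spine_paths" and "text_from_epub" are made up for this example, and it assumes an unencrypted epub with UTF-8 content files and plain (not URL-encoded) hrefs in the manifest.

```python
import posixpath
import re
import zipfile
import xml.etree.ElementTree as ET

RE_HTML_BODY = re.compile(r'<body[^>]*>(.*)</body>', re.DOTALL | re.IGNORECASE)
RE_TAG = re.compile(r'<[^>]+>')
NS = {
    'c': 'urn:oasis:names:tc:opendocument:xmlns:container',
    'opf': 'http://www.idpf.org/2007/opf',
}

def spine_paths(zf):
    '''Yield the zip paths of the content files in spine (reading) order.'''
    container = ET.fromstring(zf.read('META-INF/container.xml'))
    opf_path = container.find('.//c:rootfile', NS).get('full-path')
    opf = ET.fromstring(zf.read(opf_path))
    # Map manifest ids to hrefs, which are relative to the OPF file.
    manifest = {item.get('id'): item.get('href')
                for item in opf.findall('.//opf:manifest/opf:item', NS)}
    base = posixpath.dirname(opf_path)
    for itemref in opf.findall('.//opf:spine/opf:itemref', NS):
        href = manifest.get(itemref.get('idref'))
        if href:
            yield posixpath.join(base, href)

def text_from_epub(path_to_epub):
    '''Return the body text of every spine file as one long string.'''
    chunks = []
    with zipfile.ZipFile(path_to_epub) as zf:
        for name in spine_paths(zf):
            html = zf.read(name).decode('utf-8', errors='replace')
            for body in RE_HTML_BODY.findall(html):
                # Crude tag stripping; good enough for searching plain text.
                chunks.append(RE_TAG.sub(' ', body))
    return ' '.join(chunks)
```

With something like this, "tags_from_epub" could start with "text = text_from_epub(path_to_epub)" and then run the regex matching against "text", so that the path passed in is actually used.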