#1 |
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
|
Extract text from selected books, convert them to tags, and add them to metadata.
I wrote a Python script in VS Code that runs text searches inside an EPUB book by matching the book's text against regular expressions. The regular expressions come from patterns I formulated for the tags in my library. I already have over 400 tags this way, stored in a custom column, and I prefix each with the @ symbol to distinguish them from tags downloaded from other sources. I have 3000+ books, and I want to run all 400+ regular expressions against each of them.
** Run the code on selected books from my library (books_ids).
** Found tags are added to the metadata.
** Add a verification tag confirming that the book was processed.
Code:
    import re
    import ast
    from epub_conversion.utils import open_book, convert_epub_to_lines
    import colorama

    colorama.init()

    book = open_book("Cthulhu Mythos.epub")
    lines = convert_epub_to_lines(book)

    with open("test_dict.txt", "r") as data:
        tags_dict = ast.literal_eval(data.read())

    print(colorama.Back.YELLOW + 'Matches(regex - book text):',colorama.Style.RESET_ALL)
    temp = []
    res = dict()
    for line in lines:
        for key,value in tags_dict.items():
            if re.search(rf'{value}', line):
                if value not in temp:
                    temp.append(value)
                    res[key] = value
                    regex = re.compile(value)
                    match_array = regex.finditer(line)
                    match_list = list(match_array)
                    for m in match_list:
                        print(colorama.Fore.MAGENTA + key, ":",colorama.Style.RESET_ALL + m.group())

    print('\n',colorama.Back.YELLOW + 'Found tags:',colorama.Style.RESET_ALL)
    temp = []
    res = dict()
    for line in lines:
        for key,value in tags_dict.items():
            if re.search(rf'{value}', line):
                if value not in temp:
                    temp.append(value)
                    res[key] = value
                    print(colorama.Fore.GREEN + key, end=", ")

    print('\n\n' + colorama.Back.YELLOW + "N° found tags:",colorama.Style.RESET_ALL, len(temp))
I'd appreciate any help with the code. Thank you very much.
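The matching step described in this post can be separated from the EPUB reading and the colored output. The following is a minimal sketch, assuming the dictionary in test_dict.txt maps tag names to regex strings; the function name `find_tags` and the sample patterns and lines are hypothetical stand-ins, not the poster's data. It compiles each pattern once and stops testing a tag after its first hit.

```python
import re


def find_tags(lines, tags_dict):
    """Return {tag: first matched text} for every pattern that hits any line."""
    # Compile each pattern once instead of once per line.
    compiled = {tag: re.compile(pattern) for tag, pattern in tags_dict.items()}
    found = {}
    for line in lines:
        for tag, regex in compiled.items():
            if tag in found:
                continue  # this tag is already confirmed for the book
            match = regex.search(line)
            if match:
                found[tag] = match.group()
        if len(found) == len(compiled):
            break  # every tag matched, nothing left to look for
    return found


# Hypothetical sample data standing in for test_dict.txt and the book text.
sample_tags = {
    "cthulhu": r"\bCthulhu\b",
    "dreamlands": r"\bDream-?lands\b",
    "vampires": r"\bvampires?\b",
}
sample_lines = [
    "In his house at R'lyeh dead Cthulhu waits dreaming.",
    "He walked the Dreamlands for years.",
]

found = find_tags(sample_lines, sample_tags)
# Prefix with @ to mark tags produced by the regex pass.
library_tags = sorted("@" + tag for tag in found)
print(library_tags)  # ['@cthulhu', '@dreamlands']
```

Unlike the script above, this keeps only the first matched text per tag, which is enough to decide whether the tag applies to the book.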
#2 |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
There are a lot of plugins that act on multiple books in the library, read them somehow, and then update the metadata. I think a good example is the Count Pages plugin. It actually reads the books, extracts the text, processes it (calculates various statistics), and then updates the metadata with those values. That fits fairly well with what you are doing. I would suggest looking at that plugin and then asking questions about whatever you don't understand or still need to do.
Alternatively, it might be possible to do this using the Action Chains plugin. It might be able to run the commands, and it has ways to update the metadata for a book. I haven't explored it, so I am not sure whether it would work. Another way is to do this externally using various calibre commands, including calibredb. With that, you could use calibredb to search for the books to be changed, convert them to text outside calibre, process the text, and then use calibredb to update the metadata.
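The external route could look roughly like the sketch below. It is only an outline under stated assumptions: the custom column lookup name `#mytags` and all helper names are hypothetical, and the conversion step (for example with calibre's ebook-convert tool) is left out. The pieces shown are parsing the comma-separated id list that `calibredb search` prints, matching tags against already-extracted text, and building a `calibredb set_metadata --field` command.

```python
import re
import subprocess

CUSTOM_COLUMN = "#mytags"  # hypothetical lookup name of the custom tags column


def parse_ids(search_output):
    """Turn the comma-separated id list printed by `calibredb search` into ints."""
    return [int(tok) for tok in search_output.strip().split(",") if tok.strip()]


def match_tags(text, patterns):
    """Return the sorted tags whose regex matches anywhere in the text."""
    return sorted(tag for tag, pattern in patterns.items() if re.search(pattern, text))


def set_metadata_cmd(book_id, tags):
    """Build the calibredb command that writes the tags to the custom column."""
    value = ",".join(tags)
    return ["calibredb", "set_metadata", str(book_id), "--field", f"{CUSTOM_COLUMN}:{value}"]


def run(cmd):
    """Run a command and return its stdout (requires calibre's tools on PATH)."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Dry run with made-up values; nothing here touches a real library.
ids = parse_ids("3,17,42\n")
tags = match_tags(
    "dead Cthulhu waits dreaming",
    {"@cthulhu": r"Cthulhu", "@vampires": r"vampires?"},
)
cmd = set_metadata_cmd(ids[0], tags)
print(cmd)  # ['calibredb', 'set_metadata', '3', '--field', '#mytags:@cthulhu']
```

With a real library, the id list would come from something like `run(["calibredb", "search", "formats:EPUB"])`, and each built command would be passed to `run` once the book's text has been extracted and matched.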
#3 | |
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
|
Quote:
Edit: The chain you created will only act on books selected in the list view.
Last edited by capink; 08-15-2022 at 11:09 AM.
|
#4 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
|
Quote:
|
#5 |
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
|
No. Action Chains only works with calibre 5+.
|
#6 |
Resident Curmudgeon
Posts: 79,740
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Given that you'll miss out as more and more plugins are moving on, I highly suggest upgrading to Windows 10 if you can.
|
#7 |
Still reading
Posts: 14,010
Karma: 105092227
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
|
#8 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
|
Quote:
|
|
#9 |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Your original script imports a Python module called "epub_conversion.utils", and you need to have it available. Where is it coming from? It does not look like a calibre module, and I cannot find "convert_epub_to_lines" in the calibre source. You will need to either add this module so that it is visible when running in calibre, or change the code to use calibre functions. I know the Count Pages plugin does this (extracts the text from an epub), so you can look at that for how to do it.
|
#10 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
|
Quote:
|
|
#11 | |
Member
Posts: 12
Karma: 10
Join Date: Jul 2022
Device: none
|
Quote:
Code:
    import re
    from calibre_plugins.action_chains.actions.base import ChainAction

    RE_HTML_BODY = re.compile(u'<body[^>]*>(.*)</body>', re.UNICODE | re.DOTALL | re.IGNORECASE)

    with open("test_dict.txt", "r") as f:
        tags_dict = f.read()

    def _extract_body_text(data):
        '''Get the body text of this html content wit any html tags stripped'''
        body = RE_HTML_BODY.findall(data)

    def tags_from_epub(path_to_epub):
        temp = []
        res = dict()
        for line in _extract_body_text:
            for key,value in tags_dict.items():
                if re.search(rf'{value}', line):
                    if value not in temp:
                        temp.append(value)
                        res[key] = value
                        regex = re.compile(value)
                        match_array = regex.finditer(line)
                        match_list = list(match_array)
                        for m in match_list:
                            print(key, ":",m.group())

    def run(gui, settings, chain):
        db = gui.current_db
        for book_id in chain.scope().get_book_ids():
            fmts = [ fmt.strip() for fmt in db.formats(book_id, index_is_id=True).split(',') ]
            if 'EPUB' in fmts:
                path_to_epub = db.format_abspath(book_id, 'EPUB', index_is_id=True)
                tags_from_epub(path_to_epub)
|
#12 | |
Wizard
Posts: 1,196
Karma: 1995558
Join Date: Aug 2015
Device: Kindle
|
Quote:
|
|
#13 | |
Grand Sorcerer
Posts: 24,905
Karma: 47303824
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
And that doesn't look anything like what Count Pages does. It opens the epub as an iterator, iterates through the files in the spine, extracts the text from each of them, and combines them into one long chunk of text. Then it processes that. You have passed "path_to_epub" into your method but never actually used it. In the Count Pages plugin, you need to look at statistic.py and follow the flow starting with "get_word_count".
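To make that flow concrete, here is a standard-library sketch of the same sequence: open the epub, walk the spine in reading order, pull the body text out of each file, and join the pieces. It is not the Count Pages code and does not use calibre's iterator; `epub_text` and `RE_TAG` are hypothetical names, the tag stripping is deliberately crude, and it assumes UTF-8 content files whose manifest hrefs are not URL-encoded.

```python
import posixpath
import re
import xml.etree.ElementTree as ET
import zipfile

RE_HTML_BODY = re.compile(r"<body[^>]*>(.*)</body>", re.DOTALL | re.IGNORECASE)
RE_TAG = re.compile(r"<[^>]+>")  # crude tag stripper, good enough for a sketch
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def epub_text(path_to_epub):
    """Concatenate the body text of every spine item, in reading order."""
    chunks = []
    with zipfile.ZipFile(path_to_epub) as zf:
        # container.xml points at the OPF package file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to file names relative to the OPF.
        hrefs = {
            item.get("id"): item.get("href")
            for item in opf.iterfind(".//opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)
        # The spine lists item ids in reading order.
        for itemref in opf.iterfind(".//opf:spine/opf:itemref", NS):
            name = posixpath.join(base, hrefs[itemref.get("idref")])
            html = zf.read(name).decode("utf-8", errors="replace")
            body = RE_HTML_BODY.search(html)
            if body:
                chunks.append(RE_TAG.sub(" ", body.group(1)))
    return "\n".join(chunks)
```

The returned string can then be passed to the regex matching in place of the lines produced by the original script, with path_to_epub finally being used.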
|
#14 |
Resident Curmudgeon
Posts: 79,740
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
The reason I didn't mention Windows 11 is that a lot of computers that ran Windows 7 back when it was new probably cannot be upgraded to Windows 11, due to the security features it requires in the CPU.
|
#15 |
null operator (he/him)
Posts: 21,717
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
I can't run Windows 11 on my Dell XPS 8920 (i7-7700 @ 3.6 GHz 16 GB RAM) that came with Windows 10 in 2017.
|