06-16-2021, 11:26 PM | #1 |
Evangelist
Posts: 413
Karma: 2666666
Join Date: Nov 2020
Device: none
|
Failed to extract text from gutenberg books
I get the following error on some Project Gutenberg books when I call "MobiReader.extract_text()". For example, both Kindle editions of Alice's Adventures in Wonderland at https://www.gutenberg.org/ebooks/11 trigger it.
Code:
mobiReader.extract_text()
File "calibre/ebooks/mobi/reader/mobi6.py", line 802, in extract_text
File "calibre/ebooks/mobi/reader/mobi6.py", line 802, in <listcomp>
File "calibre/ebooks/mobi/reader/mobi6.py", line 797, in text_section
File "calibre/ebooks/mobi/reader/mobi6.py", line 787, in sizeof_trailing_entries
TypeError: ord() expected a character, but string of length 0 found
|
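For context, Python 3 raises this kind of TypeError when `ord()` is handed an empty bytes slice, which happens when code slices a record at a position where no byte exists. The snippet below is only a minimal sketch of that failure mode. The helper `trailing_byte` is a hypothetical name for illustration and is not calibre's actual implementation of `sizeof_trailing_entries`.

```python
def trailing_byte(data: bytes, size: int) -> int:
    """Return the value of the byte just before position `size`.

    Hypothetical helper: it mimics a backward scan over a record and is
    NOT taken from calibre's source.
    """
    # Slicing never raises; it yields b"" when there is no byte to read.
    return ord(data[size - 1:size])


# A non-empty record: the slice holds exactly one byte, so ord() succeeds.
print(trailing_byte(b"\x01\x02\x83", 3))

# An empty record: the slice is b"", so ord() raises TypeError.
try:
    trailing_byte(b"", 0)
except TypeError as exc:
    print(type(exc).__name__, "-", exc)
```

If the reader is pointed at records it does not expect, a scan like this can run out of bytes, which would be consistent with the traceback above.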
06-16-2021, 11:32 PM | #2 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Converting pg11.mobi works fine for me, and that uses the same code.
|
06-16-2021, 11:40 PM | #3 |
Evangelist
Posts: 413
Karma: 2666666
Join Date: Nov 2020
Device: none
|
I forgot to mention that I also run the following code before calling "extract_text()":
Code:
with open(book_path, 'r+b') as f:
    mu = MetadataUpdater(f)
    mu.update(mi, asin="BBJH94AM2L")
|
06-17-2021, 12:43 AM | #4 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Doing
ebook-meta -t XXX pg11.mobi && ebook-convert pg11.mobi .epub
also works fine for me.
|
06-17-2021, 01:18 AM | #5 |
Evangelist
Posts: 413
Karma: 2666666
Join Date: Nov 2020
Device: none
|
I'm not sure which part of the plugin code causes the error, but here are the steps to reproduce it:
Only `MetadataUpdater.update(mi, asin=asin)` changes the book file. I also add the "mobi-asin" identifier to the book metadata. The plugin code:
|
06-17-2021, 01:41 AM | #6 |
Evangelist
Posts: 413
Karma: 2666666
Join Date: Nov 2020
Device: none
|
I moved the code into a standalone file:
Code:
#!/usr/bin/env python3
from calibre.library import db
from calibre.utils.logging import default_log
from calibre.ebooks.mobi.reader.mobi6 import MobiReader

lib_db = db('~/Calibre Library').new_api
alice_id = 0
for book_id in lib_db.all_book_ids():
    mi = lib_db.get_metadata(book_id)
    if mi.get('title') == "Alice's Adventures in Wonderland":
        alice_id = book_id
        break
book_path = lib_db.format_abspath(alice_id, 'MOBI')
mobiReader = MobiReader(book_path, default_log)
mobiReader.extract_text()
Last edited by xxyzz; 06-17-2021 at 01:45 AM.
|
06-17-2021, 02:46 AM | #7 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's a joint MOBI file; you can't extract it like that. See plugins/mobi_input.py for how to do it.
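As a rough illustration of that advice, here is a minimal sketch of checking for a KF8 (joint or standalone) book before extracting. It is based on my recollection of what plugins/mobi_input.py does and is not a copy of it: the `kf8_type` attribute and the `Mobi8Reader(reader, log)()` call should be verified against your calibre version, and `needs_kf8_reader` and `extract_kf8_aware` are hypothetical names.

```python
import os


def needs_kf8_reader(kf8_type):
    """Hypothetical helper: True when the book carries KF8 content.

    Assumption: MobiReader.kf8_type is None for a plain MOBI 6 book and a
    string such as 'joint' or 'standalone' otherwise.
    """
    return kf8_type is not None


def extract_kf8_aware(book_path, log):
    """Extract a MOBI book, handing KF8 books to Mobi8Reader.

    Returns the OPF path for KF8 books and None for plain MOBI 6 books.
    Must run inside calibre's Python environment (e.g. via calibre-debug).
    """
    # Imports are deferred so this file can be loaded outside calibre.
    from calibre.ebooks.mobi.reader.mobi6 import MobiReader

    reader = MobiReader(book_path, log)
    if needs_kf8_reader(reader.kf8_type):
        from calibre.ebooks.mobi.reader.mobi8 import Mobi8Reader

        # Assumption: calling the Mobi8Reader instance extracts the book
        # into the current working directory and returns the OPF path.
        return os.path.abspath(Mobi8Reader(reader, log)())

    # Plain MOBI 6 book: the direct call from the original script.
    reader.extract_text()
    return None
```

With the script from post #6, this would be called as `extract_kf8_aware(book_path, default_log)`, ideally from a temporary working directory, since the KF8 branch is assumed to write files to the current one.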
|
06-17-2021, 05:45 AM | #8 |
Evangelist
Posts: 413
Karma: 2666666
Join Date: Nov 2020
Device: none
|
I wasn't aware of this type of book. I should have read the code more carefully. Thanks!
|