|
|
#1 |
|
Evangelist
Posts: 448
Karma: 3000000
Join Date: Nov 2020
Device: none
|
Failed to extract text from gutenberg books
I get the following error on some Project Gutenberg books when I call "MobiReader.extract_text()". For example, both Kindle ebooks of Alice's Adventures in Wonderland at https://www.gutenberg.org/ebooks/11 trigger this error.
Code:
mobiReader.extract_text()
  File "calibre/ebooks/mobi/reader/mobi6.py", line 802, in extract_text
  File "calibre/ebooks/mobi/reader/mobi6.py", line 802, in <listcomp>
  File "calibre/ebooks/mobi/reader/mobi6.py", line 797, in text_section
  File "calibre/ebooks/mobi/reader/mobi6.py", line 787, in sizeof_trailing_entries
TypeError: ord() expected a character, but string of length 0 found
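For readers unfamiliar with this TypeError: it is what Python 3 raises when `ord()` receives an empty string or empty bytes object, which can happen when code slices one byte from the end of a record that turns out to be empty. The sketch below shows that failure mode with a hypothetical helper, `read_trailing_size`, that decodes a backward variable-width integer. It is an illustration under that assumption, not calibre's actual `sizeof_trailing_entries` code.

```python
def read_trailing_size(record: bytes) -> int:
    """Decode a size stored backward from the end of `record`, 7 bits per byte.

    Hypothetical helper for illustration only; not calibre's implementation.
    """
    pos, shift, value = len(record), 0, 0
    while True:
        # Slicing never raises: on an empty record it simply yields b''.
        byte = ord(record[pos - 1:pos])  # TypeError if the slice is empty
        value |= (byte & 0x7F) << shift
        shift += 7
        pos -= 1
        # A set high bit marks the final byte of the number.
        if byte & 0x80 or shift >= 28 or pos == 0:
            return value
```

Calling `read_trailing_size(b'')` fails inside `ord()`, so an error like the one in the traceback may indicate that the reader was handed a record it did not expect to be empty.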
|
|
|
|
|
#2 |
|
creator of calibre
Posts: 45,610
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Converting pg11.mobi works fine for me, and conversion uses that same code.
|
|
|
|
#3 |
|
Evangelist
Posts: 448
Karma: 3000000
Join Date: Nov 2020
Device: none
|
I forgot to mention that I also run the following code before calling "extract_text()":
Code:
from calibre.ebooks.metadata.mobi import MetadataUpdater

with open(book_path, 'r+b') as f:
    mu = MetadataUpdater(f)
    mu.update(mi, asin="BBJH94AM2L")
|
|
|
|
|
|
#4 |
|
creator of calibre
Posts: 45,610
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Doing
ebook-meta -t XXX pg11.mobi && ebook-convert pg11.mobi .epub
also works fine for me.
|
|
|
|
|
#5 |
|
Evangelist
Posts: 448
Karma: 3000000
Join Date: Nov 2020
Device: none
|
I'm not sure which part of the plugin code causes the error, but here are the steps to reproduce it:
Only `MetadataUpdater.update(mi, asin=asin)` changes the book file; I also add the "mobi-asin" identifier to the book metadata. The plugin code:
|
|
|
|
#6 |
|
Evangelist
Posts: 448
Karma: 3000000
Join Date: Nov 2020
Device: none
|
I moved the code to a file:
Code:
#!/usr/bin/env python3
from calibre.library import db
from calibre.utils.logging import default_log
from calibre.ebooks.mobi.reader.mobi6 import MobiReader

lib_db = db('~/Calibre Library').new_api

# Find the book id of the Gutenberg copy of Alice in the library.
alice_id = 0
for book_id in lib_db.all_book_ids():
    mi = lib_db.get_metadata(book_id)
    if mi.get('title') == "Alice's Adventures in Wonderland":
        alice_id = book_id
        break

# Open the MOBI format of that book and try to extract its text.
book_path = lib_db.format_abspath(alice_id, 'MOBI')
mobiReader = MobiReader(book_path, default_log)
mobiReader.extract_text()
Last edited by xxyzz; 06-17-2021 at 02:45 AM.
|
|
|
|
|
#7 |
|
creator of calibre
Posts: 45,610
Karma: 28549044
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's a joint MOBI file; you can't extract it like that. See plugins/mobi_input.py for how to do it.
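As background for anyone landing here: a joint file packs an older MOBI 6 book and a KF8 book into one container, and the two halves are commonly separated by a record whose whole content is the bytes BOUNDARY. The sketch below is a rough standalone heuristic for spotting such a file before trying to extract it. It assumes the usual PalmDB layout (record count at byte 76, 8-byte record entries from byte 78). The function names `palmdb_records` and `looks_joint` are made up for this example and are not calibre APIs; plugins/mobi_input.py remains the reference for how calibre handles it.

```python
import struct


def palmdb_records(data: bytes) -> list:
    """Split a PalmDB container (the outer format of MOBI files) into records."""
    # Assumed layout: big-endian uint16 record count at offset 76.
    (count,) = struct.unpack_from('>H', data, 76)
    # Each 8-byte record entry starts with a big-endian uint32 file offset.
    offsets = [struct.unpack_from('>L', data, 78 + 8 * i)[0] for i in range(count)]
    offsets.append(len(data))  # the last record runs to the end of the file
    return [data[start:end] for start, end in zip(offsets, offsets[1:])]


def looks_joint(data: bytes) -> bool:
    """Heuristic: True if some record consists solely of b'BOUNDARY'."""
    return any(record == b'BOUNDARY' for record in palmdb_records(data))
```

Usage would be along the lines of `looks_joint(open(book_path, 'rb').read())`. A True result suggests the file has a KF8 half that needs the joint-file handling mentioned above.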
|
|
|
|
|
|
#8 |
|
Evangelist
Posts: 448
Karma: 3000000
Join Date: Nov 2020
Device: none
|
I wasn't aware of this type of book; I should read the code more carefully. Thanks!
|
|
|
|