Failed to extract text from gutenberg books

xxyzz · 06-16-2021, 11:26 PM

I get the following error on some gutenberg books when I call "MobiReader.extract_text()". For example, both Kindle ebooks of Alice's Adventures in Wonderland at https://www.gutenberg.org/ebooks/11 will cause this error.

Code:

 mobiReader.extract_text()
  File "calibre/ebooks/mobi/reader/mobi6.py", line 802, in extract_text
  File "calibre/ebooks/mobi/reader/mobi6.py", line 802, in <listcomp>
  File "calibre/ebooks/mobi/reader/mobi6.py", line 797, in text_section
  File "calibre/ebooks/mobi/reader/mobi6.py", line 787, in sizeof_trailing_entries
TypeError: ord() expected a character, but string of length 0 found

KindleUnpack doesn't have this issue. I found similar code at https://github.com/kevinhendricks/Ki...r.py#L816-L830 but don't know how to fix the bug.
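
For readers unfamiliar with this kind of TypeError, here is a minimal sketch of how it can arise in plain Python. It is not calibre's code: `last_byte_value` and `describe` are hypothetical helpers, and the sketch only assumes that the failing function slices a single byte out of a record and passes it to `ord()`. Slicing an empty byte string never raises an error by itself; it silently returns `b''`, and `ord()` then rejects the empty value.

```python
def last_byte_value(record: bytes, size: int) -> int:
    """Return the integer value of the byte at position size - 1.

    Hypothetical helper for illustration only; not calibre's implementation.
    """
    # Out-of-range slices do not raise IndexError; they return b''.
    # ord() then fails because it needs exactly one byte.
    return ord(record[size - 1:size])


def describe(record: bytes) -> str:
    """Return the last byte value as text, or the TypeError message."""
    try:
        return str(last_byte_value(record, len(record)))
    except TypeError as err:
        return f"TypeError: {err}"


print(describe(b"\x01\x02\x83"))  # non-empty record: ord() receives one byte
print(describe(b""))              # empty record: ord() receives b'' and raises
```

If the real code follows a similar pattern, the traceback would suggest that one of the records handed to `sizeof_trailing_entries` was empty, but that is an inference from the error message rather than something confirmed in this thread.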

kovidgoyal · 06-16-2021, 11:32 PM

Converting pg11.mobi works fine for me, and that uses the same code.

xxyzz · 06-16-2021, 11:40 PM

I forgot to mention that I also run the following code before calling "extract_text()":

Code:

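from calibre.ebooks.metadata.mobi import MetadataUpdater

# Open in read/write binary mode because the updater changes the book file
with open(book_path, 'r+b') as f:
    mu = MetadataUpdater(f)
    mu.update(mi, asin="BBJH94AM2L")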

kovidgoyal · 06-17-2021, 12:43 AM

Doing

ebook-meta -t XXX pg11.mobi && ebook-convert pg11.mobi .epub

also works fine for me.

xxyzz · 06-17-2021, 01:18 AM

I'm not sure which part of the plugin code causes the error, but here are the steps to reproduce it:

add the mobi book to calibre
install WordDumb plugin
use this plugin on the book

Only `MetadataUpdater.update(mi, asin=asin)` changes the book file. I also add the "mobi-asin" identifier to the book metadata.

The plugin code:

update ASIN: https://github.com/xxyzz/WordDumb/bl...ata.py#L33-L44
use extract_text(): https://github.com/xxyzz/WordDumb/bl...job.py#L56-L63

xxyzz · 06-17-2021, 01:41 AM

I moved the code to a file:

Code:

#!/usr/bin/env python3

from calibre.library import db
from calibre.utils.logging import default_log
from calibre.ebooks.mobi.reader.mobi6 import MobiReader

lib_db = db('~/Calibre Library').new_api
alice_id = 0
for book_id in lib_db.all_book_ids():
    mi = lib_db.get_metadata(book_id)
    if mi.get('title') == "Alice's Adventures in Wonderland":
        alice_id = book_id
        break

book_path = lib_db.format_abspath(alice_id, 'MOBI')
mobiReader = MobiReader(book_path, default_log)
mobiReader.extract_text()

Running this code with calibre-debug should reproduce the error.

kovidgoyal · 06-17-2021, 02:46 AM

That's a joint MOBI file; you can't extract it like that. See plugins/mobi_input.py for how to do it.

xxyzz · 06-17-2021, 05:45 AM

I wasn't aware of this type of book. I should read the code more carefully. Thanks!
