Old 11-26-2025, 06:52 PM   #1
Trester99
Member
Trester99 began at the beginning.
 
Posts: 17
Karma: 10
Join Date: Jun 2024
Device: Kindle Paperwhite
Python Scripting Help: Safely extracting first 2 images in reading order (Polish/Cont

Hi everyone,

I am writing a Python script (running via calibre-debug -e) to identify books in my library that have duplicate covers (e.g. a cover.jpg wrapper followed immediately by an identical titlepage.jpg inside the text).

My goal is to extract the first two images that the user would see when opening the book, in strict reading order.

I am using calibre.ebooks.oeb.polish.container.get_container to access the book.

The Problem: I initially tried accessing container.spine_names or parsing the OPF using lxml to get the reading order. However, I kept running into AttributeError: 'lxml.etree._Element' object has no attribute 'split' errors, likely due to some environment conflict or how I'm handling the object returned by the container.

I switched to regex: one pass over the OPF text to parse the <spine> and get the ordered list of HTML files, then a second pass over each file to find <img src> tags.

The Issue: This approach is "leaky." Sometimes I pick up images from the back of the book (like "Also by this Author" thumbnails) because some books are not split into granular files (e.g. a few large HTML files hold everything).

My Question: Is there a robust, "Calibre-native" way to ask the Container object for:

1. The spine items in linear reading order?
2. The image assets referenced by those spine items, in the order they appear visually?

I want to reliably say "Get me the very first image rendered in the book, and the very next distinct image rendered after that," regardless of file structure.

Any pointers on the correct API calls to avoid manual XML/HTML parsing would be appreciated!
Old 11-26-2025, 11:24 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 45,656
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
container.spine_names
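For the second half of the question (images in visual order), one possible approach is to walk each spine file's parsed tree in document order. The sketch below is an illustration under stated assumptions, not something confirmed in this thread: it assumes container.spine_names yields (name, is_linear) pairs, and that container.parsed(name), container.href_to_name(href, base) and container.raw_data(name, decode=False) behave as they do on calibre's polish container. The helper names iter_image_names, first_image_names and looks_like_duplicate_cover are hypothetical. Document order only approximates rendered order, since CSS can hide or reposition images, and CSS background images are not detected at all.

```python
import hashlib

XLINK_HREF = '{http://www.w3.org/1999/xlink}href'


def iter_image_names(container):
    """Yield referenced image names in spine order, then document order.

    Assumption: container.spine_names yields (name, is_linear) pairs.
    """
    for name, is_linear in container.spine_names:
        if not is_linear:
            continue  # ignore linear="no" spine items
        root = container.parsed(name)
        for elem in root.iter():
            tag = elem.tag
            if not isinstance(tag, str):
                continue  # comments and processing instructions
            local = tag.rpartition('}')[2].lower()  # strip the namespace
            if local == 'img':
                href = elem.get('src')
            elif local == 'image':  # SVG <image>, common in cover wrappers
                href = elem.get(XLINK_HREF) or elem.get('href')
            else:
                continue
            if href:
                # Resolve the href relative to the file that contains it
                yield container.href_to_name(href, name)


def first_image_names(container, count=2):
    """Return the first `count` distinct image names in reading order."""
    seen = []
    for image_name in iter_image_names(container):
        if image_name and image_name not in seen:
            seen.append(image_name)
            if len(seen) == count:
                break
    return seen


def looks_like_duplicate_cover(container):
    """True if the first two distinct image files are byte-identical."""
    names = first_image_names(container, 2)
    if len(names) < 2:
        return False
    digests = [hashlib.sha256(container.raw_data(n, decode=False)).digest()
               for n in names]
    return digests[0] == digests[1]


# Hypothetical usage inside a calibre-debug -e script:
#   from calibre.ebooks.oeb.polish.container import get_container
#   container = get_container('/path/to/book.epub')
#   print(first_image_names(container), looks_like_duplicate_cover(container))
```

Comparing SHA-256 digests only catches exact copies; a cover that was resized or re-encoded for the title page would need an image-similarity check instead.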
Old Yesterday, 08:44 PM   #3
Trester99
Thank you for such a quick response! We had actually tried container.spine_names initially but made the mistake of indexing it like a list (spine[0]) before converting it to one. It is working perfectly now!
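For anyone landing here with the same problem, the fix could look roughly like the sketch below. It assumes, in line with the post above, that container.spine_names is an iterator rather than a list, and additionally that each item it yields is a (name, is_linear) pair. The helper name first_spine_names is hypothetical.

```python
from itertools import islice


def first_spine_names(spine_names, count=2, linear_only=True):
    """Return the first `count` file names from (name, is_linear) pairs.

    Works on any iterable, including a generator that cannot be indexed.
    """
    names = (name for name, is_linear in spine_names
             if is_linear or not linear_only)
    return list(islice(names, count))


# Hypothetical usage inside a calibre-debug -e script:
#   from calibre.ebooks.oeb.polish.container import get_container
#   container = get_container('/path/to/book.epub')
#   spine = list(container.spine_names)     # materialise before indexing
#   first_name, first_is_linear = spine[0]
#   print(first_spine_names(container.spine_names))
```

Using islice stops after the requested number of items, so the whole spine does not have to be materialised when only the first few files matter.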
All times are GMT -4. The time now is 05:19 PM.

