Python Scripting Help: Safely extracting first 2 images in reading order (Polish/Cont

Trester99 · 11-26-2025, 06:52 PM

Hi everyone,

I am writing a Python script (running via calibre-debug -e) to identify books in my library that have duplicate covers (e.g. a cover.jpg wrapper followed immediately by an identical titlepage.jpg inside the text).

My goal is to extract the first two images that the user would see when opening the book, in strict reading order.

I am using calibre.ebooks.oeb.polish.container.get_container to access the book.

The Problem: I initially tried accessing container.spine_names or parsing the OPF using lxml to get the reading order. However, I kept running into AttributeError: 'lxml.etree._Element' object has no attribute 'split' errors, likely due to some environment conflict or how I'm handling the object returned by the container.

I switched to using regex to manually parse the <spine> element in the OPF text to get the ordered list of HTML files, then regex again to find <img src> tags.

The Issue: This approach is "leaky." Sometimes I pick up images from the back of the book (like "Also by this Author" thumbnails) because the book structure isn't granular (e.g. large HTML files).

My Question: Is there a robust, "Calibre-native" way to ask the Container object for:

1. The spine items in linear reading order.
2. The image assets referenced by those spine items, in the order they appear visually?

I want to reliably say "Get me the very first image rendered in the book, and the very next distinct image rendered after that," regardless of file structure.

Any pointers on the correct API calls to avoid manual XML/HTML parsing would be appreciated!

kovidgoyal · 11-26-2025, 11:24 PM

container.spine_names

Trester99 · Yesterday, 08:44 PM

Thank you for such a quick response! We had actually tried container.spine_names initially but made the mistake of treating it like a list (spine[0]) before converting it. It is working perfectly now!
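
For anyone landing here later, a rough sketch of the approach discussed above. It assumes that container.spine_names yields (name, is_linear) pairs lazily, which is why it has to be turned into a list before indexing, and that container.parsed, container.href_to_name, container.has_name and container.raw_data behave as in calibre's polish Container. The helper names (linear_spine_names, image_hrefs, first_distinct_images, same_image_bytes) are made up for illustration. Document order in the markup is only an approximation of visual order, since CSS can move things around.

```python
import hashlib

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def linear_spine_names(container):
    # spine_names is iterated, not indexed: build a real list first.
    return [name for name, is_linear in container.spine_names if is_linear]


def image_hrefs(root):
    # Walk the parsed document in document order and yield image references.
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # lxml comments / processing instructions have no string tag
        local = el.tag.rpartition("}")[2].lower()
        if local == "img":
            href = el.get("src")
        elif local == "image":  # SVG cover wrappers use <image xlink:href="...">
            href = el.get(XLINK_HREF) or el.get("href")
        else:
            continue
        if href:
            yield href


def first_distinct_images(container, limit=2):
    # Return up to `limit` distinct image names, in spine and document order.
    found = []
    for name in linear_spine_names(container):
        root = container.parsed(name)
        for href in image_hrefs(root):
            img_name = container.href_to_name(href, name)
            if img_name and img_name not in found and container.has_name(img_name):
                found.append(img_name)
                if len(found) == limit:
                    return found
    return found


def same_image_bytes(container, name_a, name_b):
    # Byte-for-byte comparison via SHA-256; re-encoded copies will not match.
    def digest(name):
        return hashlib.sha256(container.raw_data(name, decode=False)).digest()
    return digest(name_a) == digest(name_b)


# Intended use under calibre-debug -e (not run here; the path is a placeholder):
#   from calibre.ebooks.oeb.polish.container import get_container
#   c = get_container("/path/to/book.epub")
#   imgs = first_distinct_images(c)
#   if len(imgs) == 2 and same_image_bytes(c, *imgs):
#       print("duplicate cover:", imgs)
```

Stopping as soon as two distinct names are found is what keeps back-matter thumbnails out of the result, even when the whole book sits in one large HTML file. Hash comparison only catches identical files; a resized or recompressed title page would need a perceptual comparison instead.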



