MobileRead Forums - View Single Post - Python Scripting Help: Safely extracting first 2 images in reading order (Polish/Cont

Trester99 · 11-26-2025, 06:52 PM

Hi everyone,

I am writing a Python script (running via calibre-debug -e) to identify books in my library that have duplicate covers (e.g. a cover.jpg wrapper followed immediately by an identical titlepage.jpg inside the text).

My goal is to extract the first two images that the user would see when opening the book, in strict reading order.

I am using calibre.ebooks.oeb.polish.container.get_container to access the book.

The Problem: I initially tried accessing container.spine_names or parsing the OPF using lxml to get the reading order. However, I kept running into AttributeError: 'lxml.etree._Element' object has no attribute 'split' errors, likely due to some environment conflict or how I'm handling the object returned by the container.

I switched to using regular expressions to manually parse the <spine> element in the OPF text and get the ordered list of HTML files, then a second regex to find <img src> tags in each file.
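For reference, the spine-reading step can be sketched with the standard library's XML parser instead of regex. This is only an illustration, not calibre's API: the function name `linear_spine_hrefs` and the `opf_path` parameter are hypothetical, and it assumes the OPF uses the standard `http://www.idpf.org/2007/opf` namespace.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# Standard OPF package namespace (assumed to be the one used by the book).
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def linear_spine_hrefs(opf_text, opf_path="content.opf"):
    """Return paths of linear spine items in reading order.

    Paths are resolved relative to the folder that holds the OPF file,
    so they can be used to look up entries inside the unpacked book.
    """
    root = ET.fromstring(opf_text)

    # Map manifest id -> href so each spine itemref can be resolved.
    id_to_href = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }

    base = posixpath.dirname(opf_path)
    hrefs = []
    for ref in root.findall("opf:spine/opf:itemref", OPF_NS):
        # Items marked linear="no" are outside the main reading order.
        if ref.get("linear", "yes") == "no":
            continue
        href = id_to_href.get(ref.get("idref"))
        if href:
            # hrefs are URL-encoded in the OPF; decode and normalise them.
            hrefs.append(posixpath.normpath(posixpath.join(base, unquote(href))))
    return hrefs
```

Using a real XML parser here avoids the usual regex pitfalls (attribute order, quoting style, line breaks inside tags), but it still does not say anything about which images those files reference.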

The Issue: This approach is "leaky." Sometimes I pick up images from the back of the book (like "Also by this Author" thumbnails) because the book structure isn't granular (e.g. large HTML files).

My Question: Is there a robust, "Calibre-native" way to ask the Container object for:

1. The spine items in linear reading order.
2. The image assets referenced by those spine items, in the order they appear visually?

I want to reliably say "Get me the very first image rendered in the book, and the very next distinct image rendered after that," regardless of file structure.
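To make the requirement concrete, here is a rough standard-library sketch of the behaviour I am after. It is not a calibre API: `ImageRefCollector`, `first_distinct_images` and the `read_bytes` callback are hypothetical names. It assumes the HTML files are already available in spine order, and it treats document order as a stand-in for visual order (CSS could in principle reorder things).

```python
import hashlib
import posixpath
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag


class ImageRefCollector(HTMLParser):
    """Collect image references in document order."""

    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            src = attrs.get("src")
        elif tag == "image":
            # SVG <image>, often used by cover wrapper pages.
            src = attrs.get("xlink:href") or attrs.get("href")
        else:
            src = None
        # Skip inline data URIs and remote images.
        if src and not src.startswith("data:") and "://" not in src:
            self.refs.append(src)


def first_distinct_images(docs, read_bytes, limit=2):
    """Return up to `limit` (path, sha256) pairs for the first distinct images.

    docs: iterable of (doc_path, html_text) pairs, already in spine order.
    read_bytes: callable mapping an image path to its raw bytes (hypothetical hook).
    "Distinct" means a different file path; equal digests then flag
    byte-identical copies such as a repeated cover.
    """
    seen_paths = set()
    result = []
    for doc_path, html_text in docs:
        parser = ImageRefCollector()
        parser.feed(html_text)
        parser.close()
        for ref in parser.refs:
            href = unquote(urldefrag(ref).url)
            # Resolve the reference relative to the HTML file that contains it.
            path = posixpath.normpath(
                posixpath.join(posixpath.dirname(doc_path), href)
            )
            if path in seen_paths:
                continue
            seen_paths.add(path)
            digest = hashlib.sha256(read_bytes(path)).hexdigest()
            result.append((path, digest))
            if len(result) == limit:
                return result
    return result
```

Because the search stops as soon as two distinct images are found, thumbnails at the back of a large HTML file are never reached. If the two digests match, the book has a byte-identical duplicate cover; resized or re-encoded copies would need a perceptual comparison instead.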

Any pointers on the correct API calls to avoid manual XML/HTML parsing would be appreciated!
