#1
Member
Posts: 17
Karma: 10
Join Date: Jun 2024
Device: Kindle Paperwhite
Python Scripting Help: Safely extracting first 2 images in reading order (Polish/Cont
Hi everyone,
I am writing a Python script (running via calibre-debug -e) to identify books in my library that have duplicate covers (e.g. a cover.jpg wrapper followed immediately by an identical titlepage.jpg inside the text). My goal is to extract the first two images that the user would see when opening the book, in strict reading order. I am using calibre.ebooks.oeb.polish.container.get_container to access the book.

The Problem: I initially tried accessing container.spine_names or parsing the OPF using lxml to get the reading order. However, I kept running into AttributeError: 'lxml.etree._Element' object has no attribute 'split' errors, likely due to some environment conflict or how I'm handling the object returned by the container. I switched to using regex to manually parse the <spine> in the OPF text to get the ordered list of HTML files, then regex again to find <img src> tags.

The Issue: This approach is "leaky." Sometimes I pick up images from the back of the book (like "Also by this Author" thumbnails) because the book structure isn't granular (e.g. large HTML files).

My Question: Is there a robust, "Calibre-native" way to ask the Container object for:
- The spine items in linear reading order.
- The image assets referenced by those spine items, in the order they appear visually.

I want to reliably say "Get me the very first image rendered in the book, and the very next distinct image rendered after that," regardless of file structure.

Any pointers on the correct API calls to avoid manual XML/HTML parsing would be appreciated!
#2
creator of calibre
Posts: 45,656
Karma: 28549046
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
container.spine_names
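
One detail worth spelling out about this property: in calibre's Container, spine_names is a generator that yields (name, is_linear) pairs in spine order, so it has to be looped over or passed to list() before it can be indexed. The sketch below uses a hypothetical FakeContainer stand-in (not part of calibre) so that it runs without calibre installed; linear_spine is likewise just an illustrative helper name.

```python
# Stand-in for calibre's Container so this sketch runs without calibre.
# Assumption: like the real property, spine_names lazily yields
# (name, is_linear) pairs in spine order.
class FakeContainer:
    def __init__(self, spine):
        self._spine = spine

    @property
    def spine_names(self):
        for name, is_linear in self._spine:
            yield name, is_linear


def linear_spine(container):
    """Return the names of linear spine items as a real list, in reading order."""
    return [name for name, is_linear in container.spine_names if is_linear]


container = FakeContainer([
    ("cover.xhtml", True),
    ("notes.xhtml", False),  # linear="no": outside the main reading flow
    ("titlepage.xhtml", True),
])

try:
    container.spine_names[0]  # a generator cannot be indexed
except TypeError as err:
    print("indexing failed:", err)

names = linear_spine(container)  # a list, so indexing now works
print(names[0], names[1])
```

With a real book, container would instead come from get_container(path_to_book), as in the first post. Skipping non-linear items is a design choice here; drop the is_linear filter if those files should count too.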
#3
Member
Posts: 17
Karma: 10
Join Date: Jun 2024
Device: Kindle Paperwhite
Thank you for such a quick response! We had actually tried container.spine_names initially but made the mistake of treating it like a list (spine[0]) before converting it. It is working perfectly now!
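
For anyone landing here with the same goal, here is a minimal sketch of how the "first two distinct images" walk could look once spine_names is iterated properly. It is not the script from this thread. It assumes document order is a good enough proxy for visual order (CSS backgrounds and CSS reordering are not seen), and it treats "distinct" as "different file name", not "different pixels". It relies on container.parsed(name) returning an element tree and container.href_to_name(href, base) resolving links; the None return for external links is an assumption to verify against calibre. StubContainer and first_images are hypothetical names, and the stub exists only so the example runs without calibre.

```python
import posixpath
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def first_images(container, limit=2):
    """Return up to `limit` distinct image names, in linear reading order."""
    found = []
    for name, is_linear in container.spine_names:
        if not is_linear:
            continue
        root = container.parsed(name)
        for el in root.iter():  # document-order walk of the page
            if not isinstance(el.tag, str):  # lxml comments / PIs
                continue
            local = el.tag.rpartition("}")[2]  # strip the XML namespace
            if local == "img":
                href = el.get("src")
            elif local == "image":  # SVG cover wrappers use <image>
                href = el.get(XLINK_HREF) or el.get("href")
            else:
                continue
            if not href:
                continue
            target = container.href_to_name(href, name)
            if target and target not in found:
                found.append(target)
                if len(found) == limit:
                    return found
    return found


class StubContainer:
    """Hypothetical stand-in exposing only the three members used above."""

    def __init__(self, docs):
        self.docs = docs  # {name: (is_linear, xhtml_source)} in spine order

    @property
    def spine_names(self):
        for name, (is_linear, _) in self.docs.items():
            yield name, is_linear

    def parsed(self, name):
        return ET.fromstring(self.docs[name][1])

    def href_to_name(self, href, base=None):
        if "://" in href:  # external link: no file inside the book
            return None
        base_dir = posixpath.dirname(base or "")
        return posixpath.normpath(posixpath.join(base_dir, href))


NS = "http://www.w3.org/1999/xhtml"
book = StubContainer({
    "text/cover.xhtml": (True, f'<html xmlns="{NS}"><body>'
                               '<img src="../images/cover.jpg"/></body></html>'),
    "text/ad.xhtml": (False, f'<html xmlns="{NS}"><body>'
                             '<img src="../images/ad.jpg"/></body></html>'),
    "text/title.xhtml": (True, f'<html xmlns="{NS}"><body>'
                               '<img src="../images/cover.jpg"/>'
                               '<p><img src="../images/titlepage.jpg"/></p>'
                               '<img src="../images/logo.png"/></body></html>'),
})
result = first_images(book)
print(result)
```

Because the walk stops as soon as two distinct names are found, thumbnails at the back of a large HTML file are never reached. To confirm that the two files really are duplicate covers, their bytes (for example via container.raw_data) would still need to be compared or hashed separately.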