12-26-2020, 12:47 AM   #30
twynn92
Junior Member
Posts: 8
Karma: 1000
Join Date: Dec 2020
Device: none
Quote:
Originally Posted by twynn92
I am running into the fact that the KF8 text refers to one more image than the KFX, so will have to take a look at the surrounding text to see why that is
Whoops. That was an error on my part: I did not account for the fact that not all images use the img tag. I've since fixed that by switching to a far less restrictive search pattern that matches the image file name rather than the tag, and the references in both versions now align exactly:
KF8 (from Kindle Unpack-created EPUB): "Images/[^\.]+\.\w+"
KFX (from KFX Input-generated EPUB): "image_[^\.]+\.\w+"
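For anyone who would rather script this than run the search in an editor, here is a minimal sketch of applying the two patterns above with Python's re module. The patterns are the ones quoted in this post; the sample markup and file names are made up for illustration, and image_refs is just a hypothetical helper name.

```python
import re

# Patterns quoted from the post above.
KF8_PATTERN = re.compile(r"Images/[^\.]+\.\w+")
KFX_PATTERN = re.compile(r"image_[^\.]+\.\w+")


def image_refs(html: str, pattern: re.Pattern) -> list:
    """Return every image reference in document order, duplicates included."""
    return pattern.findall(html)


# Hypothetical sample markup: one <img> tag and one SVG <image> element each,
# to show that the patterns do not depend on the tag being used.
kf8_html = '<img src="Images/image00012.jpeg"/><image xlink:href="Images/cover00001.jpeg"/>'
kfx_html = '<img src="image_rsrc1A2.jpg"/><image xlink:href="image_rsrc1A3.jpg"/>'

kf8_refs = image_refs(kf8_html, KF8_PATTERN)
kfx_refs = image_refs(kfx_html, KFX_PATTERN)

print(len(kf8_refs), len(kfx_refs))
```

Because the patterns key on the file name rather than on the img tag, references inside SVG wrappers are counted as well, which is what made the totals line up.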

Note: The first image in the KFX (referenced in part0000.xhtml) is supposed to be an SVG, while the corresponding file in the KF8 is an image, though the cover page in the KF8 (cover_page.xhtml) is also an SVG. In other words, the KF8 has two separate images -- the less compressed coverxxxxx, and the more compressed imagexxxxx.

As the book cover image in the KFX is larger in file size than either of the two, I could just duplicate it and replace both KF8 covers with it; but since cover_page.xhtml in the KF8 seems to better match the first image in part0000.xhtml in the KFX, I'll probably rename that as the cover image in the KF8 -- leaving the first image in the KF8 alone. I'll have to try another book to see whether things match up the same way.

Because the KFX didn't have an alt text attribute for any of the images (img tag), essentially making them invisible to screen readers, I had to construct a find/replace regexp to add in some fake alt text by duplicating the file name there. It wasn't pretty, but hacks rarely are. I could then make a textual comparison between the two versions to make sure that the images matched where they were supposed to. I just checked the first and last five images to make sure I didn't have any offsets. Even though the total number of images in both files matched exactly, and the totals still matched after removing the duplicates, I wanted to make sure, especially since cover_page.xhtml didn't exist when converting the EPUB to HTML using Pandoc.
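The exact find/replace expression isn't given here, so the following is only a rough sketch of the same idea in Python: add an alt attribute to every img tag that lacks one, using the file name from src as the placeholder text. The function name, the regex, and the choice to strip any directory part from src are all assumptions for illustration.

```python
import re

# Match <img ...> tags that carry a src attribute but no alt attribute.
# Group 1: existing attributes, group 2: src value, group 3: optional " /".
IMG_WITHOUT_ALT = re.compile(
    r'<img\b(?![^>]*\balt=)([^>]*?\bsrc="([^"]+)"[^>]*?)(\s*/?)>'
)


def add_placeholder_alt(html: str) -> str:
    """Insert alt="<file name>" into img tags that have no alt attribute."""

    def repl(match: re.Match) -> str:
        attrs, src, closing = match.group(1), match.group(2), match.group(3)
        file_name = src.rsplit("/", 1)[-1]  # assumption: drop any directory part
        return f'<img{attrs} alt="{file_name}"{closing}>'

    return IMG_WITHOUT_ALT.sub(repl, html)


print(add_placeholder_alt('<img src="image_rsrc1A2.jpg"/>'))
```

Tags that already have an alt attribute are left untouched, so running it twice should not stack attributes. A regex like this is a hack in the same spirit as the original; a proper HTML parser would be more robust against unusual markup.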

As this is definitely doable, and the gruntwork of renaming can be automated, I can expend the effort for books that I know to have a lot of images, but only for those, as it's still very much an annoying process. Essentially, the steps are:
1. Download both KF8 and KFX versions, decrypt with DeDRM, and convert to EPUB using two different CLIs.
2. Convert the EPUBs to HTML using Pandoc to get the image file names in viewing order.
3. Use Notepad++ to strip the source of everything but the image file names (one per line) using a combination of regexp and search functionality.
4. Make sure the total number of references matches in both files, and use another regexp to strip the duplicated entries, again making sure the total number of entries matches. Also make sure the file extensions match, though hopefully that will never be an issue since Amazon already does its own conversion anyway.
5. Paste the contents of both lists into one file, using Notepad++ and a regexp to combine the file names line-by-line into rename commands for the terminal.
6. Rename using a terminal, replace the low-resolution images with the high-resolution images, and finally, repackage the final EPUB with the ebook-polish CLI.
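
Steps 4 and 5 can also be scripted instead of done by hand in Notepad++. The sketch below is one possible way to do it in Python, not the process actually used above: it removes duplicates while keeping viewing order, checks that counts and extensions match, and pairs the names up. The function names and sample file names are hypothetical, and it assumes the KFX images get renamed to the KF8 names so they can be dropped into the KF8 EPUB.

```python
from pathlib import PurePosixPath


def dedupe(refs):
    """Drop repeated references while keeping first-seen (viewing) order."""
    return list(dict.fromkeys(refs))


def build_rename_plan(kf8_refs, kfx_refs):
    """Pair KFX names with KF8 names, failing loudly on any mismatch."""
    if len(kf8_refs) != len(kfx_refs):
        raise ValueError("total reference counts differ")
    kf8_unique, kfx_unique = dedupe(kf8_refs), dedupe(kfx_refs)
    if len(kf8_unique) != len(kfx_unique):
        raise ValueError("unique reference counts differ")
    plan = []
    for kf8, kfx in zip(kf8_unique, kfx_unique):
        if PurePosixPath(kf8).suffix.lower() != PurePosixPath(kfx).suffix.lower():
            raise ValueError(f"extension mismatch: {kf8} vs {kfx}")
        # Assumed direction: (current KFX name, target KF8 name).
        plan.append((kfx, PurePosixPath(kf8).name))
    return plan


# Hypothetical lists, as produced by steps 2 and 3.
kf8 = ["Images/cover00001.jpeg", "Images/image00002.jpeg", "Images/image00002.jpeg"]
kfx = ["image_rsrcA.jpeg", "image_rsrcB.jpeg", "image_rsrcB.jpeg"]

for old_name, new_name in build_rename_plan(kf8, kfx):
    print(f"{old_name} -> {new_name}")
```

Each pair could then be fed to pathlib's Path.rename or printed as a terminal command. Spot-checking the first and last few pairs by eye is still worthwhile, since matching counts alone won't catch an offset.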

Last edited by twynn92; 12-26-2020 at 12:51 AM. Reason: Include part about matching file extensions