MobileRead Forums - View Single Post

meghane_e · 12-07-2025, 12:05 AM

I have an EPUB file that uses images instead of text for the chapter names. The file is very large because of all the graphics, so I'm trying to shrink it by replacing those images with text headings. The TOC is correct and usable, so the chapter titles can come from there.

I've written some Regex-Func functions for simpler stuff but this is harder for me to wrap my brain around.

Here's a sample from the TOC:
<p class="toc1"><a href="part0013.html#CCNA1-f66f6b0d51c44ea49012bf2fe61db1ae" class="toc_text"><strong class="calibre1">9. </strong> The Chickens Draw First Blood</a></p>
<p class="toc1"><a href="part0014.html#DB7S1-f66f6b0d51c44ea49012bf2fe61db1ae" class="toc_text"><strong class="calibre1">10. </strong> My Singing Makes Things Worse, and Everyone Is Totally Shocked</a></p>
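
One way to pull the chapter number, title, target file, and anchor id out of TOC entries shaped like the two above is a single regular expression. This is only a sketch using the standard library: `TOC_ENTRY` and `parse_toc` are made-up names, and the pattern assumes every entry follows the `toc1` / `<strong>` layout of the sample, which may not hold for the whole book.

```python
import re

# Matches one TOC entry of the shape shown in the sample. The class name
# "toc1" and the <strong> wrapper are assumptions taken from that sample.
TOC_ENTRY = re.compile(
    r'<p class="toc1"><a href="(?P<file>[^"#]+)#(?P<anchor>[^"]+)"[^>]*>'
    r'<strong[^>]*>(?P<num>[^<]*)</strong>(?P<title>.*?)</a></p>',
    re.DOTALL,
)


def parse_toc(toc_html):
    """Return {anchor_id: (file_name, chapter_number, chapter_title)}.

    The title is left exactly as it appears in the markup (entities are not
    decoded), so it can be written back into another HTML file unchanged.
    """
    entries = {}
    for m in TOC_ENTRY.finditer(toc_html):
        number = m.group("num").strip().rstrip(".")
        title = m.group("title").strip()
        entries[m.group("anchor")] = (m.group("file"), number, title)
    return entries
```

Keying the result by anchor id means the heading for a chapter page can be looked up from the `id` on its image `<div>`.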

#part0013.html:
<body class="calibre">
<div class="fullimage" id="DB7S1-f66f6b0d51c44ea49012bf2fe61db1ae"><img alt="" src="../images/00021.jpeg" class="calibre3"/></div>

I can see how to do this by brute force: unzipping the EPUB and running Python directly on the HTML files (I'm capable of that). But surely there are other tools available.
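
For comparison, the brute-force route could look roughly like the sketch below, which uses only the standard library. The names `FULLIMAGE_DIV`, `add_headings`, and `rewrite_epub` are hypothetical, `headings` is an assumed dict mapping anchor ids to heading text already built from the TOC, and the code assumes the chapter files are UTF-8 and that every image-only opener looks like the `fullimage` div in the sample.

```python
import re
import zipfile

# Matches the image-only chapter opener shown in the sample chapter file.
# The class name "fullimage" is an assumption taken from that sample.
FULLIMAGE_DIV = re.compile(
    r'<div class="fullimage" id="(?P<anchor>[^"]+)">\s*<img\b[^>]*/>\s*</div>'
)


def add_headings(page_html, headings):
    """Swap each image div whose id is in `headings` for an <h2> with the same id."""
    def swap(m):
        text = headings.get(m.group("anchor"))
        if text is None:
            return m.group(0)  # unknown anchor: leave the image in place
        return f'<h2 id="{m.group("anchor")}">{text}</h2>'
    return FULLIMAGE_DIV.sub(swap, page_html)


def rewrite_epub(src_path, dst_path, headings):
    """Copy the EPUB to dst_path, rewriting every .html/.xhtml member."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        # infolist() preserves the original member order, so "mimetype"
        # stays first if it was first in the source archive.
        for info in src.infolist():
            payload = src.read(info.filename)
            if info.filename.lower().endswith((".html", ".xhtml")):
                payload = add_headings(payload.decode("utf-8"), headings).encode("utf-8")
            # The "mimetype" member must stay uncompressed; compress the rest.
            method = zipfile.ZIP_STORED if info.filename == "mimetype" else zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, payload, compress_type=method)
```

Keeping the original `id` on the new `<h2>` means the existing TOC links still resolve. This sketch does not delete the image files or their manifest entries, so the file only gets smaller once those are removed as well.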

Thank you for suggestions or pointers.

[EDIT]
Here's what I have so far (which is missing most of it, I know):
[EDIT2]
I'm pretty sure the code I've started below is not going in the right direction.

Code:

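from bs4 import BeautifulSoup

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Parse the matched text; this assumes the search expression matches one
    # whole TOC entry, i.e. <p class="toc1">...</p>
    soup = BeautifulSoup(match.group(), "html.parser")

    # Split on the first '.' only, so titles that contain periods stay intact
    chap_num, _, chap_title = soup.a.get_text().partition('.')
    chap_num = chap_num.strip()
    chap_title = chap_title.strip()
    # href looks like "part0013.html#<anchor id>"
    target_file, _, anchor_id = soup.a['href'].partition('#')

    # not sure how to go to/check the next item in the TOC

    # not sure how to place the output onto the right page
    return f'<h2>{chap_num} {chap_title}</h2>'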

Thread title: Using a TOC to create Chapter headings
Last edited by meghane_e; 12-07-2025 at 03:38 PM. Reason: Adding to function as I go