10-25-2023, 08:06 PM   #11
tomsem
Grand Sorcerer
Okay, here is a first take at a script that adds a section header for each chapter and removes the chapter strings from the note headers. It prefixes each chapter title with a bullet character so the new chapter headers are set off from the 'higher level' section headers. It writes new HTML files with '-new' appended to the original filename (before the '.html' extension).

If it does not find the chapter pattern in the note headings, it adds no section headers and removes nothing.
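To show what the chapter pattern picks up, here is a small sketch that runs it on two made-up heading strings. The sample strings and the `chapter_title` helper are assumptions for illustration only; real note headings may be worded differently.

```python
from re import match, DOTALL

# Same pattern as in the script: "<anything> - <chapter> > <rest>"
chapter_pattern = r".*? - (.*?)( > ).*"

# Hypothetical heading strings, not taken from a real export.
samples = [
    "Highlight (yellow) - Chapter 1 > Page 5",
    "Note - Page 12",
]


def chapter_title(heading):
    """Return the chapter title, or None when the pattern does not match."""
    found = match(chapter_pattern, heading, flags=DOTALL)
    return found.group(1) if found else None


for sample in samples:
    print(repr(sample), '->', chapter_title(sample))
```

The second sample has no ' > ' separator, so the pattern does not match and that heading would be left alone.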

You can use calibre to run it:

Code:
[path to calibre executables]calibre-debug gather_chapter_notes.py html1 [html2, ...]
gather_chapter_notes.py
Code:
from re import match, DOTALL
from sys import argv

from bs4 import BeautifulSoup

chapter_pattern = r".*? - (.*?)( > ).*"


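def gather_chapter_notes(html: str) -> str:
    """Add a section heading per chapter and strip the chapter from note headings."""
    soup = BeautifulSoup(html, 'html.parser')
    title_insert = {}  # chapter title -> first note heading in that chapter
    for note_heading in soup.find_all('div', class_='noteHeading'):
        if not note_heading.contents:
            continue
        content = note_heading.contents[-1]
        # The last child may be a tag rather than text; only text can be matched.
        if not isinstance(content, str):
            continue
        if matches := match(chapter_pattern, content, flags=DOTALL):
            title = matches.group(1)
            if title not in title_insert:
                title_insert[title] = note_heading
            # Drop 'title > ' from the heading text in the tree itself, so the
            # result does not depend on how '>' is escaped when serialized.
            content.replace_with(content[:matches.start(1)] + content[matches.end(2):])

    # Insert one bulleted section heading before the first note of each chapter.
    for title, node in title_insert.items():
        title_section = soup.new_tag('div', attrs={'class': 'sectionHeading'})
        title_section.string = f'● {title}'
        node.insert_before(title_section)

    return str(soup)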


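for arg in argv[1:]:
    # Read and write as UTF-8 (assumes the input is UTF-8) so the '●' bullet
    # is not lost to a platform-specific default encoding.
    with open(arg, encoding='utf-8') as f:
        html_text = f.read()

    new_html = gather_chapter_notes(html=html_text)

    # Write the result next to the original, with '-new' before '.html'.
    with open(arg.replace('.html', '-new.html'), 'w', encoding='utf-8') as f:
        f.write(new_html)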
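One caveat with the output name: arg.replace('.html', '-new.html') returns the name unchanged when the file does not end in '.html' (for example '.htm'), so the script would then write over its input. A possible alternative, sketched with pathlib, is below; `new_name` is a hypothetical helper and not part of the attached script.

```python
from pathlib import Path


def new_name(path):
    """Return path with '-new' appended to the file stem, keeping the suffix."""
    p = Path(path)
    return str(p.with_name(f'{p.stem}-new{p.suffix}'))


print(new_name('notes.html'))
print(new_name('notes.htm'))
```

This always produces a name different from the input, whatever the extension is.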
Attached Thumbnails: Screenshot 2023-10-25 at 5.01.15 PM.png, Screenshot 2023-10-25 at 5.01.34 PM.png
Attached Files: gather_chapter_notes.py.zip (1.0 KB)