Old 08-19-2018, 10:28 PM   #9
slowsmile
Witchman
 
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@pittendrigh...It wouldn't be difficult to write a Python command-line script to extract the head and html tags from each epub file. Just export all the epub xhtml files from Sigil to a directory and then run a Python script from that directory that does something like this (using Python 3.4):

Code:

from bs4 import BeautifulSoup

# get the file list (separated by spaces) as a string from the command prompt
print("\nInput html file names:\n")
files = input()

# convert input string to file list
file_list = files.split()

# remove the html and head tags
for file in file_list:
    outfile = 'new_' + file
    with open(file, 'rt', encoding='utf-8') as infp:
        html = infp.read()
    soup = BeautifulSoup(html, 'html5lib')   # parses like a web browser

    # drop the head element entirely (tag plus contents)
    head_tag = soup.find('head')
    if head_tag:
        head_tag.extract()

    # remove the html tag itself but keep its children
    html_tag = soup.find('html')
    if html_tag:
        html_tag.unwrap()

    with open(outfile, 'wt', encoding='utf-8') as outfp:
        outfp.write(str(soup))
....Or you could perhaps even do it without using BeautifulSoup like this:

Code:
# get the file list as a string from the command prompt
print("\nInput html file names:\n")
files = input()

# convert input string to file list
file_list = files.split()

# remove the html and head tags
for file in file_list:
    outfile = 'new_' + file
    with open(file, 'rt', encoding='utf-8') as infp, \
         open(outfile, 'wt', encoding='utf-8') as outfp:
        for line in infp:
            if line.lstrip().startswith('<html') or \
               line.lstrip().startswith('</html>') or \
               line.lstrip().startswith('<head') or \
               line.lstrip().startswith('</head>'):
                continue
            outfp.write(line)
You could probably also use Tidy in a script to do exactly the same as the above.
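If you'd rather not type the file names in at a prompt, a small variant could pick them up automatically with the standard library's glob module. This is just a sketch along the lines of the second script above; the strip_tags helper name and the *.xhtml pattern are my own assumptions:

Code:

```python
import glob

def strip_tags(text):
    """Drop lines that open or close the <html> or <head> elements
    (hypothetical helper, same line test as the second script above)."""
    kept = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip()
        if stripped.startswith(('<html', '</html>', '<head', '</head>')):
            continue
        kept.append(line)
    return ''.join(kept)

# process every exported .xhtml file in the current directory
for name in glob.glob('*.xhtml'):
    with open(name, 'rt', encoding='utf-8') as infp:
        text = infp.read()
    with open('new_' + name, 'wt', encoding='utf-8') as outfp:
        outfp.write(strip_tags(text))
```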

Last edited by slowsmile; 08-19-2018 at 10:38 PM.