Old 08-19-2018, 10:28 PM   #9
slowsmile
Witchman
 
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
@pittendrigh...It wouldn't be difficult to write a Python command-line script to extract the head and html tags from each epub file. Just export all the epub xhtml files from Sigil to a directory and then run a Python script from that directory that does something like this (using Python 3.4):

Code:

from bs4 import BeautifulSoup

# get the file list (separated by spaces) as a string from the command prompt
print("\nInput html file names:\n")
files = input()

# convert input string to file list
file_list = files.split()

# remove the html and head tags
for file in file_list:
    outfile = 'new_' + file
    with open(file, 'rt', encoding='utf-8') as infp:
        html = infp.read()
    soup = BeautifulSoup(html, 'html5lib')   # parses like a web browser

    # drop the head element entirely (tag plus contents)
    head_tag = soup.find('head')
    if head_tag:
        head_tag.extract()

    # remove the html tag itself but keep its children
    html_tag = soup.find('html')
    if html_tag:
        html_tag.unwrap()

    with open(outfile, 'wt', encoding='utf-8') as outfp:
        outfp.write(str(soup))
....Or you could perhaps even do it without using BeautifulSoup like this:

Code:
# get the file list as a string from the command prompt
print("\nInput html file names:\n")
files = input()

# convert input string to file list
file_list = files.split()

# remove the html and head tags
for file in file_list:
    outfile = 'new_' + file
    with open(file, 'rt', encoding='utf-8') as infp, \
         open(outfile, 'wt', encoding='utf-8') as outfp:
        for line in infp:
            if line.lstrip().startswith('<html') or \
               line.lstrip().startswith('</html>') or \
               line.lstrip().startswith('<head') or \
               line.lstrip().startswith('</head>'):
                continue
            outfp.write(line)
You could probably also use Tidy in a script to do exactly the same as the above.
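If you'd rather not type the file names in at a prompt, a small variant could pick them up automatically with the standard library's glob module. This is just a sketch along the lines of the second script above; the strip_tags helper name and the *.xhtml pattern are my own assumptions:

Code:

```python
import glob

def strip_tags(text):
    """Drop lines that open or close the <html> or <head> elements
    (hypothetical helper, same line test as the second script above)."""
    kept = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip()
        if stripped.startswith(('<html', '</html>', '<head', '</head>')):
            continue
        kept.append(line)
    return ''.join(kept)

# process every exported .xhtml file in the current directory
for name in glob.glob('*.xhtml'):
    with open(name, 'rt', encoding='utf-8') as infp:
        text = infp.read()
    with open('new_' + name, 'wt', encoding='utf-8') as outfp:
        outfp.write(strip_tags(text))
```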

Last edited by slowsmile; 08-19-2018 at 10:38 PM.