Old 01-15-2017, 02:01 PM   #64
KevinH
Sigil Developer
@slowsmile

Quick question: why do you need to use bs4 to convert to utf-8 here?

Code:
def convertFile2UTF8(wdir, file, encoder):
    """ Converts input file to utf-8 format
    """
    print(' -- Convert input file to utf-8 if required')
    
    original_filename = file
    output = wdir + os.sep + 'fix_encoding.htm'
    outfp = open(output, 'wt', encoding=('utf-8'))
    html = open(file, 'rt', encoding=encoder).read()  
    
    # safely convert to unicode utf-8 using bs4
    soup = BeautifulSoup(html, 'html.parser')
    outfp.writelines(str(soup))
    
    outfp.close()          
    os.remove(file)
    shutil.copy(output, file)        
    os.remove(output)
    
    return(file)
It seems a strange way to do the conversion when you know the encoding.

A shorter way to handle this might be to use the built-in text encoding conversion when reading from and writing to files, like so:

Code:
    with open(file, 'rt', encoding=encoder) as f1:
        htmldat = f1.read()
    with open(wdir + os.sep + 'fix_encoding.htm', 'wt', encoding='utf-8') as f2:
        f2.write(htmldat)
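For what it's worth, here is a rough sketch of how that read/write pair could stand in for the whole function. The name convert_file_to_utf8 and the '.utf8.tmp' suffix are just placeholders of mine, not anything from your plugin, and I am assuming you want the original file replaced in place and its line endings left alone (that is what newline='' is for).

```python
import os


def convert_file_to_utf8(path, source_encoding):
    """Rewrite the text file at `path` as utf-8, given its known encoding.

    Hypothetical helper: the name and the temp-file suffix are illustrative.
    """
    # newline='' turns off newline translation, so \r\n stays \r\n
    with open(path, 'rt', encoding=source_encoding, newline='') as src:
        text = src.read()

    # write the utf-8 copy next to the original first ...
    tmp_path = path + '.utf8.tmp'
    with open(tmp_path, 'wt', encoding='utf-8', newline='') as dst:
        dst.write(text)

    # ... then swap it in; os.replace overwrites the destination if it exists
    os.replace(tmp_path, path)
    return path
```

Calling convert_file_to_utf8(file, encoder) should then leave the same file on disk, just re-encoded, without the separate os.remove / shutil.copy / os.remove steps.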
Or you can read the file in binary mode as bytes and write it back as utf-8, using Python's built-in bytes .decode() and str .encode() methods:

Code:
    # read the raw bytes; the with block makes sure the file gets closed
    with open(file, 'rb') as f:
        htmldat = f.read()
    # decode converts bytes to a string
    htmlstr = htmldat.decode(encoder)
    # encode converts a string to bytes in that encoding
    with open(file, 'wb') as f:
        f.write(htmlstr.encode('utf-8'))
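To make the bytes round trip concrete, here is a tiny sketch on an in-memory sample rather than a file. The helper name reencode_to_utf8 and the cp1252 sample are my own made-up illustration, assuming the source encoding is known up front:

```python
def reencode_to_utf8(data, source_encoding):
    """Hypothetical helper: bytes in a known encoding -> utf-8 bytes."""
    text = data.decode(source_encoding)  # bytes -> str
    return text.encode('utf-8')          # str -> utf-8 bytes


# 'café' as a cp1252 editor would store it: the é is the single byte 0xE9
raw = b'caf\xe9'
print(reencode_to_utf8(raw, 'cp1252'))  # b'caf\xc3\xa9'
```

The single byte 0xE9 becomes the two-byte utf-8 sequence 0xC3 0xA9, and no parsing of the markup is involved at any point.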
Either would work, unless there is something else specific you are trying to achieve by having bs4 parse the entire thing and then convert it all back to unicode?

Just wondering?

KevinH