Old 09-24-2014, 08:10 AM   #10
kovidgoyal
creator of calibre
You should not use BeautifulSoup to parse. The parsing strategy to follow would be:

1) Try to parse as XML, applying various simple corrections so that documents that are only slightly invalid still parse.
2) If (1) fails, parse as HTML 5.
3) If (2) fails, parse as HTML 4 and/or use BeautifulSoup.

See parse_utils.py in the calibre source code.
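
For illustration only, here is a minimal sketch of that cascade. It is not the code from parse_utils.py. The function names (parse_cascade, fix_named_entities and so on) are made up for this example, the only "simple correction" shown is rewriting HTML named entities such as &nbsp; into numeric references, and steps 2 and 3 assume the third-party html5lib and bs4 packages are installed. If one is missing, the import error just counts as that step failing.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already defines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def fix_named_entities(text):
    """One example of a simple correction: turn HTML named entities
    (e.g. &nbsp;) into numeric references so a strict XML parser accepts them."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


def parse_xml(text):
    # Step 1: strict XML parse after the correction above.
    return ET.fromstring(fix_named_entities(text))


def parse_html5(text):
    # Step 2: HTML 5 parsing algorithm (assumes html5lib is installed).
    import html5lib
    return html5lib.parse(text, namespaceHTMLElements=False)


def parse_soup(text):
    # Step 3: last resort (assumes bs4 is installed).
    from bs4 import BeautifulSoup
    return BeautifulSoup(text, "html.parser")


DEFAULT_PARSERS = (
    ("xml", parse_xml),
    ("html5", parse_html5),
    ("soup", parse_soup),
)


def parse_cascade(text, parsers=DEFAULT_PARSERS):
    """Try each parser in order; return (name, tree) for the first that works."""
    errors = []
    for name, parser in parsers:
        try:
            return name, parser(text)
        except Exception as exc:  # any failure means "fall through to next step"
            errors.append((name, exc))
    raise ValueError("all parsers failed: %r" % (errors,))
```

Calling parse_cascade(markup) returns both the tree and the name of the step that produced it, which is worth logging: a CFI resolved against an "xml" tree and one resolved against a "soup" tree may point at different nodes. Note that the three steps return different tree types, so real code would need to normalise them before resolving a CFI.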

Of course, the correct solution is to use the exact parsing algorithm used by the software that generated the CFI. Since that is not practical, IMO the above cascade will likely give you the best results, with perhaps a few modifications to handle common cases.