You should not use BeautifulSoup as your first choice for parsing. The parsing strategy to follow would be:
1) Try to parse as XML, implementing various simple corrections so that only slightly invalid documents still parse.
2) If (1) fails, parse as HTML 5.
3) If (2) fails, parse as HTML 4 and/or use BeautifulSoup.
See parse_utils.py in the calibre source code.
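As a rough illustration of the cascade above, here is a minimal sketch. It is not calibre's code: the names `fix_minor_xml_problems` and `parse_cascade` are made up for this example, the only "simple corrections" shown are stripping a BOM and converting named HTML entities to numeric ones, and it assumes html5lib and BeautifulSoup (bs4) as optional third-party fallbacks.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML already understands; everything else is HTML-only.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def fix_minor_xml_problems(text):
    """Hypothetical helper: apply a few cheap fixes before XML parsing."""
    # A BOM or whitespace before the XML declaration makes the parse fail.
    text = text.lstrip("\ufeff").lstrip()

    def to_numeric(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave it alone
        # e.g. &nbsp; becomes &#160;, which XML accepts without a DTD
        return "&#%d;" % name2codepoint[name]

    # Note: this naive substitution also touches text inside CDATA sections.
    return ENTITY_RE.sub(to_numeric, text)


def parse_cascade(text):
    """Return (stage, tree), where stage names the parser that succeeded."""
    # 1) XML, after simple corrections
    try:
        return "xml", ET.fromstring(fix_minor_xml_problems(text))
    except ET.ParseError:
        pass

    # 2) HTML 5 (assumes html5lib is installed; skipped otherwise)
    try:
        import html5lib
    except ImportError:
        html5lib = None
    if html5lib is not None:
        try:
            tree = html5lib.parse(text, treebuilder="etree",
                                  namespaceHTMLElements=False)
            return "html5", tree
        except Exception:
            pass

    # 3) Last resort: BeautifulSoup (assumes bs4 is installed)
    try:
        from bs4 import BeautifulSoup
    except ImportError:
        raise ValueError("no available parser could handle the document") from None
    return "soup", BeautifulSoup(text, "html.parser")
```

Each stage returns a different kind of tree, so a real implementation would normalise the result to one tree type before resolving a CFI against it.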
Of course, the correct solution is to use the exact parsing algorithm used by the software that generated the CFI. Since that is not practical, IMO the above cascade will likely give you the best results, with perhaps a few modifications to handle common cases.