MobileRead Forums - View Single Post - How does the calibre viewer calculate page number and total pages?

kovidgoyal · 09-24-2014, 09:10 AM

You should not use BeautifulSoup to parse. The parsing strategy to follow would be:

1) Try to parse as XML, applying various simple corrections so that documents that are only slightly invalid still parse.
2) If (1) fails, parse as HTML 5.
3) If (2) fails, parse as HTML 4 and/or use BeautifulSoup.

See parse_utils.py in the calibre source code.
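
As a rough illustration of the cascade idea (this is not calibre's code, and parse_utils.py remains the reference), here is a minimal Python sketch using only the standard library. The names parse_as_xml, parse_leniently, parse_with_fallback and STAGES are made up for this example, and the two repairs in stage 1 are just examples of "simple corrections". The standard library has no HTML 5 parser, so a small tag-soup builder stands in for stages 2 and 3. It does not implement the HTML 5 parsing algorithm; in real code you would plug an actual HTML 5 parser and an HTML 4/BeautifulSoup fallback into the list of stages.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_as_xml(text):
    """Stage 1: strict XML parse after a few simple, illustrative repairs."""
    fixed = text.lstrip("\ufeff \t\r\n")        # stray BOM / leading whitespace
    fixed = fixed.replace("&nbsp;", "&#160;")   # HTML-only entity, undefined in XML
    return ET.fromstring(fixed)                 # raises ET.ParseError if still invalid


class _LenientBuilder(HTMLParser):
    """Stand-in for the later stages: builds a tree from tag soup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")          # synthetic wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        el = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(el)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                         # text after a child goes in its tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_leniently(text):
    """Hypothetical fallback stage; never rejects its input."""
    builder = _LenientBuilder()
    builder.feed(text)
    builder.close()
    return builder.root


# Ordered from strictest to most forgiving.
STAGES = [("xml", parse_as_xml), ("lenient-html", parse_leniently)]


def parse_with_fallback(text, stages=STAGES):
    """Return (stage_name, tree) from the first stage that succeeds."""
    errors = []
    for name, stage in stages:
        try:
            return name, stage(text)
        except Exception as exc:                # any failure means: try the next stage
            errors.append(f"{name}: {exc}")
    raise ValueError("all parsers failed: " + "; ".join(errors))
```

Calling parse_with_fallback on a well-formed XHTML string returns the tree from the "xml" stage, while something like an unclosed <br> falls through to the lenient stage. Keep in mind that different stages can produce different trees for the same markup, and a CFI computed against one tree may not resolve against another.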

Of course, the correct solution is to use the exact parsing algorithm used by the software that generated the CFI. Since that is not practical, IMO the above cascade will likely give you the best results, with perhaps a few modifications to handle common cases.
