Quote:
Originally Posted by jackie_w
Please could someone guide me in the right direction.
I'm still feeling my way with Python and object-oriented stuff in general. To date, when I have been analysing epub opfs and occasionally htmls, I have achieved what I needed using regex. However, on poking around calibre source I see parsers being used, namely BeautifulSoup and lxml etree.
I haven't used a parser before, but it looks like something I ought to explore. What I would like to know is, under what circumstances might I choose to use BeautifulSoup rather than lxml etree, and vice versa?
I'm probably not the best person to answer this, but I'll comment. I think of lxml etree as useful when the XML is well-formed, typically something created by calibre. I think of BeautifulSoup as the tool for HTML of uncertain origin, typically web pages (particularly in the calibre recipe system). It can handle some malformed HTML, and you can easily find things when you don't already know the tags.
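To make the strict-versus-lenient distinction concrete, here is a rough sketch. It deliberately uses only standard-library stand-ins so it runs anywhere: xml.etree.ElementTree in place of lxml etree (the basic API is largely the same), and html.parser.HTMLParser in place of BeautifulSoup (BeautifulSoup can use it as a backend via its "html.parser" option). The sample strings and the helper names strict_title, strict_accepts, TagCollector and lenient_tags are made up for illustration and are not taken from calibre.

```python
# Standard-library stand-ins (assumption: this is NOT lxml or BeautifulSoup
# itself, only the same two styles of parsing).
import xml.etree.ElementTree as ET   # strict XML parser, etree-style API
from html.parser import HTMLParser   # lenient HTML tokenizer

# Well-formed, OPF-like fragment (hypothetical sample data).
GOOD_XML = "<package><metadata><title>My Book</title></metadata></package>"

# Tag soup of the kind found on web pages: unclosed <p> and <b> (hypothetical).
BAD_HTML = "<html><body><p>First<p>Second <b>bold</body></html>"


def strict_title(xml_text):
    """Parse strictly and return the <title> text; raises on malformed input."""
    root = ET.fromstring(xml_text)
    return root.findtext("metadata/title")


def strict_accepts(text):
    """Return True if the strict parser can parse the text at all."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient pass: records every start tag, never demands balanced tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(html_text):
    """Return the start tags seen, even in malformed HTML."""
    collector = TagCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(strict_title(GOOD_XML))     # known structure: direct path lookup
    print(strict_accepts(BAD_HTML))   # strict parser rejects the tag soup
    print(lenient_tags(BAD_HTML))     # lenient parser still finds the tags
```

With the real libraries the shape is similar: an etree-style parser gives you path lookups on a tree you can trust, while a BeautifulSoup-style parser lets you search whatever it managed to recover from messy input.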