MobileRead Forums - View Single Post

Starson17 · 07-21-2011, 09:57 AM

Quote:

Originally Posted by jackie_w

Please could someone guide me in the right direction.

I'm still feeling my way with Python and object-oriented stuff in general. To date, when I have been analysing epub opfs and occasionally htmls, I have achieved what I needed using regex. However, on poking around calibre source I see parsers being used, namely BeautifulSoup and lxml etree.

I haven't used a parser before, but it looks like something I ought to explore. What I would like to know is, under what circumstances might I choose to use BeautifulSoup rather than lxml etree, and vice versa?

I'm probably not the best person to answer this, but I'll comment. I think of lxml etree as useful when the XML is well-formed, typically something created by Calibre. I think of BeautifulSoup as the tool for HTML of uncertain origin, typically web pages (particularly in the Calibre recipe system). It can cope with some malformed HTML, and it makes it easy to find things when you don't already know which tags to expect.
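To make the distinction concrete, here is a rough sketch, not anything taken from the calibre source. It assumes the standalone lxml and bs4 (BeautifulSoup 4) packages are installed; calibre bundles its own copies, whose import paths may differ. The OPF and HTML snippets and the helper names (opf_title, parses_as_xml, link_targets) are made up for illustration.

```python
from lxml import etree
from bs4 import BeautifulSoup

# A small, well-formed OPF fragment (made-up sample data).
OPF = b"""<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </metadata>
</package>"""

# Tag soup of the kind found on real web pages: unquoted attributes,
# unclosed <p>, <b> and <a> tags, no closing </html> (made-up sample data).
MESSY = "<html><body><p class=intro>First<p>Second <b>bold<br><a href=x.html>link</body>"

DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def opf_title(opf_bytes):
    """Read dc:title from well-formed OPF using lxml etree."""
    root = etree.fromstring(opf_bytes)
    node = root.find(".//dc:title", namespaces=DC_NS)
    return node.text if node is not None else None


def parses_as_xml(markup):
    """Return True if lxml's strict XML parser accepts the markup."""
    try:
        etree.fromstring(markup)
        return True
    except etree.XMLSyntaxError:
        return False


def link_targets(html):
    """Collect href values from possibly malformed HTML with BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    print(opf_title(OPF))        # title read from the well-formed OPF
    print(parses_as_xml(MESSY))  # the strict XML parser rejects the tag soup
    print(link_targets(MESSY))   # BeautifulSoup still finds the link
```

The strict parser is the one to reach for when you control the input and want to hear about errors; the forgiving one is for when you just need to get something out of whatever the page contains.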