Calibre and parsers, info please

jackie_w · 07-21-2011, 10:40 AM

Please could someone guide me in the right direction.

I'm still feeling my way with Python and object-oriented stuff in general. To date, when I have been analysing epub opfs and occasionally htmls, I have achieved what I needed using regex. However, on poking around calibre source I see parsers being used, namely BeautifulSoup and lxml etree.

I haven't used a parser before, but it looks like something I ought to explore. What I would like to know is, under what circumstances might I choose to use BeautifulSoup rather than lxml etree, and vice versa?

Starson17 · 07-21-2011, 10:57 AM

Quote:

Originally Posted by jackie_w

Please could someone guide me in the right direction.

I'm still feeling my way with Python and object-oriented stuff in general. To date, when I have been analysing epub opfs and occasionally htmls, I have achieved what I needed using regex. However, on poking around calibre source I see parsers being used, namely BeautifulSoup and lxml etree.

I haven't used a parser before, but it looks like something I ought to explore. What I would like to know is, under what circumstances might I choose to use BeautifulSoup rather than lxml etree, and vice versa?

I'm probably not the best person to answer this, but I'll comment. I think of lxml etree as useful when the XML is well-formed, typically something created by calibre. I think of BeautifulSoup as handling HTML of uncertain origins, typically web pages (particularly in the calibre recipe system). It can handle some malformed HTML, and you can easily find stuff when you don't already know the tags.

jackie_w · 07-21-2011, 11:04 AM

Thank you, Starson.

As I am currently interested in epub innards (not necessarily calibre-created) it sounds like I should start with lxml.

Starson17 · 07-21-2011, 11:22 AM

Quote:

Originally Posted by jackie_w

Thank you, Starson.

As I am currently interested in epub innards (not necessarily calibre-created) it sounds like I should start with lxml.

I work on recipes a lot, so I'm much more familiar with BeautifulSoup than lxml, although I've run into both. I'm sure Kovid will comment if you're on the wrong track, although you might want to flesh out what you'll be doing.

kovidgoyal · 07-21-2011, 11:55 AM

I would recommend using lxml, that's what the calibre conversion engine uses. EPUB is not actually HTML, it is XHTML, so you want a parser that has good XML support and support for namespaces.
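As a rough sketch of what namespace-aware parsing looks like in practice, the snippet below reads the title and manifest entries out of an OPF file. The sample OPF content and the helper name summarize_opf are made up for illustration and are not taken from calibre's source. The fallback import is only there so the snippet also runs where lxml is not installed; the calls used here behave the same way in both libraries.

```python
# Prefer lxml; fall back to the standard library's ElementTree, which
# offers the same fromstring/findtext/findall calls used below.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Namespace URIs used in an OPF 2.0 package document.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, minimal OPF content for illustration only.
SAMPLE_OPF = b"""<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
    <dc:creator>A. N. Author</dc:creator>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
</package>"""


def summarize_opf(opf_bytes):
    """Return (title, list of manifest hrefs) from raw OPF bytes."""
    root = etree.fromstring(opf_bytes)
    # Every element lives in a namespace, so each path step needs a prefix
    # that is mapped to the right URI through the NS dictionary.
    title = root.findtext("opf:metadata/dc:title", namespaces=NS)
    hrefs = [item.get("href")
             for item in root.findall("opf:manifest/opf:item", NS)]
    return title, hrefs


if __name__ == "__main__":
    print(summarize_opf(SAMPLE_OPF))
```

Note that an unprefixed lookup such as root.find("metadata") finds nothing here, because the element's real name includes the OPF namespace. That is the kind of detail a regex quietly ignores and a namespace-aware parser makes explicit.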

jackie_w · 07-21-2011, 12:17 PM

I'm working on a personal project which could best be described as an extension to Tweak-epub. It started life as a result of an enhancement request (#8252) I made prior to the current bug system being introduced.

It aims to automate some of the things I find myself doing frequently whilst in Tweak-epub. It also allows me to use some calibre-conversion features without actually doing an epub-epub conversion. (I am currently able to use most of its features on epubs not yet in calibre, even if calibre is not open, so I haven't made good use of existing calibre classes and methods.)

As such I need the ability to analyse the opf, xpgt, css and html files and make tweaks as required. What I could really use is a CSS parser; regex works but is a bit messy.

I am aware that there is a fair amount of overlap with what kiwidude is doing with his 'Modify epub' plugin. However, I had already started this learning project before it became available, so I have continued with it. I'm not interested in any kind of bulk processing, it's purely a one-epub-at-a-time tweaker. It is unlikely to ever see the light of day in the general community as kiwidude is a far better programmer than I will ever be.

jackie_w · 07-21-2011, 12:19 PM

Quote:

Originally Posted by kovidgoyal

I would recommend using lxml, that's what the calibre conversion engine uses. EPUB is not actually HTML, it is XHTML, so you want a parser that has good XML support and support for namespaces.

I see you nipped in whilst I was trying to explain what I'm trying to do. Thank you

kovidgoyal · 07-21-2011, 12:23 PM

If you want to parse CSS you use cssutils, which is what calibre uses.

jackie_w · 07-21-2011, 12:30 PM

Thanks, I've obviously not been poking around the right places so far.

user_none · 07-21-2011, 12:40 PM

lxml is much faster and uses less memory than BeautifulSoup.

Also look at calibre.ebooks.oeb.stylizer; it is used a lot during conversion and makes handling CSS easy. Look at oeb2html, part of HTMLZ, for a somewhat simple example.

jackie_w · 07-21-2011, 01:30 PM

Thanks for the practical examples. I find Python documentation hard-going when trying to get started.

Agama · 07-22-2011, 04:20 PM

Quote:

Originally Posted by jackie_w

It aims to automate some of the things I find myself doing frequently whilst in Tweak-epub. It also allows me to use some calibre-conversion features without actually doing an epub-epub conversion. (I am currently able to use most of its features on epubs not yet in calibre, even if calibre is not open, so I haven't made good use of existing calibre classes and methods.)

As such I need the ability to analyse the opf, xpgt, css and html files and make tweaks as required. What I could really use is a CSS parser, regex works but is a bit messy.

I'm really interested in what you're doing here as I am going down the same route, processing ePubs to automate post-conversion changes. So if you get things working I would be interested to see what you come up with.



