Old 07-21-2011, 09:40 AM   #1
jackie_w
Grand Sorcerer
Posts: 6,212
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
Calibre and parsers, info please

Please could someone guide me in the right direction.

I'm still feeling my way with Python and object-oriented stuff in general. To date, when I have been analysing epub opfs and occasionally htmls, I have achieved what I needed using regex. However, on poking around calibre source I see parsers being used, namely BeautifulSoup and lxml etree.

I haven't used a parser before, but it looks like something I ought to explore. What I would like to know is, under what circumstances might I choose to use BeautifulSoup rather than lxml etree, and vice versa?
Old 07-21-2011, 09:57 AM   #2
Starson17
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
I'm probably not the best person to answer this, but I'll comment. I think of lxml etree as useful when the XML is well formed, typically something created by calibre. I think of BeautifulSoup as handling HTML of uncertain origin, typically web pages (particularly in the calibre recipe system). It can handle some malformed HTML, and you can easily find things when you don't already know the tags.
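To make the strict-versus-lenient distinction concrete, here is a minimal sketch using only the Python standard library. It is a stand-in, not what calibre does: xml.etree.ElementTree plays the role of lxml etree (a strict XML parser), and html.parser plays the role of a forgiving HTML parser such as BeautifulSoup. The sample strings and helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

WELL_FORMED = "<html><body><p>One</p><p>Two</p></body></html>"
MALFORMED = "<html><body><p>One<p>Two<br></body>"  # unclosed tags, as on many web pages


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient parser that just records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup):
    """Return the start tags found by the lenient HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(parses_as_xml(WELL_FORMED))  # strict parser accepts well-formed input
print(parses_as_xml(MALFORMED))    # strict parser rejects the sloppy input
print(lenient_tags(MALFORMED))     # lenient parser still reports the tags
```

The strict parser refuses the second string outright, while the lenient one still yields its tags, which is roughly the trade-off described above.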
Old 07-21-2011, 10:04 AM   #3
jackie_w
Grand Sorcerer
Thank you, Starson.

As I am currently interested in epub innards (not necessarily calibre-created) it sounds like I should start with lxml.
Old 07-21-2011, 10:22 AM   #4
Starson17
Wizard
I work on recipes a lot, so I'm much more familiar with BeautifulSoup than lxml, although I've run into both. I'm sure Kovid will comment if you're on the wrong track, although you might want to flesh out what you'll be doing.
Old 07-21-2011, 10:55 AM   #5
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
I would recommend using lxml, that's what the calibre conversion engine uses. EPUB is not actually HTML, it is XHTML, so you want a parser that has good XML support and support for namespaces.
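As a rough illustration of why namespace support matters, the sketch below reads the title and manifest from a tiny, made-up OPF document. It uses the standard library's xml.etree.ElementTree so that it runs without extra installs; lxml.etree offers the same fromstring/findall/findtext calls, so swapping the import should be enough, though that is an assumption worth testing against your own files. The sample OPF and the function name read_opf are invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OPF 2.0 package files and Dublin Core metadata.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, cut-down OPF used only for this example.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
</package>"""


def read_opf(opf_text):
    """Return (title, list of manifest hrefs) from OPF text."""
    root = ET.fromstring(opf_text)
    title = root.findtext("opf:metadata/dc:title", namespaces=NS)
    hrefs = [item.get("href") for item in root.findall("opf:manifest/opf:item", NS)]
    return title, hrefs


def manifest_items_without_namespace(opf_text):
    """Show the common mistake: searching without the namespace finds nothing."""
    root = ET.fromstring(opf_text)
    return root.findall("manifest/item")


title, hrefs = read_opf(SAMPLE_OPF)
print(title, hrefs)
print(manifest_items_without_namespace(SAMPLE_OPF))
```

The second helper returns an empty list because every element in the file lives in the OPF namespace, which is the kind of thing a regex never has to care about but a parser does.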
Old 07-21-2011, 11:17 AM   #6
jackie_w
Grand Sorcerer
I'm working on a personal project which could best be described as an extension to Tweak-epub. It started life as a result of an enhancement request (#8252) I made prior to the current bug system being introduced.

It aims to automate some of the things I find myself doing frequently whilst in Tweak-epub. It also allows me to use some calibre-conversion features without actually doing an epub-epub conversion. (I am currently able to use most of its features on epubs not yet in calibre, even if calibre is not open, so I haven't made good use of existing calibre classes and methods.)

As such I need the ability to analyse the opf, xpgt, css and html files and make tweaks as required. What I could really use is a CSS parser; regex works but is a bit messy.

I am aware that there is a fair amount of overlap with what kiwidude is doing with his 'Modify epub' plugin. However, I had already started this learning project before it became available, so I have continued with it. I'm not interested in any kind of bulk processing, it's purely a one-epub-at-a-time tweaker. It is unlikely to ever see the light of day in the general community as kiwidude is a far better programmer than I will ever be.
Old 07-21-2011, 11:19 AM   #7
jackie_w
Grand Sorcerer
Quote:
Originally Posted by kovidgoyal View Post
I would recommend using lxml, that's what the calibre conversion engine uses. EPUB is not actually HTML, it is XHTML, so you want a parser that has good XML support and support for namespaces.
I see you nipped in whilst I was trying to explain what I'm trying to do. Thank you
Old 07-21-2011, 11:23 AM   #8
kovidgoyal
creator of calibre
If you want to parse CSS, use cssutils, which is what calibre uses.
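For a feel of what that looks like compared with regex, here is a small sketch that changes one property on matching rules. It assumes the third-party cssutils package is importable (the import is guarded so the file still loads without it); the sample stylesheet and the helper name set_property are made up for illustration and are not taken from calibre's code.

```python
try:
    import cssutils  # third-party CSS parser, the one recommended above
except ImportError:  # let the sketch load even where cssutils is absent
    cssutils = None

# Hypothetical stylesheet used only for this example.
SAMPLE_CSS = "p { text-indent: 0; margin: 0 } h1 { margin: 1em 0 }"


def set_property(css_text, selector, name, value):
    """Parse css_text and set one property on rules whose selector matches exactly.

    Returns the modified stylesheet object.
    """
    if cssutils is None:
        raise RuntimeError("cssutils is not installed")
    sheet = cssutils.parseString(css_text)
    for rule in sheet:
        # Only style rules have selectors; skip @import, @media, comments, etc.
        if rule.type == rule.STYLE_RULE and rule.selectorText == selector:
            rule.style.setProperty(name, value)
    return sheet


if cssutils is not None:
    sheet = set_property(SAMPLE_CSS, "p", "text-indent", "1.2em")
    for rule in sheet:
        if rule.type == rule.STYLE_RULE:
            print(rule.selectorText, rule.style.getPropertyValue("text-indent"))
```

The stylesheet's cssText attribute then holds the serialised result, ready to be written back into the epub.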
Old 07-21-2011, 11:30 AM   #9
jackie_w
Grand Sorcerer
Thanks, I've obviously not been poking around the right places so far.
Old 07-21-2011, 11:40 AM   #10
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
lxml is much faster and uses less memory than BeautifulSoup.

Also look at calibre.ebooks.oeb.stylizer; it is used a lot during conversion and makes handling CSS easy. Look at oeb2html, which is part of HTMLZ, for a fairly simple example.
Old 07-21-2011, 12:30 PM   #11
jackie_w
Grand Sorcerer
Thanks for the practical examples. I find Python documentation hard-going when trying to get started.
Old 07-22-2011, 03:20 PM   #12
Agama
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
Quote:
Originally Posted by jackie_w View Post
It aims to automate some of the things I find myself doing frequently whilst in Tweak-epub. It also allows me to use some calibre-conversion features without actually doing an epub-epub conversion. (I am currently able to use most of its features on epubs not yet in calibre, even if calibre is not open, so I haven't made good use of existing calibre classes and methods.)

As such I need the ability to analyse the opf, xpgt, css and html files and make tweaks as required. What I could really use is a CSS parser; regex works but is a bit messy.
I'm really interested in what you're doing here as I am going down the same route, processing ePubs to automate post-conversion changes. So if you get things working I would be interested to see what you come up with.