#1 |
Grand Sorcerer
Posts: 6,249
Karma: 16539642
Join Date: Sep 2009
Location: UK
Device: ClaraHD, Forma, Libra2, Clara2E, LibraCol, PBTouchHD3
|
Calibre and parsers, info please
Please could someone guide me in the right direction.
I'm still feeling my way with Python and object-oriented programming in general. To date, when I have been analysing EPUB OPF files and occasionally HTML files, I have achieved what I needed using regex. However, on poking around the calibre source I see parsers being used, namely BeautifulSoup and lxml etree. I haven't used a parser before, but it looks like something I ought to explore. What I would like to know is: under what circumstances might I choose BeautifulSoup rather than lxml etree, and vice versa? |
#2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
#3 |
Grand Sorcerer
|
Thank you, Starson.
As I am currently interested in epub innards (not necessarily calibre-created) it sounds like I should start with lxml. |
#4 |
Wizard
|
I work on recipes a lot, so I'm much more familiar with BeautifulSoup than lxml, although I've run into both. I'm sure Kovid will comment if you're on the wrong track, although you might want to flesh out what you'll be doing.
|
#5 |
creator of calibre
Posts: 45,219
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I would recommend using lxml; that's what the calibre conversion engine uses. The content in an EPUB is not actually HTML but XHTML, so you want a parser that has good XML support and support for namespaces.
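To illustrate the point about namespaces, here is a minimal sketch of reading a title and the manifest out of an OPF file with namespace-aware lookups. The sample OPF, the `NS` prefix map and the `summarise_opf` helper are hypothetical and exist only for this example. The import falls back to the standard library's `xml.etree.ElementTree`, which offers the same `fromstring`/`find`/`findall` calls used here, in case lxml is not installed.

```python
try:
    from lxml import etree  # the parser recommended above
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a similar API

# Namespace URIs from the OPF and Dublin Core specifications.
# The prefixes on the left are local to this script and need not
# match the prefixes used inside the file being parsed.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, minimal OPF document used only for illustration.
SAMPLE_OPF = b"""<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
</package>"""


def summarise_opf(opf_bytes):
    """Return (title, {manifest id: href}) from raw OPF bytes."""
    root = etree.fromstring(opf_bytes)
    # Paths are resolved by namespace URI, not by the literal prefix text,
    # which is where a regex on "<dc:title>" tends to go wrong.
    title = root.find("opf:metadata/dc:title", NS)
    items = root.findall("opf:manifest/opf:item", NS)
    hrefs = {item.get("id"): item.get("href") for item in items}
    return (title.text if title is not None else None), hrefs


title, hrefs = summarise_opf(SAMPLE_OPF)
print(title)
print(hrefs)
```

Passing bytes rather than a decoded string avoids lxml's complaint about Unicode strings that carry an XML encoding declaration.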
|
#6 |
Grand Sorcerer
|
I'm working on a personal project which could best be described as an extension to Tweak-epub. It started life as a result of an enhancement request (#8252) I made prior to the current bug system being introduced.
It aims to automate some of the things I find myself doing frequently whilst in Tweak-epub. It also allows me to use some calibre-conversion features without actually doing an epub-to-epub conversion. (I am currently able to use most of its features on epubs not yet in calibre, even if calibre is not open, so I haven't made good use of existing calibre classes and methods.) As such, I need the ability to analyse the OPF, XPGT, CSS and HTML files and make tweaks as required. What I could really use is a CSS parser; regex works but is a bit messy.
I am aware that there is a fair amount of overlap with what kiwidude is doing with his 'Modify epub' plugin. However, I had already started this learning project before it became available, so I have continued with it. I'm not interested in any kind of bulk processing; it's purely a one-epub-at-a-time tweaker. It is unlikely to ever see the light of day in the general community, as kiwidude is a far better programmer than I will ever be. |
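For the case described above of analysing an epub outside calibre, a first step is usually finding the OPF inside the archive. The following is a rough sketch using only the standard library, not calibre's own container classes. It relies on the EPUB convention that `META-INF/container.xml` names the OPF in a `rootfile` element. The `find_opf_path` and `build_demo_epub` helpers and the in-memory demo archive are hypothetical, included only so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml in the EPUB container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

# Hypothetical container.xml for the demo archive below.
DEMO_CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""


def find_opf_path(epub_file):
    """Return the OPF path inside an EPUB given a filename or file-like object."""
    with zipfile.ZipFile(epub_file) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> entry")
    return rootfile.get("full-path")


def build_demo_epub():
    """Build a tiny in-memory archive (demo contents only, not a valid book)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", DEMO_CONTAINER)
        zf.writestr("OEBPS/content.opf", "<package/>")
    buf.seek(0)
    return buf


opf_path = find_opf_path(build_demo_epub())
print(opf_path)
```

With a real file, `find_opf_path("book.epub")` would be called with the path instead of the in-memory buffer, and the returned path could then be read from the same archive and handed to a parser.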
#7 | |
Grand Sorcerer
|
#8 |
creator of calibre
|
If you want to parse CSS, use cssutils, which is what calibre uses.
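As a rough idea of what that looks like compared with regex, here is a small sketch that lists which selectors set a given property. It assumes the third-party cssutils package is importable. The sample stylesheet and the `selectors_setting` helper are hypothetical, and the guarded import is only there so the snippet does not crash where cssutils is missing.

```python
import logging

try:
    import cssutils  # third-party CSS parser named in the post above
except ImportError:
    cssutils = None  # without it, the helper below simply returns None

# Hypothetical stylesheet used only for illustration.
SAMPLE_CSS = """
p { text-indent: 1.5em; margin: 0 }
p.first { text-indent: 0 }
h1 { font-size: 1.4em }
"""


def selectors_setting(css_text, prop):
    """Map each selector to its value for `prop` (None if cssutils is absent)."""
    if cssutils is None:
        return None
    cssutils.log.setLevel(logging.CRITICAL)  # keep parser warnings quiet
    sheet = cssutils.parseString(css_text)
    found = {}
    for rule in sheet.cssRules:
        # Skip comments, @import, @media and other non-style rules.
        if rule.type == rule.STYLE_RULE:
            value = rule.style.getPropertyValue(prop)
            if value:  # empty string means the property is not set
                found[rule.selectorText] = value
    return found


result = selectors_setting(SAMPLE_CSS, "text-indent")
print(result)
```

The same rule objects can be modified in place, for example with `rule.style.setProperty(...)`, and the sheet serialised again via `sheet.cssText`, which is the part that tends to get messy with regex.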
|
#9 |
Grand Sorcerer
|
Thanks, I've obviously not been poking around the right places so far.
|
#10 |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
lxml is much faster and uses less memory than BeautifulSoup.
Also look at calibre.ebooks.oeb.stylizer; it is used a lot during conversion and makes handling CSS easy. Look at oeb2html, which is part of HTMLZ, for a somewhat simple example. |
#11 |
Grand Sorcerer
|
Thanks for the practical examples. I find the Python documentation hard going when trying to get started.
|
#12 | |
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
|