#1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Oct 2013
Device: kindle
|
importing ebook and extracting content
Hi!
[Skip this if you are in a hurry... I have been using calibre for months (without any plans to dig into its code) and recently got the idea of an application that helps with learning vocabulary, using ebooks as a database of "in context" translations. Unfortunately, my development skills are a bit rusty and it is taking me longer than I thought to develop this Django application. My few attempts at writing an .fb2 paragraph and section extractor myself also showed that I would do better to reuse what already exists.

Anyway, enough context, let's get to the request itself: calibre is not only a great library manager and ebook converter, it also seems to be the Python reference for ebook content extraction. Unfortunately, it is not published as a standalone module, and its code base is huge!

My understanding is that every book is mapped to ebooks.oeb.base at some point in the conversion chain. So in your opinion, should I try to instantiate ebooks.oeb.base and use it to extract ebook information? If so, I would appreciate pointers to documentation or similar code, if you know of any.

Alternatively, I had a look at the calibre viewer, since it needs to access the ebook content (like my application): the load_ebook function in the calibre gui2 viewer main.py seems a good example. https://github.com/bibihoma/calibre/...viewer/main.py (load_ebook function). This suggests that I should rather use calibre.ebooks.oeb.iterator.book to navigate within a book. Any comments on which approach is best?

In case someone has read this post up to this point,] the short question is: given an ebook path, how do I load the ebook into a Python structure and access its chapters and paragraphs in sequence?

Thanks,
bibihoma |
#2 |
creator of calibre
Posts: 45,190
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Which approach you use depends on what you are trying to do. If all you want to do is extract the HTML content from the ebook, then either oeb.iterator or oeb.polish.container will work.
|
#3 |
Junior Member
|
Thanks for the quick answer.
Unfortunately, I am developing under Windows. I tried running setup.py after downloading the sources, but after installing a few dependencies (Qt, ...) I hit an "Unable to find vcvarsall.bat" error, which ended my hopes of testing your suggestion. Do you have any plans to package calibre into submodules that could be installed with a simple pip install?
$ python setup.py
Traceback (most recent call last):
  File "setup.py", line 13, in <module>
    import setup.commands as commands
  File "c:\Users\MCProject\calibre\setup\commands.py", line 34, in <module>
    from setup.extensions import Build
  File "c:\Users\MCProject\calibre\setup\extensions.py", line 16, in <module>
    from setup.build_environment import (chmlib_inc_dirs,
  File "c:\Users\MCProject\calibre\setup\build_environment.py", line 25, in <module>
    msvc.initialize()
  File "c:\Python27\lib\distutils\msvc9compiler.py", line 383, in initialize
    vc_env = query_vcvarsall(VERSION, plat_spec)
  File "c:\Python27\lib\distutils\msvc9compiler.py", line 271, in query_vcvarsall
    raise DistutilsPlatformError("Unable to find vcvarsall.bat")
distutils.errors.DistutilsPlatformError: Unable to find vcvarsall.bat
|
#4 |
creator of calibre
|
|
#5 |
Junior Member
|
Thanks, indeed reading AND applying the documentation helps. Sorry for having wasted your time... I will delete my post above, as it has nothing to do with the topic (and I am embarrassed by it).
Still, one more question: the documentation mentions that we can run a script with this syntax:
$ calibre-debug manage.py -- --runserver
I am not at all familiar with calibre-debug, but it does NOT seem to use the packages installed in my Python. I have Django installed on my Python 2.7, and the Django launch command works fine (manage.py runserver). However, running the Django launch code through calibre-debug fails when it tries to import the django package:
"Python function terminated unexpectedly No module named django.core.management (Error Code: 1)"
Is it because calibre-debug looks for packages in the calibre src folder only? If so, is your advice to copy the django package into the calibre src folder? Or is there an option that allows the use of installed Python packages? Thanks |
#6 |
creator of calibre
|
Yes, calibre-debug uses only the calibre src folder. You have several options to get around that. First run this:
calibre-debug -c "import sys; print sys.path"
If you put your django folder in one of the folders listed there, it will be used. Alternatively, you can modify manage.py, at the top, to do this:
import sys
sys.path.append('/path/to/django')
before importing any django modules. |
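For readers following along, here is a minimal sketch of the second option, wrapped in a small helper so that the folder is added only once even if the script is reloaded. The helper name `add_to_sys_path` is hypothetical (it is not part of calibre or Django), and `/path/to/django` is the placeholder path from the post above, to be replaced with the folder that actually contains the `django` package.

```python
import os
import sys


def add_to_sys_path(folder):
    """Append folder to sys.path once; return True if it was added."""
    folder = os.path.abspath(folder)
    if folder in sys.path:
        return False
    sys.path.append(folder)
    return True


# Placeholder path: point this at the folder that contains the django package.
add_to_sys_path('/path/to/django')
# Import django modules only after this point.
```

Because the folder is appended rather than inserted at the front, modules already visible to the interpreter keep priority over anything with the same name in the added folder.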
#7 |
Junior Member
|
Thanks Kovidgoyal!
I don't know how clean this solution is... but it works. In case someone needs to use Django and other libraries installed in Python, here is a simple script that adds all the Python site-packages folders to the path:
import os
import sys
lib_folder = "c:/Python27/Lib/site-packages"
# Add site-packages and every folder below it to the import path.
for root, dirs, files in os.walk(lib_folder):
    sys.path.append(root)
|
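A possibly cleaner variant, offered only as a sketch: adding the site-packages folder itself is normally enough for `import django` to resolve, whereas adding every subfolder also exposes package internals as top-level modules, which can shadow other modules of the same name. The standard library's `site.addsitedir` adds the folder and also processes any `.pth` files in it. The helper name `add_site_packages` is hypothetical, and this assumes the `site` module is importable in the interpreter that runs the script, which has not been checked under calibre-debug.

```python
import site


def add_site_packages(folder):
    """Register folder as a site directory.

    site.addsitedir appends the folder to sys.path (if it is not already
    known) and processes any .pth files found inside it.
    """
    site.addsitedir(folder)


# Example call, using the folder from the post above:
# add_site_packages("c:/Python27/Lib/site-packages")
```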
#8 |
Junior Member
|
Re Kovidgoyal,
Sorry to ask for your help once again, this time to go beyond my first major achievement: extracting the title of the ebook ;-) I am now trying to extract the content itself. Should I use load_html? If so, how can I instantiate a View to pass to this function? Code:
from calibre.ebooks.oeb.iterator.book import EbookIterator
from calibre.ebooks.oeb.display.webview import load_html
iterator = EbookIterator("C:/Users/MCProject/Dropbox/Colin/DjangAptana/mysite/kdfr.fb2")
iterator.__enter__()
logger.debug(iterator)
logger.debug(iterator.opf.title)
for doc in iterator.spine:
    print doc
    load_html(doc, view, codec=getattr(doc, 'encoding', 'utf-8'), mime_type=getattr(doc, 'mime_type', 'text/html'))
|
#9 |
creator of calibre
|
You don't load the HTML; you simply open the file and read it in to get the HTML:
html = open(doc, 'rb').read()
|
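To get from the raw HTML to the paragraphs asked about in the first post, one possible approach is sketched below. It uses only the Python 3 standard library (`html.parser`), not any calibre API; the names `ParagraphExtractor`, `paragraphs_from_html` and `paragraphs_from_files` are hypothetical. It assumes each entry handed to `paragraphs_from_files` is a path to an HTML file, as the entries of `iterator.spine` are used in this thread, and that the files are UTF-8 unless told otherwise.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def _flush(self):
        # Join the collected text and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def paragraphs_from_html(raw, encoding="utf-8"):
    """Decode raw bytes and return the list of paragraph strings."""
    parser = ParagraphExtractor()
    parser.feed(raw.decode(encoding, "replace"))
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs


def paragraphs_from_files(paths, encoding="utf-8"):
    """Yield (path, paragraphs) for each HTML file, in the order given."""
    for path in paths:
        with open(path, "rb") as f:
            yield path, paragraphs_from_html(f.read(), encoding)
```

With the iterator from the earlier post, something like `for path, paras in paragraphs_from_files(iterator.spine): ...` would then walk the book file by file. Text outside `<p>` elements, such as headings, is ignored by this sketch.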