importing ebook and extracting content

bibihoma · 10-24-2013, 10:48 AM

Hi!
[ skip this if you are in a hurry ....

I have been using calibre for months (without any plans to dig into its code) and recently got the idea of an application to help learn vocabulary, using ebooks as a database of "in context" translations.

Unfortunately, my development skills are a bit rusty and it is taking me longer than I thought to develop this django application. Also, my few attempts at developing an .fb2 paragraph and section extractor myself showed that I would be better off reusing what has already been done.

Anyway... enough context, let's get into the request itself: calibre is not only a great library/ebook converter/... , it also seems to be the python reference for ebook content extraction. Unfortunately, it is not published as a standalone module, and its code is just huge!

My understanding is that every book will be mapped to ebooks.oeb.base at some point in the conversion chain. So, in your opinion, should I try to instantiate ebooks.oeb.base and use it to extract ebook information? If so, I would appreciate it if you could point me to information or similar code that could help, if you know of any.

Alternatively, I had a look at the calibre viewer, since it needs to access the ebook content (like my application): the load_ebook function in the calibre gui2 viewer main.py seems a good example.
https://github.com/bibihoma/calibre/...viewer/main.py (load_ebook function). This suggests that I should rather use calibre.ebooks.oeb.iterator.book to navigate within a book.
Any comment on what is the best approach?

In case someone has read this post up to this point,]

the short question is: given an ebook path, how do I load the ebook into a python structure and access its chapters and paragraphs in sequence?

Thanks, bibihoma

kovidgoyal · 10-24-2013, 11:58 AM

What approach you use depends on what you are trying to do. If all you want to do is extract the html content from the ebook, then either oeb.iterator or oeb.polish.container will work.

bibihoma · 10-26-2013, 11:00 AM

Thanks for the quick answer.

Unfortunately, I am developing under Windows. I tried running setup.py after downloading the sources, but after installing a few dependencies (Qt, ....) I encountered an "Unable to find vcvarsall.bat" error, which ended my hopes of testing your suggestion.

Do you have any plans to package calibre into submodules that could be installed with a simple pip install?

$ python setup.py
Traceback (most recent call last):
  File "setup.py", line 13, in <module>
    import setup.commands as commands
  File "c:\Users\MCProject\calibre\setup\commands.py", line 34, in <module>
    from setup.extensions import Build
  File "c:\Users\MCProject\calibre\setup\extensions.py", line 16, in <module>
    from setup.build_environment import (chmlib_inc_dirs,
  File "c:\Users\MCProject\calibre\setup\build_environment.py", line 25, in <module>
    msvc.initialize()
  File "c:\Python27\lib\distutils\msvc9compiler.py", line 383, in initialize
    vc_env = query_vcvarsall(VERSION, plat_spec)
  File "c:\Python27\lib\distutils\msvc9compiler.py", line 271, in query_vcvarsall
    raise DistutilsPlatformError("Unable to find vcvarsall.bat")
distutils.errors.DistutilsPlatformError: Unable to find vcvarsall.bat

kovidgoyal · 10-26-2013, 11:19 AM

http://manual.calibre-ebook.com/deve...-your-projects

bibihoma · 10-27-2013, 09:03 AM

Thanks, indeed reading AND applying the documentation helps. Sorry for having wasted your time... I will delete my post above as it has nothing to do with the topic (and it is embarrassing for me).

Still one more question: the documentation mentions that we can run a script with this syntax:
$ calibre-debug manage.py -- --runserver

I am not familiar at all with calibre-debug, but it does NOT seem to use the packages installed in python. I have django installed on my python 2.7, and the django launch command works fine (manage.py runserver).

However, wrapping the django launch in calibre-debug fails when it tries to import the django package.
"Python function terminated unexpectedly
No module named django.core.management (Error Code: 1)"

Is it because calibre-debug only looks for packages in the calibre src folder?
If so, is your advice to copy the django package into the calibre src folder? Or is there an option that can be activated to allow the use of installed python packages?

Thanks

kovidgoyal · 10-27-2013, 09:18 AM

Yes, calibre-debug uses only the calibre src folder. You have many options to get around that. First run this:

calibre-debug -c "import sys; print sys.path"

If you put your django folder in some folder listed there, it will be used.

Alternately, you can modify manage.py at the top, to do this:

import sys
sys.path.append('/path/to/django')

before importing any django modules.
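
A quick way to check whether such a path change took effect is to ask Python where it would load a package from. This is only a minimal sketch and it assumes Python 3 (`importlib.util.find_spec` is not available in the Python 2.7 discussed in this thread); the helper name `where_is` is made up for illustration.

```python
import importlib.util
import sys


def where_is(module_name):
    """Return where a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    # For ordinary modules and packages this is a file path.
    return spec.origin


# Show the search path, then report where (or whether) django would be found.
print(sys.path)
print(where_is("django"))
```

If the second line prints None, the folder holding django is not on the search path of the interpreter running the script.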

bibihoma · 10-28-2013, 10:32 AM

Thanks Kovidgoyal!

I don't know how clean this solution is... but it works.

In case someone has to use django and some more libs installed in python, here is a simple script that includes all the python site-packages:

import os
import sys

lib_folder = "c:/Python27/Lib/site-packages"
for root, dirs, files in os.walk(lib_folder):
    sys.path.append(root)  # root is already the full path of each folder
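
Walking the whole tree puts every subdirectory on sys.path, which is usually more than is needed: packages are found through the site-packages directory itself. As an alternative sketch (not what was used in this thread), the standard library's `site.addsitedir` adds a single directory and also processes any `.pth` files in it. The demo below builds a throwaway directory containing a made-up package `demo_pkg` so that it runs anywhere; in manage.py the argument would be the real site-packages path instead.

```python
import importlib
import os
import site
import tempfile


def make_fake_site_packages():
    """Create a temp directory holding one tiny package, demo_pkg (hypothetical)."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, "demo_pkg")
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("VALUE = 42\n")
    return root


fake_site = make_fake_site_packages()

# One call is enough: only the directory that contains the packages is added.
site.addsitedir(fake_site)
importlib.invalidate_caches()

demo_pkg = importlib.import_module("demo_pkg")
print(demo_pkg.VALUE)
```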

bibihoma · 10-29-2013, 09:59 AM

Re Kovidgoyal,

Sorry, once again I'll ask for your help to go beyond this first major achievement: extracting the title of the ebook ;-)
I am now trying to extract the content itself. Should I use load_html?
If so, how can I instantiate a View to pass to this function?

Code:

 
import logging

from calibre.ebooks.oeb.iterator.book import EbookIterator
from calibre.ebooks.oeb.display.webview import load_html

logger = logging.getLogger(__name__)

iterator = EbookIterator("C:/Users/MCProject/Dropbox/Colin/DjangAptana/mysite/kdfr.fb2")
iterator.__enter__()
logger.debug(iterator)
logger.debug(iterator.opf.title)
for doc in iterator.spine:
    print doc
    # `view` is not defined anywhere: this is what the question above is about
    load_html(doc, view, codec=getattr(doc, 'encoding', 'utf-8'), mime_type=getattr(doc, 'mime_type', 'text/html'))

Thanks, Bibihoma

kovidgoyal · 10-29-2013, 10:55 AM

You don't load the html; you simply open the file and read it in to get the html.

html = open(doc, 'rb').read()
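
Once the raw HTML of each spine item is in hand, the paragraphs can be pulled out in document order. The following is only a sketch of one way to do that with the standard library, written for Python 3 (`html.parser`; on the Python 2.7 of this thread the module is called `HTMLParser`). The names `ParagraphCollector`, `paragraphs_from_html` and `iter_paragraphs` are made up for illustration, and `spine_paths` is assumed to be a sequence of file paths such as the items of `iterator.spine`.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def _flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []


def paragraphs_from_html(raw, encoding="utf-8"):
    """Return the paragraph texts found in raw HTML bytes."""
    collector = ParagraphCollector()
    collector.feed(raw.decode(encoding, "replace"))
    collector.close()
    return collector.paragraphs


def iter_paragraphs(spine_paths, encoding="utf-8"):
    """Yield paragraphs from each file in spine order."""
    for path in spine_paths:
        with open(path, "rb") as f:
            raw = f.read()
        for paragraph in paragraphs_from_html(raw, encoding):
            yield paragraph


sample = (b"<html><body><h1>Chapter 1</h1>"
          b"<p>First <b>bold</b> line.</p><p>Second\n line.</p></body></html>")
print(paragraphs_from_html(sample))
```

Inline markup such as `<b>` is dropped and only its text is kept. Headings are ignored here; treating `<h1>` to `<h6>` the same way as `<p>` would be one way to pick up chapter titles, if the book marks them up like that.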
