Old 03-30-2011, 03:11 PM   #1
kiwidude
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
How to use new PDF code in reflow.py?

Apologies for not being able to figure this out, but I have never looked through the layers of the conversion code before, and there is an awful lot of it.

For the Extract ISBN plugin, I followed a suggestion to use code like this to convert the contents of a book format into a list of text strings, one per spine file:
Code:
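    # Assumed module-level imports (paths as of the 2011-era calibre source):
    #   from calibre.ebooks.oeb.iterator import EbookIterator
    #   from calibre.ebooks.conversion.preprocess import HTMLPreProcessor
    def open_book(self, pathtoebook):
        # Unpack the book so each spine item is available as a file on disk
        self.iterator = EbookIterator(pathtoebook)
        self.iterator.__enter__(only_input_plugin=True)
        text = []
        preprocessor = HTMLPreProcessor(None, False)
        for path in self.iterator.spine:
            # Read each spine file, then let the preprocessor clean up the markup
            with open(path, 'rb') as f:
                html = f.read().decode('utf-8', 'replace')
            html = preprocessor(html, get_preprocess_html=True)
            text.append(html)
        return text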
Functionally it "works", but for PDFs it can be pretty slow, and John/Kovid advised me to use some of the code in reflow.py instead. From Kovid's suggestion, I presume that means writing my own version of PDFDocument so that I can scan just, say, the first 10 pages. So I'm assuming I would keep the above code for all the non-PDF formats and add something new just for PDF.

I've had a quick look into this but quite frankly I'm rather unsure of where to begin. The above code does an awful lot of "stuff", likely most of which is not relevant to the task I need.

Would anyone be willing to give me a few hints? I did a little experimenting to see what the existing PDFDocument code does, but it has so many dependencies that it wasn't easy to bypass them and get to the lower level. For instance, I can have some code like this in a test program:
Code:
from calibre.ebooks.pdf.reflow import PDFDocument
from calibre.ebooks.conversion.plumber import OptionValues
from calibre.utils.logging import Log
from calibre.constants import plugins
pdfreflow, pdfreflow_err = plugins['pdfreflow']

input_file = 'd:\\test2.pdf'

opts = OptionValues()
opts.verbose = 3 # This has to have a value or else line 505 in reflow.py blows up
opts.debug_dir = 'd:\\temp'

log = Log()
# reflow() appears to write index.xml into the current working directory
with open(input_file, 'rb') as stream:
    pdfreflow.reflow(stream.read())
with open('index.xml', 'rb') as f:
    xml = f.read()
PDFDocument(xml, opts, log)
But that leaves me with issues like what to pass in for the OptionValues: you have to set a certain level of .verbose and a .debug_dir, or else reflow.py will not execute line 500, which then means line 505 will blow up when it dumps regions. Plus I am obviously missing some code to get this all to run in a temporary working directory, and I am not sure what to do "next" to get to the equivalent of the "spine" list of file contents that the previous code gives me...

Help much appreciated!
Old 03-30-2011, 04:22 PM   #2
kovidgoyal
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Code:
from calibre.constants import plugins
from lxml import etree

pdfreflow, pdfreflow_err = plugins['pdfreflow']
input_file = 'd:\\test2.pdf'
with open(input_file, 'rb') as stream:
    pdfreflow.reflow(stream.read())
with open('index.xml', 'rb') as f:
    xml = f.read()
root = etree.fromstring(xml)
# tostring() needs the element to serialize as its first argument
txt = etree.tostring(root, method='text', encoding=unicode)

And run your regexp on txt.
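
As a rough illustration of those last two steps (flatten the XML to text, then search it), here is a minimal Python 3 sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without calibre. The sample XML is made up and is not the real layout of index.xml, and ISBN_RE and find_isbns are hypothetical names. The pattern is deliberately loose and does not validate check digits.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, deliberately loose pattern: an "ISBN" label followed by
# 10 or 13 digits that may be separated by hyphens or spaces.
ISBN_RE = re.compile(r'ISBN(?:-1[03])?:?\s*([0-9][0-9\- ]{8,15}[0-9Xx])')


def find_isbns(xml_bytes):
    """Flatten an XML document to plain text and return candidate ISBNs."""
    root = ET.fromstring(xml_bytes)
    # itertext() yields every text node in document order
    txt = ''.join(root.itertext())
    found = []
    for match in ISBN_RE.finditer(txt):
        # Drop separators so only the digits (and a possible X) remain
        digits = re.sub(r'[\- ]', '', match.group(1)).upper()
        if len(digits) in (10, 13):
            found.append(digits)
    return found


# Made-up sample input; the real index.xml will look different.
SAMPLE_XML = b"""<doc>
  <page number="1"><text>Copyright 2011</text></page>
  <page number="2"><text>ISBN 978-0-306-40615-7</text></page>
</doc>"""

if __name__ == '__main__':
    print(find_isbns(SAMPLE_XML))
```

A real plugin would probably want to verify the check digit as well before trusting a match.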


Note that in the next calibre release you will be able to do

pdfreflow.reflow(stream, 1, 10)

to only generate xml corresponding to the first 10 pages
Old 03-30-2011, 04:43 PM   #3
meme
Sigil developer
Posts: 1,275
Karma: 1101600
Join Date: Jan 2011
Location: UK
Device: Kindle PW, K4 NT, K3, Kobo Touch
On a somewhat similar issue, is there an easy way to get the Author or Date of a PDF file? I may not want to do it because of the time it takes, but as I can easily get Mobi authors, dates, etc., it would be nice to also get the info for PDFs that have it. I don't know if there is a standard, though. I can fall back to the author info saved for the book in Calibre, but in some cases I need to read a PDF that is not in Calibre.
Old 03-30-2011, 04:43 PM   #4
kiwidude
Calibre Plugins Developer
Brilliant, thanks Kovid. I got close; it appears I should have just skipped experimenting with PDFDocument. Two further questions if I may...

(1) Correct me if I'm wrong, but pdfreflow seems to write its output to the current working directory. I can't find "the magic" that would allow using a PersistentTemporaryDirectory as the current working directory. I looked for something like os.chdir in the Calibre code but couldn't spot anything. What is the recommended way?

(2) With your changes to pdfreflow, will it be possible to either know how many pages there are, or be able to specify in some way a range for the end pages? I have found quite a number of PDFs where the ISBN is at the end unfortunately, so in an ideal world I would scan say the first 10 pages and the last 5. If there is no way to do that without scanning the whole document then so be it.
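
To make the request in (2) concrete, here is a small Python sketch of the arithmetic involved. It assumes the page count is already known, which is exactly what is in question here, and pages_to_scan is a hypothetical helper, not part of calibre or pdfreflow.

```python
def pages_to_scan(page_count, first=10, last=5):
    """Return the 1-based page numbers of the first `first` and last `last` pages.

    Short documents are handled by merging the two ranges, so no page
    is listed twice and no page number exceeds page_count.
    """
    if page_count <= 0:
        return []
    head = range(1, min(first, page_count) + 1)
    tail = range(max(page_count - last + 1, 1), page_count + 1)
    return sorted(set(head) | set(tail))


if __name__ == '__main__':
    print(pages_to_scan(100))
    print(pages_to_scan(12))
```

For a 12-page document the two ranges overlap, so the whole document ends up being scanned once.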
Old 03-30-2011, 05:13 PM   #5
kovidgoyal
creator of calibre
1) from calibre import CurrentDir

with CurrentDir('whatever'):
....

2) Open a ticket for it and I'll add support for negative numbers to the indexing
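
For readers without the calibre source at hand, the pattern in 1) is a context manager that switches the working directory and restores it on exit. The following is a standard-library sketch of that idea in Python 3, not calibre's actual CurrentDir implementation. The names working_dir, write_output and run_in_temp_dir are hypothetical, and write_output stands in for a call such as pdfreflow.reflow that writes index.xml into the current directory.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_dir(path):
    """Temporarily make `path` the current working directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield path
    finally:
        # Always restore the original directory, even after an exception
        os.chdir(previous)


def write_output():
    # Stand-in for code that writes its result into the current directory
    with open('index.xml', 'w', encoding='utf-8') as f:
        f.write('<doc/>')


def run_in_temp_dir():
    """Run write_output() inside a throwaway directory and read the result back."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        with working_dir(tmp):
            write_output()
            with open('index.xml', encoding='utf-8') as f:
                data = f.read()
    # The temporary directory is gone and the original cwd is restored
    return data, os.getcwd() == start


if __name__ == '__main__':
    print(run_in_temp_dir())
```

Leaving the working_dir block before the temporary directory is cleaned up matters on Windows, where a directory that is still the current working directory cannot be deleted.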


@meme: from calibre.ebooks.metadata.meta import get_metadata