How to use new PDF code in reflow.py?

kiwidude · 03-30-2011, 03:11 PM

Apologies for not being able to figure this out, but I have never looked through the layers of the conversion code before and there is an awful lot of it.

For the Extract ISBN plugin, I followed a suggestion to use some code like this to convert the contents of a book format into a list of strings, one per file in the spine:

Code:

    def open_book(self, pathtoebook):
        self.iterator = EbookIterator(pathtoebook)
        self.iterator.__enter__(only_input_plugin=True)
        text = []
        preprocessor = HTMLPreProcessor(None, False)
        for path in self.iterator.spine:
            html = open(path, 'rb').read().decode('utf-8', 'replace')
            html = preprocessor(html, get_preprocess_html=True)
            text.append(html)
        return text

Functionality-wise it "works", but for PDFs it can be pretty slow, and John/Kovid advised me to use some of the code in reflow.py instead. From Kovid's suggestion I presume that means writing my own version of PDFDocument so that I can scan just, say, the first 10 pages. So I assume I would keep the above code for all the non-PDF formats and have something new for PDF only.

I've had a quick look into this but quite frankly I'm rather unsure of where to begin. The above code does an awful lot of "stuff", likely most of which is not relevant to the task I need.

Would anyone be willing to give me a few hints? I did a little experimenting just to see what the existing code for PDFDocument does, but found there were so many dependencies that it wasn't easy to bypass them and get to the lower level. For instance, I can have some code like this in a test program:

Code:

import os
from calibre.ebooks.pdf.reflow import PDFDocument
from calibre.ebooks.conversion.plumber import OptionValues
from calibre.utils.logging import Log
from calibre.constants import plugins
pdfreflow, pdfreflow_err = plugins['pdfreflow']

input_file = 'd:\\test2.pdf'

opts = OptionValues()
opts.verbose = 3 # This has to have a value or else line 505 in reflow.py blows up
opts.debug_dir = 'd:\\temp'

log = Log()
stream = open(input_file, 'rb')
pdfreflow.reflow(stream.read())
xml = open('index.xml', 'rb').read()
PDFDocument(xml, opts, log)

But that leaves me with issues like what to pass in for the OptionValues: you have to set a certain level of .verbose and a .debug_dir, or else reflow.py will not execute line 500, which then means line 505 will blow up when it dumps regions. I am obviously missing some code to get this all to run in a temporary working directory, and I am not sure what to do "next" to get to the equivalent of the "spine" list of file contents that the previous code gives me...

Help much appreciated!

kovidgoyal · 03-30-2011, 04:22 PM

Code:

from calibre.constants import plugins
from lxml import etree

pdfreflow, pdfreflow_err = plugins['pdfreflow']
input_file = 'd:\\test2.pdf'
stream = open(input_file, 'rb')
pdfreflow.reflow(stream.read())
xml = open('index.xml', 'rb').read()
root = etree.fromstring(xml)
# tostring() needs the element to serialize as its first argument
txt = etree.tostring(root, method='text', encoding=unicode)

And run your regexp on txt.

Note that in the next calibre release you will be able to do

pdfreflow.reflow(stream, 1, 10)

to only generate xml corresponding to the first 10 pages.
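
As a rough illustration of the "run your regexp on txt" step, here is a minimal sketch. It makes several assumptions that are not from this thread: the standard library's ElementTree is swapped in for lxml so the snippet runs without calibre, the sample XML is made up rather than real pdfreflow output, and `ISBN_RE`, `xml_to_text` and `find_isbns` are hypothetical names. The pattern only catches numbers that follow an "ISBN" label and does not validate check digits.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: an "ISBN" label followed by 10 or 13 digits that
# may be separated by spaces or hyphens. Check digits are not validated.
ISBN_RE = re.compile(r'ISBN(?:-1[03])?:?\s*([0-9][0-9 -]{8,15}[0-9Xx])')


def xml_to_text(xml_bytes):
    """Return all text nodes of the XML document, one per line."""
    root = ET.fromstring(xml_bytes)
    return '\n'.join(root.itertext())


def find_isbns(text):
    """Return the distinct ISBN candidates found in text, separators removed."""
    found = []
    for match in ISBN_RE.finditer(text):
        digits = re.sub(r'[ -]', '', match.group(1)).upper()
        if len(digits) in (10, 13) and digits not in found:
            found.append(digits)
    return found


# Made-up stand-in for the index.xml that the reflow step produces.
sample = (b'<pdfreflow><page><text>Copyright 2011</text>'
          b'<text>ISBN 978-0-306-40615-7</text></page></pdfreflow>')
print(find_isbns(xml_to_text(sample)))
```

Joining the text nodes with newlines keeps a number at the end of one node from running into the start of the next, which the plain `method='text'` serialization would not do.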

meme · 03-30-2011, 04:43 PM

On a somewhat similar issue, is there a way to easily get the Author or Date of a PDF file? I may not want to do it because of the time issue, but as I can easily get Mobi author/dates/etc., it might be nice to also get the info for PDFs that have it. But I don't know if there is a standard. I can fall back to using the author info saved for the book in Calibre, but in some cases I need to read a PDF that is not in Calibre.

kiwidude · 03-30-2011, 04:43 PM

Brilliant, thanks Kovid. I got close; it appears I should have just ignored experimenting with PDFDocument. Two further questions if I may...

(1) Correct me if wrong but pdfreflow seems to use the current working directory to produce its output in. I can't find "the magic" which would allow using a PersistentTemporaryDirectory as the current working directory? I looked for something like os.chdir in the Calibre code but couldn't spot anything - what is the recommended way?

(2) With your changes to pdfreflow, will it be possible to either know how many pages there are, or be able to specify in some way a range for the end pages? I have found quite a number of PDFs where the ISBN is at the end unfortunately, so in an ideal world I would scan say the first 10 pages and the last 5. If there is no way to do that without scanning the whole document then so be it.

kovidgoyal · 03-30-2011, 05:13 PM

1) from calibre import CurrentDir

with CurrentDir('whatever'):
....

2) Open a ticket for it and I'll add support for negative numbers to the indexing

@meme: from calibre.ebooks.metadata.meta import get_metadata
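
For point 1, the idea of running the reflow step inside a scratch directory can be sketched with the standard library alone. This is an assumption-laden stand-in, not calibre's own code: `working_dir` is a hypothetical helper that mimics what a "change directory, then restore" context manager such as CurrentDir is used for here, `tempfile.TemporaryDirectory` replaces PersistentTemporaryDirectory, and `produce` is a placeholder for the call that writes index.xml into the current working directory.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_dir(path):
    """Temporarily make path the current working directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield path
    finally:
        # Always restore the old directory, even if the body raised.
        os.chdir(previous)


def run_in_scratch_dir(produce):
    """Call produce() inside a throwaway directory and return index.xml's bytes."""
    with tempfile.TemporaryDirectory() as scratch:
        with working_dir(scratch):
            produce()  # placeholder for the step that writes index.xml here
            with open('index.xml', 'rb') as f:
                return f.read()


def fake_reflow():
    # Made-up producer so the sketch runs without calibre installed.
    with open('index.xml', 'wb') as f:
        f.write(b'<pdfreflow/>')


xml = run_in_scratch_dir(fake_reflow)
```

The working directory is restored before the temporary directory is cleaned up, so the cleanup never tries to delete the directory the process is still sitting in.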
