MobileRead Forums - View Single Post

kiwidude · 03-30-2011, 03:11 PM

Apologies for not being able to figure this out, but I have never looked through the layers of the conversion code before, and there is an awful lot of it.

For the Extract ISBN plugin, I followed a suggestion to use some code like this to convert the contents of a book format into a list of text, one entry per file in the book:

Code:

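    # Import paths as they are in calibre at the time of writing
    from calibre.ebooks.oeb.iterator import EbookIterator
    from calibre.ebooks.conversion.preprocess import HTMLPreProcessor

    def open_book(self, pathtoebook):
        # Run only the input plugin to unpack the book into its spine files
        self.iterator = EbookIterator(pathtoebook)
        self.iterator.__enter__(only_input_plugin=True)
        text = []
        preprocessor = HTMLPreProcessor(None, False)
        for path in self.iterator.spine:
            # Read each spine file and close it promptly
            with open(path, 'rb') as f:
                html = f.read().decode('utf-8', 'replace')
            html = preprocessor(html, get_preprocess_html=True)
            text.append(html)
        return text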

Functionality-wise it "works", but for PDFs it can be pretty slow, and John/Kovid advised me to use some of the code in reflow.py instead. From Kovid's suggestion, I presume that means writing my own version of PDFDocument so that I can scan just, say, the first 10 pages. So I'm assuming I would keep the above code for all the non-PDF formats and have something new just for PDF.
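A minimal sketch of that split, to make the idea concrete. Everything named here is hypothetical: extract_generic stands in for the open_book() code above, iter_pdf_pages stands in for whatever reflow.py-based page reader ends up being written, and the limit of 10 pages is just the figure mentioned above. The only point is that a lazy page iterator lets the PDF branch stop early.

```python
import os
from itertools import islice

MAX_PDF_PAGES = 10  # assumption: only the first few pages need scanning


def extract_generic(path):
    """Hypothetical placeholder for the EbookIterator-based open_book()."""
    raise NotImplementedError


def iter_pdf_pages(path):
    """Hypothetical placeholder: lazily yield the text of each PDF page."""
    raise NotImplementedError


def get_text(path, generic=extract_generic, pdf_pages=iter_pdf_pages,
             max_pages=MAX_PDF_PAGES):
    """Return a list of text chunks, choosing the reader by file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == '.pdf':
        # islice stops pulling pages once max_pages have been read
        return list(islice(pdf_pages(path), max_pages))
    return generic(path)
```

Passing the two readers in as parameters keeps the dispatch testable without calibre installed; in the plugin they would simply be the real implementations.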

I've had a quick look into this, but quite frankly I'm rather unsure of where to begin. The above code does an awful lot of "stuff", most of which is likely not relevant to what I need.

Would anyone be willing to give me a few hints? I did a little experimenting just to see what the existing code for PDFDocument does, but found there were so many dependencies that it wasn't very easy to bypass them and get to the lower level. For instance, I can have some code like this in a test program:

Code:

from calibre.ebooks.pdf.reflow import PDFDocument
from calibre.ebooks.conversion.plumber import OptionValues
from calibre.utils.logging import Log
from calibre.constants import plugins
pdfreflow, pdfreflow_err = plugins['pdfreflow']
if pdfreflow_err:
    raise RuntimeError(pdfreflow_err)  # the native plugin failed to load

input_file = 'd:\\test2.pdf'

opts = OptionValues()
opts.verbose = 3 # This has to have a value or else line 505 in reflow.py blows up
opts.debug_dir = 'd:\\temp'

log = Log()
# reflow() appears to write index.xml into the current working directory
with open(input_file, 'rb') as stream:
    pdfreflow.reflow(stream.read())
with open('index.xml', 'rb') as f:
    xml = f.read()
PDFDocument(xml, opts, log)

But that leaves me with issues like what to pass in for the OptionValues. You have to set a certain level of .verbose and a .debug_dir, or else reflow.py will not execute line 500, which then means line 505 will blow up when it dumps regions. Plus I am obviously missing some code to get this all to run in a temporary working directory, and I am not sure what to do "next" to get to the equivalent of the "spine" list of file contents the previous code gives me...
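For the temporary working directory part, one possible approach is a small context manager built only on the standard library. This is a sketch, not how calibre itself does it: temp_working_dir is a made-up name, and calibre's own helpers in calibre.ptempfile may well be a better fit inside a plugin.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_working_dir(prefix='pdfreflow_'):
    """Switch into a fresh temp directory, then restore and clean up."""
    old_cwd = os.getcwd()
    tdir = tempfile.mkdtemp(prefix=prefix)
    os.chdir(tdir)
    try:
        yield tdir
    finally:
        # Leave the directory before deleting it (required on Windows)
        os.chdir(old_cwd)
        shutil.rmtree(tdir, ignore_errors=True)
```

The reflow call and the read of index.xml would then both go inside a "with temp_working_dir():" block, so the generated files never land in the directory the test program was started from.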

Help much appreciated!

Thread: How to use new PDF code in reflow.py? (post #1)