03-30-2011, 03:11 PM | #1 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
How to use new PDF code in reflow.py?
Apologies for not being able to figure this out, but I have never looked through the layers of the conversion code before and there is an awful lot of it.
For the Extract ISBN plugin, I followed a suggestion to use code like this to convert the contents of a book format into a list of text strings, one per file in the book: Code:
def open_book(self, pathtoebook):
    self.iterator = EbookIterator(pathtoebook)
    self.iterator.__enter__(only_input_plugin=True)
    text = []
    preprocessor = HTMLPreProcessor(None, False)
    for path in self.iterator.spine:
        html = open(path, 'rb').read().decode('utf-8', 'replace')
        html = preprocessor(html, get_preprocess_html=True)
        text.append(html)
    return text

I've had a quick look into this, but quite frankly I'm rather unsure of where to begin. The above code does an awful lot of "stuff", most of which is likely not relevant to the task I need. Would anyone be willing to give me a few hints? I did a little experimenting just to see what the existing code for PDFDocument does, but found there were so many dependencies that it wasn't easy to bypass them and get to the lower level. For instance, I can have some code like this in a test program: Code:
import os
from calibre.ebooks.pdf.reflow import PDFDocument
from calibre.ebooks.conversion.plumber import OptionValues
from calibre.utils.logging import Log
from calibre.constants import plugins
pdfreflow, pdfreflow_err = plugins['pdfreflow']

input_file = 'd:\\test2.pdf'
opts = OptionValues()
opts.verbose = 3
# This has to have a value or else line 505 in reflow.py blows up
opts.debug_dir = 'd:\\temp'
log = Log()
stream = open(input_file, 'rb')
pdfreflow.reflow(stream.read())
xml = open('index.xml', 'rb').read()
PDFDocument(xml, opts, log)

Help much appreciated! |
03-30-2011, 04:22 PM | #2 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
[code]
from calibre.constants import plugins
pdfreflow, pdfreflow_err = plugins['pdfreflow']

input_file = 'd:\\test2.pdf'
stream = open(input_file, 'rb')
# reflow() writes index.xml into the current working directory
pdfreflow.reflow(stream.read())
xml = open('index.xml', 'rb').read()

from lxml import etree
root = etree.fromstring(xml)
# Flatten the whole tree to plain text
txt = etree.tostring(root, method='text', encoding=unicode)
[/code]
And run your regexp on txt. Note that in the next calibre release you will be able to do pdfreflow.reflow(stream, 1, 10) to only generate xml corresponding to the first 10 pages |
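As a rough illustration of the last step above (flattening the XML to text and running a regexp over it), here is a minimal self-contained sketch. It assumes Python 3 and swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without calibre. The sample XML, the names extract_text and find_isbn_candidates, and the pattern itself are all made up for illustration; this is not the regexp the Extract ISBN plugin uses.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, deliberately loose pattern: 10 to 13 digits, optionally
# separated by hyphens or spaces, where the last character may be X.
ISBN_CANDIDATE = re.compile(r'\b(?:\d[\s-]?){9,12}[\dXx]\b')


def extract_text(xml_bytes):
    """Return all text content of the XML document as one string."""
    root = ET.fromstring(xml_bytes)
    return ''.join(root.itertext())


def find_isbn_candidates(text):
    """Return 10- or 13-character candidates found in text (no checksum check)."""
    found = []
    for match in ISBN_CANDIDATE.finditer(text):
        digits = re.sub(r'[\s-]', '', match.group(0))
        if len(digits) in (10, 13):
            found.append(digits.upper())
    return found


# Made-up stand-in for the index.xml that pdfreflow writes.
SAMPLE_XML = b'<pdfreflow><page><text>ISBN 978-0-306-40615-7</text></page></pdfreflow>'
```

Calling find_isbn_candidates(extract_text(SAMPLE_XML)) gives the candidate with separators stripped. A real plugin would presumably also validate the check digit before trusting a match.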
03-30-2011, 04:43 PM | #3 |
Sigil developer
Posts: 1,275
Karma: 1101600
Join Date: Jan 2011
Location: UK
Device: Kindle PW, K4 NT, K3, Kobo Touch
|
On a somewhat similar issue, is there an easy way to get the Author or Date of a PDF file? I may not want to do it because of the time it takes, but since I can easily get Mobi author/dates/etc., it would be nice to also get that info for PDFs that have it. I don't know if there is a standard, though. I can fall back to using the author info saved for the book in Calibre, but in some cases I need to read a PDF that is not in Calibre.
|
03-30-2011, 04:43 PM | #4 |
Calibre Plugins Developer
Posts: 4,637
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Brilliant, thanks Kovid. I got close; it appears I should have just skipped experimenting with PDFDocument. Two further questions if I may...
(1) Correct me if I'm wrong, but pdfreflow seems to write its output to the current working directory. I can't find "the magic" that would allow using a PersistentTemporaryDirectory as the current working directory. I looked for something like os.chdir in the Calibre code but couldn't spot anything. What is the recommended way?
(2) With your changes to pdfreflow, will it be possible either to know how many pages there are, or to specify in some way a range for the end pages? Unfortunately I have found quite a number of PDFs where the ISBN is at the end, so in an ideal world I would scan, say, the first 10 pages and the last 5. If there is no way to do that without scanning the whole document then so be it. |
03-30-2011, 05:13 PM | #5 |
creator of calibre
Posts: 43,858
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
1) from calibre import CurrentDir
with CurrentDir('whatever'): ....
2) Open a ticket for it and I'll add support for negative numbers to the indexing
@meme: from calibre.ebooks.metadata.meta import get_metadata |
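For readers wondering what the answer to (1) amounts to, the general pattern is a context manager that switches the working directory and restores it afterwards. The sketch below is a stdlib-only stand-in, assuming Python 3. It is not calibre's implementation of CurrentDir; working_directory, run_in_temp_dir and fake_reflow are hypothetical names, and tempfile.TemporaryDirectory is used in place of calibre's PersistentTemporaryDirectory.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily make path the current working directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield path
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(previous)


def run_in_temp_dir(write_output):
    """Call write_output() inside a throwaway directory; return the files it created."""
    with tempfile.TemporaryDirectory() as tmp:
        with working_directory(tmp):
            write_output()
        # Back in the original directory here, so tmp can be listed and removed.
        return sorted(os.listdir(tmp))


def fake_reflow():
    # Stand-in for a tool that, like pdfreflow, writes into the current directory.
    with open('index.xml', 'w') as f:
        f.write('<pdfreflow/>')
```

With this shape, anything that writes into the current directory (here fake_reflow) leaves its output in the temporary directory rather than wherever the program was started, and the caller's working directory is unchanged afterwards.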