MobileRead Forums - View Single Post

kovidgoyal · 03-30-2011, 05:22 PM

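[code]
from lxml import etree
from calibre.constants import plugins

# plugins[...] returns the compiled module plus an error string (set if loading failed)
pdfreflow, pdfreflow_err = plugins['pdfreflow']
input_file = 'd:\\test2.pdf'
stream = open(input_file, 'rb')
pdfreflow.reflow(stream.read())
# The XML is read back from index.xml after reflow() runs
with open('index.xml', 'rb') as f:
    xml = f.read()
root = etree.fromstring(xml)
# tostring() needs the element to serialize as its first argument
txt = etree.tostring(root, method='text', encoding=unicode)
[/code]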

And run your regexp on txt.
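
As a rough illustration of that last step, here is a minimal sketch that runs without calibre. It swaps in the standard library's xml.etree.ElementTree for lxml and targets Python 3. The sample XML, the helper names xml_to_text and find_matches, and the pattern are all made up for the example; the real index.xml written by pdfreflow may be laid out quite differently.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for index.xml; the element and attribute names
# here are assumptions, not the actual pdfreflow output format.
SAMPLE_XML = b"""<pdfreflow>
  <page number="1"><text>Chapter 1</text><text>It was a dark night.</text></page>
  <page number="2"><text>Chapter 2</text><text>The morning came.</text></page>
</pdfreflow>"""


def xml_to_text(xml_bytes):
    """Parse the XML and return all of its text content with the tags dropped."""
    root = ET.fromstring(xml_bytes)
    # itertext() yields every text fragment in document order
    return ''.join(root.itertext())


def find_matches(pattern, txt):
    """Return every non-overlapping match of pattern in txt as a list of strings."""
    return [m.group(0) for m in re.finditer(pattern, txt)]


txt = xml_to_text(SAMPLE_XML)
chapters = find_matches(r'Chapter \d+', txt)
print(chapters)
```

Adjacent text fragments are joined with no separator, so a pattern that relies on whitespace between elements may need adjusting.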

Note that in the next calibre release you will be able to do

pdfreflow.reflow(stream, 1, 10)

to generate XML for only the first 10 pages.
