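[code]
from calibre.constants import plugins
from lxml import etree

pdfreflow, pdfreflow_err = plugins['pdfreflow']
input_file = 'd:\\test2.pdf'
stream = open(input_file, 'rb')
# Reflow the raw PDF bytes; the result is read back from index.xml below
pdfreflow.reflow(stream.read())
xml = open('index.xml', 'rb').read()
root = etree.fromstring(xml)
# Flatten the parsed tree to plain unicode text (tostring needs the element)
txt = etree.tostring(root, method='text', encoding=unicode)
[/code]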
And run your regexp on txt.
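The regexp step might look something like the sketch below. The sample text and the pattern are hypothetical, only there to show the shape of it; in practice txt would be the string extracted from index.xml and the pattern whatever you are searching for.

```python
import re

# Hypothetical stand-in for the text extracted from index.xml
txt = u"Chapter 1\nSome body text.\nChapter 2\nMore body text.\n"

# Hypothetical pattern: whole lines that look like chapter headings
heading_re = re.compile(r'^Chapter\s+(\d+)$', re.MULTILINE)

# Collect the chapter numbers found in the text
chapters = [int(m.group(1)) for m in heading_re.finditer(txt)]
print(chapters)
```

re.MULTILINE makes ^ and $ match at every line break, which is usually what you want when the text is one long string.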
Note that in the next calibre release you will be able to do
pdfreflow.reflow(stream, 1, 10)
to generate XML for only the first 10 pages.