Old 03-30-2011, 04:22 PM   #2
kovidgoyal
creator of calibre
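[code]
from calibre.constants import plugins
from lxml import etree

# Load the pdfreflow plugin (pdfreflow_err holds any load error message)
pdfreflow, pdfreflow_err = plugins['pdfreflow']

input_file = 'd:\\test2.pdf'
stream = open(input_file, 'rb')
# Reflow the PDF; the result is read back from index.xml below
pdfreflow.reflow(stream.read())
stream.close()

xml = open('index.xml', 'rb').read()
root = etree.fromstring(xml)
# Extract all text content as a unicode string (root must be passed in)
txt = etree.tostring(root, method='text', encoding=unicode)
[/code]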

And run your regexp on txt.
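To illustrate the "extract the text, then run a regexp" step on its own, here is a minimal sketch that runs without calibre. It makes several assumptions: the sample XML is a made-up stand-in (the real layout of index.xml is not shown in this post), the pattern is just an example, the names xml_to_text and find_matches are hypothetical, and it uses the standard library's xml.etree.ElementTree on Python 3 instead of lxml.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for index.xml; the real pdfreflow output
# almost certainly has a different structure.
SAMPLE_XML = b"""<pages>
  <page number="1"><text>Invoice 2011-0042 issued</text></page>
  <page number="2"><text>See invoice 2011-0107 for details</text></page>
</pages>"""


def xml_to_text(xml_bytes):
    """Return all text content of the XML document as one string."""
    root = ET.fromstring(xml_bytes)
    # itertext() walks every element and yields its text in document order
    return "".join(root.itertext())


def find_matches(pattern, xml_bytes):
    """Run a regular expression over the extracted text."""
    return re.findall(pattern, xml_to_text(xml_bytes))


if __name__ == "__main__":
    # Example pattern (an assumption): four digits, a dash, four digits
    for match in find_matches(r"\d{4}-\d{4}", SAMPLE_XML):
        print(match)
```

With lxml, the tostring(root, method='text', ...) call above should play the same role as xml_to_text here.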


Note that in the next calibre release you will be able to do

pdfreflow.reflow(stream, 1, 10)

to generate XML for only the first 10 pages.