Quote:
Originally Posted by davidfor
Separating the page and word count for PDF might make sense.
|
I've been planning a PDF metadata plugin that would extract everything it can find. Page count is on the list, and is trivial. But I don't know when I'll get around to it -- maybe the end of next month.
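For what it's worth, here is a rough sketch of how the page count part could look. It is not the plugin, just an illustration, and it assumes poppler's `pdfinfo` command is installed and on the PATH. The function names `parse_page_count` and `pdf_page_count` are made up for this example.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Return the number on the 'Pages:' line of pdfinfo output, or None."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def pdf_page_count(path):
    """Ask pdfinfo for the page count of the PDF at `path`.

    Assumption: poppler's `pdfinfo` is available on PATH.
    """
    result = subprocess.run(
        ["pdfinfo", path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_count(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parsing can be checked without having a PDF or poppler at hand.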
Quote:
How often does it actually fail?
|
I've had two or three PDFs fail, a couple of Markdown files fail (!!), and one or two extremely large EPUBs with lots of pictures fail.
For the PDFs -- I wonder if it would be possible to modify it to extract the text directly with a PDF tool instead of using pdf2html.
edit: And yes, at least some of those really did fail. One created >5 GB of images from a <20 MB PDF and filled up /tmp.
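To illustrate what I mean by extracting the text directly: a minimal sketch, assuming poppler's `pdftotext` is installed and on the PATH. This is not how the plugin currently works, and the names `count_words` and `pdf_word_count` are hypothetical. Passing `-` as the output file makes `pdftotext` write to stdout, so no images or other temporary files are written to /tmp.

```python
import re
import subprocess


def count_words(text):
    """Count runs of word characters; a crude but cheap word count."""
    return len(re.findall(r"\w+", text))


def pdf_word_count(path):
    """Word count of the PDF at `path`, using text extraction only.

    Assumption: poppler's `pdftotext` is available on PATH.
    """
    result = subprocess.run(
        # "-" as the output file sends the extracted text to stdout.
        ["pdftotext", "-enc", "UTF-8", path, "-"],
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return count_words(text)
```

The word count from this would not necessarily match what the HTML conversion route gives, since the two tools can split text differently, so treat it as an estimate.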