PDF Input
I looked for a tutorial on PDF input but did not find one. Is there one that I missed?
PDF input gives the worst results with the default settings. I understand that it is difficult to get good results from PDF by default. I think that I can tweak the "Structure Detection" settings to get better output.
In the last book I converted from PDF, a typical page (other than a chapter opening or the beginning of the book) has either the page number or the title of the book centered at the top of the page, depending on whether the page number is odd or even.
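A possible starting point for those running headers, as a minimal sketch only: it assumes the converted text is available as plain lines and that a header line contains nothing but a page number or the exact book title, padded with spaces because it is centered. The function name strip_running_headers and the sample title are made up for illustration, and the text the converter actually exposes to a search expression may look different (for example, wrapped in markup), so the pattern would likely need adjusting.

```python
import re

def strip_running_headers(text, title):
    """Drop lines that hold only a page number or the book title.

    Hypothetical helper: assumes plain-text lines and an exact title match.
    """
    # [ \t]* allows for the padding of a centered line; \d{1,4} is a page number.
    pattern = re.compile(
        r"^[ \t]*(?:\d{1,4}|" + re.escape(title) + r")[ \t]*$"
    )
    return [line for line in text.splitlines() if not pattern.match(line)]


sample = "  12  \nBody text.\n   My Book   \nHe was 12 years old."
print(strip_running_headers(sample, "My Book"))
```

Note that a body line consisting of nothing but a number would also be dropped, so this is a rough filter rather than an exact one.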
Chapter headings begin with the word CHAPTER followed by the number, and are centered. Each chapter has a variable number of subheadings, and these are centered too. These are NOT at the top of a page.
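For the chapter headings, a first guess at an expression could look like the sketch below. It assumes the number is written in digits (not Roman numerals or words), that CHAPTER is always upper case, and that the heading sits alone on its line; find_chapters is a hypothetical helper used only to try the pattern out on plain text.

```python
import re

# A line containing only "CHAPTER <digits>", with optional padding from centering.
CHAPTER_RE = re.compile(r"^[ \t]*CHAPTER[ \t]+(\d+)[ \t]*$")

def find_chapters(text):
    """Return (line index, chapter number) for each line that looks like a heading."""
    found = []
    for index, line in enumerate(text.splitlines()):
        match = CHAPTER_RE.match(line)
        if match:
            found.append((index, int(match.group(1))))
    return found


sample = "intro\n   CHAPTER 1\n  A Subheading\nText about Chapter 2 here.\nCHAPTER 12  "
print(find_chapters(sample))
```

The subheadings have no fixed keyword, so they cannot be matched by text alone in the same way; one option would be to treat the short lines that immediately follow a matched heading as subheadings.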
A few starter expressions would help a lot, or a pointer to existing documentation that you found useful. Given a start, I can expand on it using the Python regular expression documentation.
I don't need perfect.