Thread: PDF Input
04-24-2010, 02:24 AM   #1
asjogren
Addict
Posts: 266
Karma: 1378
Join Date: Dec 2009
Location: Seattle / San Carlos, Sonora, Mexico
Device: Kindle & WiFi Nook & PocketBook IQ
PDF Input

I looked, but did not find a tutorial on PDF input. Is there one that I just did not find?

Of all the input formats, PDF gives the worst results with the default settings. I do understand that PDF is difficult to get right by default. I think I can tweak the "Structure Detection" settings to get better output.

In the last book I converted from PDF, a typical page (other than the beginning of a chapter or the beginning of the book) has either the page number or the title of the book centered at the top of the page, depending on whether the page number is odd or even.
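As a starting point, a running header like the one described above could be matched in plain Python roughly as follows. This is only a sketch under stated assumptions: it works on plain text rather than on whatever markup the converter actually produces, it assumes each header sits alone on its own line, and `BOOK_TITLE`, `HEADER_RE` and `strip_headers` are hypothetical names made up for illustration.

```python
import re

# Hypothetical placeholder; the real book title would go here.
BOOK_TITLE = "My Book Title"

# Assumption: a running header is a line that holds ONLY a page number
# or ONLY the book title, with optional spaces or tabs around it.
HEADER_RE = re.compile(
    r"^[ \t]*(?:\d+|" + re.escape(BOOK_TITLE) + r")[ \t]*\n",
    re.MULTILINE,
)

def strip_headers(text):
    """Remove every line that looks like a running page header."""
    return HEADER_RE.sub("", text)

sample = "12\nSome body text.\nMy Book Title\nMore body text.\n"
cleaned = strip_headers(sample)
print(cleaned)
```

Note that this would also delete a body line consisting of nothing but a number, so the pattern would probably need tightening for a real book.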

Chapter headings begin with the word CHAPTER followed by the chapter number, and are centered. Each is followed by a variable number of chapter subheadings, which are also centered. These are NOT at the top of a page.
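The chapter heading layout described above could be detected with an expression along these lines. Again this is a rough sketch, not a tested recipe for any particular converter: it assumes plain text input, Arabic chapter numbers, and a heading that occupies a whole line by itself. `CHAPTER_RE` and `find_chapters` are hypothetical names, and the subheadings are not handled at all.

```python
import re

# Assumption: a chapter heading is a line holding only the word CHAPTER
# followed by a number, possibly indented because it was centered.
CHAPTER_RE = re.compile(r"^[ \t]*CHAPTER[ \t]+(\d+)[ \t]*$", re.MULTILINE)

def find_chapters(text):
    """Return the chapter numbers found on heading lines, in order."""
    return [int(m.group(1)) for m in CHAPTER_RE.finditer(text)]

sample = (
    "End of the previous chapter.\n"
    "      CHAPTER 3\n"
    "    A Subheading\n"
    "Body text that mentions CHAPTER 4 inline.\n"
)
chapters = find_chapters(sample)
print(chapters)
```

Anchoring with `^` and `$` under `re.MULTILINE` is what keeps the inline mention of a chapter in body text from being picked up as a heading.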

A few starter expressions would help a lot, or a pointer to existing documentation that you found useful. Given a start, I can expand on it using the Python regular expression documentation.

I don't need perfect.