Thread: PDF Input
04-25-2010, 11:04 PM   #9
asjogren
The start of a sample even-numbered page looks like this:

Code:
                                 T H E  T I T L E  O F  T H E  B O O K

The text follows for the rest of the page as you would normally expect.
Sentence after sentence.  
The end of the page is just like any other.  
It may split words within sentences.
A sample odd-numbered page looks as follows:

Code:
                                                   1 5 2

The text follows for the rest of the page as you would normally expect.  Sentence after sentence.  
The end of the page is just like any other.  
It may split words within sentences.

The regular expression I used for Header Removal was:
\d\s\d\s\d|\d\s\d|t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk
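For anyone who wants to try the pattern outside the converter, here is a minimal Python sketch of the same three-way alternation. It is only an illustration under stated assumptions: the `re.IGNORECASE` flag is an assumption about how a lower-case pattern ends up matching an upper-case header, `remove_headers` is a hypothetical helper name, the title part is spelled out letter by letter as "t i t l e", and the sample text is made up, with the header written as "O F" so that there is a space between every character.

```python
import re

# Assumption: matching is case-insensitive, because the sample header is in
# upper case while the pattern is written in lower case.
HEADER_RE = re.compile(
    r"\d\s\d\s\d"      # Expression 1: three spaced digits, e.g. "1 5 2"
    r"|\d\s\d"         # Expression 2: two spaced digits, e.g. "4 7"
    r"|t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk",  # Expression 3
    re.IGNORECASE,
)


def remove_headers(text):
    """Delete every header match and return the remaining text (hypothetical helper)."""
    return HEADER_RE.sub("", text)


# Hypothetical converted text: two page headers embedded in the running text.
converted = (
    "The end of the page may split a sen"
    "T H E  T I T L E  O F  T H E  B O O K"
    "tence. The next page carries on as usual. "
    "1 5 2"
    " Sentence after sentence."
)

cleaned = remove_headers(converted)
print(cleaned)
```

Matches are simply deleted, so a word that was split around a header ("sen" + "tence") is joined back together, while any whitespace around a removed page number is left behind.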

Logic:
- Expression 1 "\d\s\d\s\d" looks for 3-digit page numbers with a space between each digit.
- Expression 2 "\d\s\d" looks for 2-digit page numbers with a space between the two digits.
- I do not look for single-digit page numbers because there were too many false positives, where text was erroneously removed from the book. Even so, there were a couple of places where Expression 2 erroneously removed text.
- Expression 3 "t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk" looks for "the title of the book" in lower case, with a single space between each character and one or more spaces between the words of the title.
- Anchoring the expression to the start and end of the string did not work, because these page headers end up embedded within the resulting text rather than standing apart as they do in the PDF source document.
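
The two problems in the list above, false positives from the 2-digit expression and the failure of anchoring, can be reproduced in a few lines of Python. This is a rough sketch, not a description of what the converter does internally: both sample strings are hypothetical, and treating the converted text as one long string is an assumption based on the headers being described as embedded in it.

```python
import re

TWO_DIGIT = re.compile(r"\d\s\d")        # Expression 2 on its own
ANCHORED = re.compile(r"^\d\s\d\s\d$")   # Expression 1 with ^ and $ added

# False positive: \s also matches a line break, so a line that ends in a
# digit followed by a line that starts with one looks like a page number.
prose = "He was born in 1945\n3 years after the move."
damaged = TWO_DIGIT.sub("", prose)
print(repr(damaged))

# Anchoring: without re.MULTILINE, ^ matches only at the start of the whole
# string and $ only at its end (or just before a final newline), so a header
# embedded in the middle of the text is never found.
embedded = "It may split words within sen 1 5 2 tences."
print(ANCHORED.search(embedded))   # no match for the embedded header
print(ANCHORED.search("1 5 2"))    # matches only when the header is the whole string
```

The first case shows how real text can be lost: the "5", the line break and the "3" are removed together, leaving "194 years". The second shows why the unanchored form was the only one that found the headers.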

People with more experience with Python Regular Expressions are invited to improve on this novice's attempt.

Last edited by asjogren; 04-25-2010 at 11:26 PM. Reason: Format