MobileRead Forums - View Single Post

asjogren · 04-26-2010, 12:04 AM

The start of a sample even-numbered page:

Code:

                                 T H E  T I T L E  OF  T H E  B O O K

The text follows for the rest of the page as you would normally expect.
Sentence after sentence.  
The end of the page is just like any other.  
It may split words within sentences.

A sample odd-numbered page looks as follows:

Code:

                                                   1 5 2

The text follows for the rest of the page as you would normally expect.  Sentence after sentence.  
The end of the page is just like any other.  
It may split words within sentences.

The Regular Expression I used for Header Removal was:
\d\s\d\s\d|\d\s\d|t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk

Logic:
- Expression 1 "\d\s\d\s\d" matches 3-digit page numbers with a whitespace character between each digit.
- Expression 2 "\d\s\d" matches 2-digit page numbers with a whitespace character between the two digits.
- I do not match single-digit page numbers because that gave too many false positives, where ordinary text was erroneously removed from the book. Even so, there were a couple of places where Expression 2 erroneously removed text.
- Expression 3 "t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk" matches "the title of the book" in lower case, with one whitespace character between letters and one or more between the words of the title.
- Anchoring the expression to the start and end of the string did not work, because these page headers end up embedded within the resulting text, unlike in the PDF source document.
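
For anyone who wants to try the logic above outside the conversion tool, here is a minimal sketch in plain Python. It assumes the pattern is applied with re.sub over the whole converted text, which the post does not confirm. The names HEADER_RE and strip_headers are made up for illustration, and the re.IGNORECASE flag is an assumption added so the lower-case pattern also matches upper-case headers.

```python
import re

# The three alternatives from the post, joined with "|".
# "the title of the book" is the post's placeholder for the real title.
HEADER_RE = re.compile(
    r"\d\s\d\s\d"   # Expression 1: spaced-out 3-digit page number
    r"|\d\s\d"      # Expression 2: spaced-out 2-digit page number
    r"|t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk",  # Expression 3
    re.IGNORECASE,  # assumption: match upper-case headers with a lower-case pattern
)


def strip_headers(text):
    """Delete every header match, wherever it sits in the text (no anchors)."""
    return HEADER_RE.sub("", text)


# A page number embedded in running text is removed.
print(repr(strip_headers("end of one page 1 5 2 start of the next")))
# A spaced-out title is removed as well.
print(repr(strip_headers("words T H E  T I T L E  O F  T H E  B O O K more words")))
# False positive: ordinary text that happens to look like Expression 2.
print(repr(strip_headers("rooms 4 5 and 6")))
```

The last call shows the kind of loss described for Expression 2: "4 5" is deleted even though it is part of the sentence.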

People with more experience with Python Regular Expressions are invited to improve on this novice's attempt.
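
One possible refinement, offered as a sketch and not tested against a real book: fold Expressions 1 and 2 into a single alternative and use lookarounds so that a digit belonging to a longer number (the last "0" of "2010", say) cannot start or end a match. The names TITLE, TIGHTER_RE and strip_headers_tighter are hypothetical, and re.IGNORECASE is again an assumption.

```python
import re

# Placeholder title from the post; substitute the real book title.
TITLE = r"t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk"

TIGHTER_RE = re.compile(
    r"(?<!\d)\d(?:\s\d){1,2}(?!\d)"  # 2 or 3 spaced digits, not glued to another digit
    r"|" + TITLE,
    re.IGNORECASE,
)


def strip_headers_tighter(text):
    """Remove spaced page numbers and the spaced title, skipping digits inside longer numbers."""
    return TIGHTER_RE.sub("", text)


# The original pattern would delete "0 5" here; the lookbehind prevents that.
print(repr(strip_headers_tighter("In 2010 5 people came")))
# A real spaced page number is still removed.
print(repr(strip_headers_tighter("page end 1 5 2 page start")))
```

This only removes one class of false positive. Isolated digits separated by a space, as in "rooms 4 5 and 6", still match and are still deleted.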

Post #9. Last edited by asjogren; 04-26-2010 at 12:26 AM. Reason: Format