Old 04-24-2010, 02:24 AM   #1
asjogren
Addict
Posts: 266
Karma: 1378
Join Date: Dec 2009
Location: Seattle / San Carlos, Sonora, Mexico
Device: Kindle & WiFi Nook & PocketBook IQ
PDF Input

I looked, but did not find a tutorial on PDF input. Is there one that I just did not find?

PDF has the worst results using defaults. I do understand that this is difficult to get right by default from PDF. I think that I can tweak the "Structure Detection" to get better output.

In the last book I converted from PDF, a typical page (other than a chapter opening or the beginning of the book) has either the page number or the title of the book centered at the top of the page, depending on whether the page number is odd or even.

Chapter headings begin with the word CHAPTER followed by the number. This is centered. There are a variable number of Chapter sub headings - and these too are centered. These are NOT at the top of a page.

Some getting started expressions would help a lot. Or, a pointer to existing documentation that you found useful. Given a start, I can expand using the Python Regular Expression document.

I don't need perfect.
Old 04-24-2010, 03:34 AM   #2
speakingtohe
Wizard
Posts: 4,589
Karma: 25107878
Join Date: Apr 2010
Device: sony PRS-T1 and T3, Kobo Mini and Aura HD, Tablet
Hope This Helps
From the Sams book Teach Yourself Regular Expressions in 10 Minutes:
Matching Digits (and Nondigits)
[0-9] is a shortcut for [0123456789] and is used to match any digit. To match anything other than a digit, the set can be negated as [^0-9]. Table 4.2 lists the class shortcuts for digits and nondigits.

Table 4.2. Digit Metacharacters

Metacharacter   Description
\d              Any digit (same as [0-9])
\D              Any nondigit (same as [^0-9])

To demonstrate the use of these metacharacters, let's revisit a prior example:

Text:
var myArray = new Array();
...
if (myArray[0] == 0) {
...
}

Regex:
myArray\[\d\]

Result:
var myArray = new Array();
...
if (myArray[0] == 0) {
...
}

\[ matches [, \d matches any single digit, and \] matches ], so that myArray\[\d\] matches myArray[0]. myArray\[\d\] is shorthand for myArray\[[0-9]\], which is shorthand for myArray\[[0123456789]\]. This regular expression would also have matched myArray[1], myArray[2], and so on (but not myArray[10]).
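
For anyone who wants to try the quoted example outside the book, here is a minimal sketch using Python's standard re module. The sample strings and the \d+ variant at the end are my own additions for illustration, not part of the quoted passage.

```python
import re

# Pattern from the quoted passage: a literal "[", exactly one digit, a literal "]".
pattern = re.compile(r"myArray\[\d\]")

samples = ["myArray[0]", "myArray[7]", "myArray[10]", "myArray[x]"]
results = {s: bool(pattern.search(s)) for s in samples}

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")

# \d+ (one or more digits) is one way to accept multi-digit indexes as well.
multi_digit = re.compile(r"myArray\[\d+\]")
```

The single \d is why myArray[10] is rejected: after the first digit the pattern insists on a closing bracket.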

Last edited by speakingtohe; 04-24-2010 at 03:37 AM.
Old 04-24-2010, 01:43 PM   #3
asjogren
So, how do I figure out that I actually am at a page heading? Or, is that implicit? What is the environment? How do I figure out that the text in question is actually centered in the line?

The default regular expressions are quite complex.

I can eventually figure this stuff out myself with a LOT of time. I was hoping to find someone who had been there before me.
Old 04-24-2010, 06:57 PM   #4
speakingtohe
I am by no means an expert, although I can usually make it work.
When you click on Structure Detection and then on one of the magic wands (I just use the Header one), a dialog box comes up showing the raw input.
At the top is an example expression, which you have to modify.
The regex (the example expression) is made up of pattern-matching expressions and/or the literal characters you want to match. I believe it has a finite length, but I am not sure what the limit is.


To see how this works, copy a bit of text from the preview window and paste it into the regex line. Click Test. This will highlight the text you copied in light grey, which is hard to see, but if you click in the preview window it will turn yellow.
If the line of text occurs more than once, you can scroll down and see it highlighted everywhere it occurs.

I still do not fully understand the pattern matching, so I won't confuse you with my conceptions and misconceptions there.

The book I did had a footer or header containing a web address surrounded by brackets, i.e. ( ). So I just matched a distinctive part of it and added .'s before and after until it matched the whole thing. The . (period) matches any character.
If you had a line that said (This page is printed), the brackets can't be entered in the ordinary way because they have a special meaning in a regex, but .This page is printed. would match it.
Not the most elegant solution, especially if you use 37 .'s, but quick and dirty is okay on occasion for me.
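
A small sketch of the dot trick next to the stricter alternative, in plain Python. The lookalike string is made up for illustration. The backslash escape and re.escape() are standard Python regex features. I am assuming, not asserting, that the calibre dialog accepts the same escapes, since it uses Python regular expressions.

```python
import re

line = "(This page is printed)"

# Quick and dirty: "." stands in for each bracket, but also for any other character.
dot_pattern = re.compile(r".This page is printed.")

# Stricter: a backslash makes "(" and ")" literal characters.
escaped_pattern = re.compile(r"\(This page is printed\)")

# re.escape() adds the needed backslashes to a literal string automatically.
auto_pattern = re.compile(re.escape(line))

# Made-up text that is not the header but still satisfies the dot version.
lookalike = "xThis page is printedy"

print(bool(dot_pattern.search(lookalike)))      # the dots accept "x" and "y"
print(bool(escaped_pattern.search(lookalike)))  # the escaped version does not
```

The dot version is fine when the rest of the text is distinctive enough. The escaped version is worth the extra typing when the same words could also appear in the body of the book.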

I don't think it matters whether it is a header or a footer in PDFs, or whether it is centered or not.
Case does matter: Chapter ...... is not the same as CHAPTER ......
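
To see the case point in isolation, here is a short Python sketch. The sample text is invented. re.IGNORECASE and the inline (?i) flag are standard Python regex features, but whether the calibre dialog honours the inline flag is an assumption you would need to check with the Test button.

```python
import re

text = "CHAPTER 12"

# Default behaviour: "Chapter" does not match "CHAPTER".
case_sensitive = re.search(r"Chapter \d+", text)

# Two standard ways to ignore case in Python regular expressions.
flag_version = re.search(r"Chapter \d+", text, re.IGNORECASE)
inline_version = re.search(r"(?i)chapter \d+", text)

print(case_sensitive)          # no match object
print(flag_version.group())
print(inline_version.group())
```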

I am pretty new at the Python stuff myself and of an advanced age, so not real fast on the uptake, but it isn't impossible, just a bit daunting at times.

The first step, IMO, is to type chapter (in the correct case) into the regex box, click Test, then click in the preview window and scroll down to see what is highlighted.

You probably know that for pdf's 0.04 is a good default value for line unwrapping.

Just remember, it is all easy once you have done it.
Helen
Old 04-25-2010, 01:02 AM   #5
asjogren
Partial Success! Thank you SpeakingToHe!

Some observations:
1) Nothing appears to restrict header removal to actual page headers in PDF input. There are false positives within the body of the book, where matching text is removed as well.
2) The case you have to match is the case of the text after conversion to XHTML, not the case shown in the PDF.
3) Even though the source PDF had the page headings centered on the line, they were no longer centered by the time the pattern matching was applied.

What I had was alternating odd - even pages of page headings, centered. The odd pages had page numbers with blanks between the digits, for example "2 1" and "3 5 1".

The even number pages had a page heading of the book title in upper case with spaces between the letters and multiple spaces between the words, like "T H E T I T L E OF T H E B O O K". However, I had to match the lower case of the book title - with the extra spaces.
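
Typing a letter-spaced title by hand is error prone, so here is a sketch that builds the pattern mechanically. spaced_pattern is a hypothetical helper of my own, not anything in calibre, and it assumes one whitespace character between letters and one or more between words, as described above.

```python
import re

def spaced_pattern(title: str) -> str:
    # Hypothetical helper: lower-case the title, put a single-whitespace token
    # between the letters of each word, and a one-or-more-whitespace token
    # between words.
    words = title.lower().split()
    return r"\s+".join(
        r"\s".join(re.escape(ch) for ch in word) for word in words
    )

pattern = spaced_pattern("The Title of the Book")
print(pattern)

# Lower-case header text with two spaces between the words.
header = "t h e  t i t l e  o f  t h e  b o o k"
print(bool(re.search(pattern, header)))
```

The printed pattern can then be pasted into the regex box. Because the title is lower-cased first, the upper-case form from the PDF will not match unless case is ignored.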
Old 04-25-2010, 04:37 AM   #6
speakingtohe
Good for you. Let me know how you make out.
Helen
Old 04-25-2010, 04:40 PM   #7
asjogren
I have thought about writing up a HowTo. But I actually think that the Header and Footer removal feature needs to change to be part of the PDF-->XHTML phase. It would be less error prone there and easier to specify.

What do you think, Helen? Should I post a sample page from the book, the XHTML intermediary, and my header removal regular expression?
Old 04-25-2010, 05:15 PM   #8
speakingtohe
Sounds Very Good to me
And useful which is even better
Helen
Old 04-25-2010, 11:04 PM   #9
asjogren
The start of a sample even numbered page:

Code:
                                 T H E  T I T L E  OF  T H E  B O O K

The text follows for the rest of the page as you would normally expect.
Sentence after sentence.  
The end of the page is just like any other.  
It may split words within sentences.

A sample odd numbered page looks as follows:

Code:
                                                   1 5 2

The text follows for the rest of the page as you would normally expect.  Sentence after sentence.  
The end of the page is just like any other.  
It may split words within sentences.

The Regular Expression I used for Header Removal was:
\d\s\d\s\d|\d\s\d|t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk

Logic:
- Expression 1 "\d\s\d\s\d" looks for 3 digit page numbers with a space between each digit.
- Expression 2 "\d\s\d" looks for 2 digit page number with a space between the 2 digits.
- I do not look for single digit page numbers because there were too many false positives where text was removed erroneously from the book. As it was, there were a couple of places where I erroneously lost text with Expression 2.
- Expression 3 "t\sh\se\s+t\si\st\sl\se\s+o\sf\s+t\sh\se\s+b\so\so\sk" looks for "the title of the book" in lower case with a space between each character and multiple spaces between words in the title.
- Anchoring the Expression to the start and end of the string did not work - as these page headers were embedded within the resulting text, unlike the PDF source document.
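
The false positives from the page-number expressions can be reproduced in a few lines of Python. The sample sentences are invented, and the strict variant is my own assumed tightening, not something tested in calibre: it only requires that the digits are not glued to other non-space characters.

```python
import re

# Page-number alternatives from the list above: three spaced digits, or two.
loose = re.compile(r"\d\s\d\s\d|\d\s\d")

# Assumed tightening (not from the thread): whitespace or the string edge
# must sit on both sides of the spaced digits.
strict = re.compile(r"(?<!\S)(?:\d\s\d\s\d|\d\s\d)(?!\S)")

embedded_header = "end of the page. 1 5 2 The text follows"
body = "He paid 12 5 times over"

print(loose.sub("", body))    # "2 5" is removed from the body text
print(strict.sub("", body))   # body text is left alone
print(strict.search(embedded_header).group())

# Still a false positive for both versions: isolated digits in a sentence.
still_ambiguous = "in 3 5 cases"
```

This removes the cases where part of a longer number is eaten, but a sentence that really contains two separate digits still looks exactly like a page header, so it is a reduction in false positives rather than a cure.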

People with more experience with Python Regular Expressions are invited to improve on this novice's attempt.

Last edited by asjogren; 04-25-2010 at 11:26 PM. Reason: Format