Old 08-20-2009, 02:31 PM   #1
rrosenwald
OS/2 forever
Posts: 12
Karma: 10
Join Date: Aug 2009
Location: Scottsdale, AZ
Device: none
Remove Header from PDF

I'm semi-desperate. I need to convert PDFs to ePub, and first I need to remove their headers. On even-numbered pages the header looks like "nnn Author Name" and on odd-numbered pages like "Book Title nnn" (no quotes; nnn is the page number). I have tried virtually every regex I can come up with and nothing works. I can remove Author Name or Book Title but not the page numbers, or I can remove the page numbers but not the author name or book title. Any help would be greatly appreciated.
Old 08-20-2009, 05:21 PM   #2
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
I haven't gotten around to producing any GUI tools to make it easier. The process right now is a bit difficult: run ebook-convert in.pdf .epub --debug-input ./out_dir, look at the HTML produced, create a regex, then convert using --remove-header and --remove-footer and set the header and footer regexes.
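A candidate regex can be checked against the debug HTML before running a full conversion. Here is a minimal Python sketch (calibre's regexes use Python's re syntax). The helper name find_header_matches and the sample HTML are hypothetical stand-ins for illustration, not calibre output; in practice you would read the HTML file from your debug directory instead.

```python
import re

def find_header_matches(html, pattern, limit=5):
    """Return the total match count and the first few matched snippets."""
    matches = [m.group(0) for m in re.finditer(pattern, html)]
    return len(matches), matches[:limit]

# Hypothetical stand-in for the intermediate HTML; real output will differ
# from book to book, so inspect your own file first.
sample_html = (
    "<i>12</i><br>\n<i>Author Name</i><br>\n"
    "Body text mentioning 3 apples.<br>\n"
)

# Candidate header regex written against the HTML, not the visible text.
candidate = r"<i>\d+</i><br>\s*<i>Author Name</i><br>"

count, snippets = find_header_matches(sample_html, candidate)
print(count, snippets)
```

If the count is zero, or the snippets include body text, the pattern needs adjusting before it goes into the header regex setting.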
Old 08-20-2009, 06:42 PM   #3
rrosenwald
OS/2 forever
Part of my problem is that I have 200+ books to convert and I'd rather not touch each one individually.
Old 08-20-2009, 07:47 PM   #4
user_none
Sigil & calibre developer
You don't have much choice in the matter. The page numbers are part of the text, and there is little chance one regex will work for all of them.
Old 08-20-2009, 11:54 PM   #5
rrosenwald
OS/2 forever
I believe the page numbers are part of the headers. I can remove them easily with a Regex of [0-9]+ in the Remove Header box.
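For illustration, here is a small Python sketch of what a bare [0-9]+ does compared with a pattern anchored to whole header lines. It assumes the pattern is applied to the entire text the way re.sub does, and it uses made-up plain text lines; the HTML that calibre actually matches against will contain tags, so the anchored pattern would need adapting.

```python
import re

# Made-up plain text: an even-page header, a body line, an odd-page header.
page_text = "12 Author Name\nIn 1984 he bought 3 horses.\nBook Title 13\n"

# A bare digit pattern removes every run of digits, body text included.
digits_only = re.sub(r"[0-9]+", "", page_text)

# Anchoring to whole lines removes only lines that look like headers.
header_lines = r"^\d+ Author Name$|^Book Title \d+$"
anchored = re.sub(header_lines, "", page_text, flags=re.MULTILINE)

print(repr(digits_only))
print(repr(anchored))
```

The first result loses "1984" and "3" from the body line and leaves the author name and title behind, while the second keeps the body line intact and empties both header lines.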
Old 08-21-2009, 03:03 AM   #6
itimpi
Wizard
Posts: 4,552
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
As I understand it, the "Remove Headers" facility in Calibre is intended to handle the case where the headers are embedded in the text. The regex is used to identify the text that belongs to the header.
Old 08-21-2009, 08:16 AM   #7
rrosenwald
OS/2 forever
One other thing that adds to frustration is that the MobiPocket ebook creator does seem to identify and strip out the headers.
Old 08-21-2009, 06:58 PM   #8
DerSchwarzePrinz
Member
Posts: 24
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
Quote:
Originally Posted by rrosenwald
One other thing that adds to frustration is that the MobiPocket ebook creator does seem to identify and strip out the headers.
Have you tried

\d{1,3} Author Name

in remove header line

and

Book Title \d{1,3}

in remove footer line?
Old 08-22-2009, 02:41 PM   #9
rrosenwald
OS/2 forever
Yes. Unfortunately that doesn't work either.
Old 08-22-2009, 05:16 PM   #10
user_none
Sigil & calibre developer
Quote:
Originally Posted by DerSchwarzePrinz
Have you tried

\d{1,3} Author Name

in remove header line

and

Book Title \d{1,3}

in remove footer line?
In most cases this won't work because the regex matches against the HTML produced at a middle stage in the conversion pipeline. You're going to need something like:

Code:
(<A name=\d+></a><i>\d+</i><br>\s*<i>Book Title</i><br>)|(<A name=\d+></a><i>Book Title</i><br>\s*<i>\d+</i><br>)
Right now the only way to get a look at that intermediary HTML is to use the command line ebook-convert tool with the --debug-input flag.
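To illustrate the difference, here is a short Python sketch that runs the plain-text pattern and the HTML-aware pattern above against the same input. The sample HTML is hypothetical, shaped to fit the regex above; the markup for your book will likely differ, so check your own debug output.

```python
import re

# Hypothetical intermediate HTML: an even-page header, body text,
# an odd-page header, more body text.
html = (
    "<A name=12></a><i>12</i><br>\n<i>Book Title</i><br>\n"
    "First paragraph.<br>\n"
    "<A name=13></a><i>Book Title</i><br>\n<i>13</i><br>\n"
    "Second paragraph.<br>\n"
)

# Pattern written against the visible text: the tags get in the way.
plain_pattern = r"Book Title \d{1,3}"
plain_hit = re.search(plain_pattern, html)

# Pattern written against the HTML, as in the post above.
html_pattern = (
    r"(<A name=\d+></a><i>\d+</i><br>\s*<i>Book Title</i><br>)"
    r"|(<A name=\d+></a><i>Book Title</i><br>\s*<i>\d+</i><br>)"
)
cleaned = re.sub(html_pattern, "", html)

print(plain_hit)
print(repr(cleaned))
```

The plain pattern finds nothing because markup sits between the title and the page number, while the HTML-aware pattern strips both header variants and leaves the paragraphs alone.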
Old 08-22-2009, 08:36 PM   #11
DerSchwarzePrinz
Member
Quote:
Originally Posted by user_none
In most cases this won't work because the regex matches against the HTML produced at a middle stage in the conversion pipeline. In most cases you're going to need something like:

Code:
(<A name=\d+></a><i>\d+</i><br>\s*<i>Book Title</i><br>)|(<A name=\d+></a><i>Book Title</i><br>\s*<i>\d+</i><br>)
Right now the only way to get a look at that intermediary HTML is to use the command line ebook-convert tool with the --debug-input flag.
I found the following in the HTML output:

»Tatsächlich?«,&nbsp;&nbsp;erwiderte&nbsp;&nbsp;der&nbsp;&nbsp;junge&nbsp;&nbsp;Mann&nbsp;&nbsp;tro-<br>
5<br>
<hr>
<A name=6></a>
cken.<br>

How can I remove the passage marked in bold (the number 5 is the page number)?
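One possible approach, sketched in Python. The bold marking is not visible in the text above, so this assumes the passage to remove is the page-number line, the <hr>, and the <A name=...> anchor that follows it. The sample string is a shortened copy of the snippet above.

```python
import re

# Shortened copy of the snippet above.
html = (
    "Mann&nbsp;&nbsp;tro-<br>\n"
    "5<br>\n"
    "<hr>\n"
    "<A name=6></a>\n"
    "cken.<br>\n"
)

# Assumed target: a line holding only the page number, then <hr>,
# then the page anchor, plus the whitespace between and after them.
page_break = r"^\d+<br>\s*<hr>\s*<A name=\d+></a>\s*"
cleaned = re.sub(page_break, "", html, flags=re.MULTILINE)

print(repr(cleaned))
```

This only removes the page break; it does not rejoin the word split across the pages ("tro-" and "cken."), which would need a separate step.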