08-20-2009, 02:31 PM | #1 |
OS/2 forever
Posts: 12
Karma: 10
Join Date: Aug 2009
Location: Scottsdale, AZ
Device: none
|
Remove Header from PDF
I'm semi-desperate. I need to remove the headers from PDFs that I am converting to ePub. On even-numbered pages the header looks like "nnn Author Name" and on odd-numbered pages like "Book Title nnn" (without the quotes; nnn is the page number). I have tried virtually every regex I can come up with and nothing works: I can remove the author name or book title but not the page numbers, or the page numbers but not the author name or book title. Any help would be greatly appreciated.
|
08-20-2009, 05:21 PM | #2 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
I haven't gotten around to producing any GUI tools to make it easier. The process right now is a bit difficult: run ebook-convert in.pdf .epub --debug-input ./out_dir, look at the HTML it produces, write a regex that matches the header, then convert again using --remove-header and --remove-footer with the header and footer regexes set.
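Before re-running the conversion, it can help to check a candidate regex against the dumped HTML. The following is only a minimal sketch: preview_matches and scan_debug_dir are hypothetical helper names, not part of calibre, and it assumes the debug directory contains .htm/.html files readable as UTF-8.

```python
import re
from pathlib import Path


def preview_matches(html, pattern, limit=5):
    """Return up to `limit` substrings of `html` that `pattern` matches."""
    found = [m.group(0) for m in re.finditer(pattern, html)]
    return found[:limit]


def scan_debug_dir(out_dir, pattern):
    """Count matches of `pattern` in every .htm/.html file under `out_dir`.

    Assumption: the debug output is HTML on disk; adjust the glob if not.
    """
    counts = {}
    for path in sorted(Path(out_dir).rglob("*.htm*")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # finditer counts whole matches regardless of capture groups
        counts[str(path)] = sum(1 for _ in re.finditer(pattern, text))
    return counts
```

Calling scan_debug_dir("./out_dir", your_regex) and comparing the counts with the number of pages gives a rough idea of whether the regex catches every header.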
|
08-20-2009, 06:42 PM | #3 |
OS/2 forever
Posts: 12
Karma: 10
Join Date: Aug 2009
Location: Scottsdale, AZ
Device: none
|
Part of my problem is that I have 200+ books to convert and I'd rather not touch each one individually.
|
08-20-2009, 07:47 PM | #4 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
You don't have much choice in the matter. The page numbers are part of the text and there is little chance one regex will work for all of them.
|
08-20-2009, 11:54 PM | #5 |
OS/2 forever
Posts: 12
Karma: 10
Join Date: Aug 2009
Location: Scottsdale, AZ
Device: none
|
I believe the page numbers are part of the headers. I can remove them easily with a Regex of [0-9]+ in the Remove Header box.
|
08-21-2009, 03:03 AM | #6 |
Wizard
Posts: 4,552
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
|
As I understand it, the "Remove Headers" facility in Calibre is intended to handle the case where the headers are embedded in the text. The regular expression is used to identify the text that belongs to the header.
|
08-21-2009, 08:16 AM | #7 |
OS/2 forever
Posts: 12
Karma: 10
Join Date: Aug 2009
Location: Scottsdale, AZ
Device: none
|
One other thing that adds to my frustration is that the MobiPocket ebook creator does seem to identify and strip out the headers.
|
08-21-2009, 06:58 PM | #8 |
Enthusiast
Posts: 25
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
|
|
08-22-2009, 02:41 PM | #9 |
OS/2 forever
Posts: 12
Karma: 10
Join Date: Aug 2009
Location: Scottsdale, AZ
Device: none
|
Yes. Unfortunately that doesn't work either.
|
08-22-2009, 05:16 PM | #10 | |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
Quote:
Code:
(<A name=\d+></a><i>\d+</i><br>\s*<i>Book Title</i><br>)|(<A name=\d+></a><i>Book Title</i><br>\s*<i>\d+</i><br>)
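As a rough check of what this pattern removes, here is a small Python sketch that applies it to a made-up fragment. The sample HTML is invented for illustration and only imitates the structure the regex expects; "Book Title" is a literal placeholder that has to be replaced with the real title (or author name) of each book.

```python
import re

# The pattern from the post, split over two lines for readability.
HEADER_RE = re.compile(
    r"(<A name=\d+></a><i>\d+</i><br>\s*<i>Book Title</i><br>)"
    r"|(<A name=\d+></a><i>Book Title</i><br>\s*<i>\d+</i><br>)"
)

# Hypothetical converter output: one header with the page number first,
# one with the title first.
sample = (
    "End of page eleven.<br>\n"
    "<A name=12></a><i>12</i><br>\n<i>Book Title</i><br>\n"
    "Text of page twelve.<br>\n"
    "<A name=13></a><i>Book Title</i><br>\n<i>13</i><br>\n"
    "Text of page thirteen.<br>\n"
)

# Replace every matched header with nothing; body text is left alone.
cleaned = HEADER_RE.sub("", sample)
print(cleaned)
```

Both header variants are removed while the body lines stay in place. If the real HTML differs even slightly (extra attributes, different tag case), the pattern will not match and needs adjusting.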
|
08-22-2009, 08:36 PM | #11 | |
Enthusiast
Posts: 25
Karma: 16
Join Date: Aug 2009
Device: Pocketbook 360, Sony PRS-T1
|
Quote:
»Tatsächlich?«, erwiderte de r junge Mann tr o-<br> 5<br> <hr> <A name=6></a>cken.<br>
How can I remove the passage marked in bold (number 5 is the page number)?
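One possible approach, sketched in Python. The bold markup did not survive in the text above, so this assumes the bold passage was the line break, page number, rule and anchor between "o-" and "cken". PAGE_BREAK_RE is a hypothetical name, and the pattern has only been checked against this one line.

```python
import re

# Assumed target: "<br> 5<br> <hr> <A name=6></a>", i.e. a line break,
# a bare page number, a horizontal rule and the next page's anchor.
PAGE_BREAK_RE = re.compile(r"<br>\s*\d+<br>\s*<hr>\s*<A name=\d+></a>")

line = (
    "»Tatsächlich?«, erwiderte de r junge Mann tr o-<br> 5<br> <hr> "
    "<A name=6></a>cken.<br>"
)

# Removing the match joins the text before and after the page break.
joined = PAGE_BREAK_RE.sub("", line)
print(joined)
```

This leaves the hyphen and the stray spaces inside the words untouched; those come from the PDF text extraction and would need separate handling.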
|