10-11-2012, 02:34 PM   #1
MontyJ
Those pesky "file:///" headers/footers

Hola!

Regex/tag code noob here!!

I successfully implemented the "ABC Amber" solution, but I simply cannot find a solution in Calibre for these header/footer issues.

The input file is a PDF. If I convert the file to HTMLZ and then look at the HTML source, the footer looks like this (copied and pasted; I altered only the book title info to protect the innocent!):

Code:
<p class="calibre1">|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of%20Seee%20and%20Jfff,%20The%20v2.htm (2 of 230)15-8-2005 22:23:09</p>
However, in the PDF viewer (before conversion), the footer looks like this:

Code:
file:///H|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of%20Seee%20and%20Jfff,%20The%20v2.htm (1 of 230)15-8-2005 22:23:09
Since I want to remove any and all headers/footers that start with "file:///", I tried:

Code:
file:///.+\d
I also tried several variants with a tag pattern prefixed, like:

Code:
<p.*?> and <b.*?>
but still no match. I have tried a dozen other variations on this, but nothing triggers a hit in the Regex Builder. I enter the regex, click "Open", select the PDF file, then hit "Test", and nothing is detected.
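
One possible explanation, sketched in Python (calibre's regexes use Python's `re` syntax): if the pattern is tested against the converted HTML rather than against what the PDF viewer displays, the "file:///H" prefix is already missing, so nothing anchored on "file:///" can match. The two sample strings are copied from the snippets above; `footer_re` is a hypothetical alternative of mine that keys on the "(N of M)" counter and timestamp, which do survive the conversion.

```python
import re

# Footer as it appears in the HTMLZ source (copied from the post).
html_line = ('<p class="calibre1">|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of'
             '%20Seee%20and%20Jfff,%20The%20v2.htm (2 of 230)15-8-2005 22:23:09</p>')

# Footer as the PDF viewer shows it (copied from the post).
viewer_line = ('file:///H|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of'
               '%20Seee%20and%20Jfff,%20The%20v2.htm (1 of 230)15-8-2005 22:23:09')

# The first attempt from the post.
first_try = re.compile(r'file:///.+\d')

print(bool(first_try.search(viewer_line)))  # matches the viewer text
print(bool(first_try.search(html_line)))    # no "file:///" left to match

# Hypothetical alternative (an assumption, not a calibre-supplied pattern):
# match any paragraph ending in "(N of M)" followed by a d-m-yyyy hh:mm:ss stamp.
footer_re = re.compile(
    r'<p[^>]*>[^<]*\(\d+ of \d+\)\d{1,2}-\d{1,2}-\d{4} \d{2}:\d{2}:\d{2}</p>'
)

print(bool(footer_re.search(html_line)))
```

Whether the Regex Builder really sees the HTML in this truncated form is an assumption based only on the HTMLZ output shown above.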

I would like to keep any "normal" headers/footers, so 'cropping' or 'stripping' ALL headers/footers is a last resort.

So either I am using the builder feature improperly, or my regexes are missing the mark totally... or both, LoL.

Thanks for any pointers!

MontyJ

Update: I found a long way around most of the problem:

1. Add this code as the SEARCH regex in the conversion settings:
Code:
<p.*?>file:///\S\|.*?</p>
2. Convert the PDF or EPUB book (any format, actually) to HTMLZ
3. Then "Add book" and select the HTMLZ file
4. Select CONVERT to EPUB, MOBI, or whatever
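
What the search expression in step 1 does can be sketched in Python. The `page` string is a made-up example: it assumes the footer reaches the search step with its "file:///H|" prefix intact and wrapped in a `<p>` tag like the HTMLZ sample. The `tight` pattern is my own suggested variant, not part of the steps above. It is there because `<p.*?>` can start matching at an earlier paragraph when several paragraphs share one line.

```python
import re

# The search expression from step 1, unchanged.
search = re.compile(r'<p.*?>file:///\S\|.*?</p>')

# Hypothetical variant: [^>]* cannot run past the end of the opening tag.
tight = re.compile(r'<p[^>]*>file:///\S\|[^<]*</p>')

# Made-up page fragment (assumption: the prefix is still present here).
page = ('<p class="calibre1">Last line of the page.</p>\n'
        '<p class="calibre1">file:///H|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of'
        '%20Seee%20and%20Jfff,%20The%20v2.htm (1 of 230)15-8-2005 22:23:09</p>\n'
        '<p class="calibre1">First line of the next page.</p>')

# One paragraph per line: only the footer goes, but an empty line stays behind.
cleaned = search.sub('', page)
print(cleaned)

# Same content on a single line: the original pattern starts at the first <p
# and swallows the real paragraph in front of the footer as well.
one_line = page.replace('\n', '')
print(search.sub('', one_line))
print(tight.sub('', one_line))
```

With one paragraph per line both patterns behave the same, since `.` does not cross line breaks by default.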

The offending header/footer code is now gone. However, a few artifacts are still left over, like the occasional "|" character. An "empty string" is also left behind in the HTML file, and this may be the cause.
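
A second search pass might mop up those leftovers. This is only a sketch in Python: it assumes the artifacts are paragraphs containing nothing but whitespace and/or a lone "|", which I have not confirmed against the actual file. The `leftover` pattern and the `sample` text are both hypothetical.

```python
import re

# Assumption: a leftover is a <p> holding only whitespace and/or a single "|".
# The trailing \n? also removes the line break after the emptied paragraph.
leftover = re.compile(r'<p[^>]*>\s*\|?\s*</p>\n?')

# Made-up fragment for illustration.
sample = ('<p class="calibre1">Real text | with a pipe stays.</p>\n'
          '<p class="calibre1">|</p>\n'
          '<p class="calibre1"></p>\n'
          '<p class="calibre1">More real text.</p>')

result = leftover.sub('', sample)
print(result)
```

A "|" inside a paragraph with real text is left alone, since the pattern only matches when nothing else sits between the tags.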

This code DOES NOT WORK directly on EPUB or MOBI files as is, so I still need to figure that out to save the extra conversion step!

Last edited by MontyJ; 10-11-2012 at 11:50 PM. Reason: Update