Old 10-11-2012, 02:34 PM   #1
MontyJ
Addict
MontyJ began at the beginning.
 
Posts: 224
Karma: 10
Join Date: Jul 2012
Device: Kindle
Question Those pesky "file:///" headers/footers

Hola!

Regex/tag code noob here!!

I successfully implemented the "ABC Amber" solution, but I simply cannot find a solution in Calibre for these header/footer issues.

The input file is a PDF. If I convert the file to HTMLZ and then look at the HTML source, the footer looks like this (copied and pasted; I altered only the book title to protect the innocent!):

Code:
<p class="calibre1">|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of%20Seee%20and%20Jfff,%20The%20v2.htm (2 of 230)15-8-2005 22:23:09</p>
However, in the PDF viewer, the result (before conversion) looks like this:

Code:
file:///H|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of%20Seee%20and%20Jfff,%20The%20v2.htm (1 of 230)15-8-2005 22:23:09
Since I want to remove any and all headers/footers that start with the "file:///", I tried:

Code:
file:///.+\d
I also tried several variants on the prefix tag code, like:

Code:
<p.*?> and <b.*?>
but still no detection. I have tried a dozen other variations on this, but nothing triggers a hit in the Regex Builder. I enter the regex, select "Open" and choose the PDF file, then hit "Test", but nothing is detected.
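One possible reading of the two samples quoted above: the paragraph in the HTMLZ source, as pasted, does not contain the "file:///" prefix at all, so an expression anchored on that prefix cannot hit it. The following is only a sketch using Python's `re` module (calibre's search expressions are, as far as I know, Python regular expressions). The `loose` pattern is a hypothetical alternative and not something from the thread.

```python
import re

# Sample strings copied from the post above (title already obfuscated there).
html_line = ('<p class="calibre1">|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of%20Seee'
             '%20and%20Jfff,%20The%20v2.htm (2 of 230)15-8-2005 22:23:09</p>')
viewer_line = ('file:///H|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of%20Seee'
               '%20and%20Jfff,%20The%20v2.htm (1 of 230)15-8-2005 22:23:09')

# The expression tried in the post.
pattern = re.compile(r'file:///.+\d')

# Hypothetical looser pattern keyed on "|/" and the "(N of M)" page counter.
loose = re.compile(r'\|/.+\(\d+ of \d+\)')

matches_viewer = pattern.search(viewer_line) is not None
matches_html = pattern.search(html_line) is not None
loose_viewer = loose.search(viewer_line) is not None
loose_html = loose.search(html_line) is not None

print(matches_viewer, matches_html, loose_viewer, loose_html)
```

Under these assumptions the original expression hits only the PDF-viewer text, while the looser one hits both samples.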

I would like to keep any "normal" headers/footers, so 'cropping' or 'stripping' ALL headers/footers is a last resort.

So either I am using the builder feature improperly, or my regex codes are missing the mark totally... or both, LoL.

Thanks for any pointers!

MontyJ

Update: I found a long way around most of the problem:

1. Add this code to the conversion SEARCH regex definition:
Code:
<p.*?>file:///\S\|.*?</p>
2. Convert the PDF, EPUB (any format, actually) book to HTMLZ
3. Then "Add book" and select the HTMLZ file
4. Select CONVERT to EPUB, MOBI, or whatever
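What the search expression from step 1 does to the HTML can be sketched outside calibre with Python's `re` module. This is only an illustration: the surrounding paragraphs in `sample` are made up, and `safer_re` is a hypothetical variant that is not from the post.

```python
import re

# The search expression from step 1; the replacement is the empty string.
header_re = re.compile(r'<p.*?>file:///\S\|.*?</p>')

# Hypothetical variant: [^>]* cannot run past the end of the opening tag.
safer_re = re.compile(r'<p[^>]*>file:///\S\|.*?</p>')

# Made-up fragment; only the middle paragraph is modelled on the quoted footer.
sample = (
    '<p class="calibre1">First paragraph of real text.</p>\n'
    '<p class="calibre1">file:///H|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of'
    '%20Seee%20and%20Jfff,%20The%20v2.htm (2 of 230)15-8-2005 22:23:09</p>\n'
    '<p class="calibre1">Second paragraph of real text.</p>\n'
)

# One paragraph per line: "." does not cross newlines, so only the footer goes.
cleaned = header_re.sub('', sample)

# Same markup on a single line: the lazy ".*?" starts at the first "<p"
# and swallows the real paragraph in front of the footer as well.
one_line = sample.replace('\n', '')
overmatched = header_re.sub('', one_line)
safe = safer_re.sub('', one_line)
```

The design point is that `<p.*?>` only behaves when each paragraph sits on its own line. If the converter ever emits several paragraphs on one line, `<p[^>]*>` is the more conservative choice.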

The offending header/footer code is now gone. However, a few artifacts still remain, like the occasional "|" character. An "empty string" is also left behind in the HTML file, and this may be the issue.
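One guess at the leftovers: footers that lost their "file:///H" prefix during conversion (like the HTMLZ sample quoted earlier) would slip past the expression above, and paragraphs emptied by the replacement may stay behind. A minimal sketch under those assumptions follows; both patterns and the `clean` helper are hypothetical and would need checking against the real HTML.

```python
import re

# Assumed variants: optional "file:///X" prefix, then "|/", then the footer text.
FOOTER_RE = re.compile(r'<p[^>]*>(?:file:///\S)?\|/[^<]*</p>')
# Paragraphs that contain nothing but whitespace.
EMPTY_P_RE = re.compile(r'<p[^>]*>\s*</p>')

def clean(html):
    """Remove footer paragraphs, then any paragraphs left empty."""
    html = FOOTER_RE.sub('', html)
    return EMPTY_P_RE.sub('', html)

sample = (
    '<p class="calibre1">Real text.</p>\n'
    '<p class="calibre1">|/eMaaa/Inbbb/Chccc,%20C.J.%20-%20Tddd%20of%20Seee'
    '%20and%20Jfff,%20The%20v2.htm (2 of 230)15-8-2005 22:23:09</p>\n'
    '<p class="calibre1"> </p>\n'
    '<p class="calibre1">More real text.</p>\n'
)
result = clean(sample)
```

Because `[^<]*` stops at the first tag, this sketch will not match a footer that has inline markup inside it.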

This code DOES NOT WORK directly on EPUB or MOBI files as is, so I still need to figure that out to save the extra conversion step!
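For EPUB only, one way around the extra step might be to edit the files inside the book directly, since an EPUB is a ZIP container. The sketch below is a standalone script, not a calibre feature. The function name `strip_headers` is hypothetical, the content files are assumed to be UTF-8, and the pattern is the `[^>]*` variant of the search expression above. It does not apply to MOBI, which is not a ZIP archive.

```python
import re
import zipfile

HEADER_RE = re.compile(r'<p[^>]*>file:///\S\|.*?</p>')

def strip_headers(src, dst):
    """Copy the EPUB at src to dst, removing matching paragraphs.

    src and dst may be file paths or binary file-like objects.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, 'w') as zout:
        # infolist() preserves the original order, so 'mimetype' stays first.
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename.lower().endswith(('.html', '.xhtml', '.htm')):
                text = data.decode('utf-8')  # assumption: UTF-8 content
                data = HEADER_RE.sub('', text).encode('utf-8')
            # The 'mimetype' entry must be stored uncompressed; deflate the rest.
            if info.filename == 'mimetype':
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            zout.writestr(info.filename, data, compress_type=method)
```

Usage would be along the lines of `strip_headers('book.epub', 'book-clean.epub')`, working on a copy so the original stays untouched.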

Last edited by MontyJ; 10-11-2012 at 11:50 PM. Reason: Update
Old 10-12-2012, 11:24 PM   #2
BetterRed
null operator (he/him)
BetterRed ought to be getting tired of karma fortunes by now.
 
Posts: 20,457
Karma: 26645808
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@MontyJ - did you try converting from PDF to PRC using the MobiCreator program? In my experience, about 90% of the time that gets rid of headers and footers without any effort on my part.

Then you can convert the PRC to EPUB and use Sigil or whatever to tidy up loose ends.

BR
Old 10-12-2012, 11:53 PM   #3
MontyJ
Addict
MontyJ began at the beginning.
 
Posts: 224
Karma: 10
Join Date: Jul 2012
Device: Kindle
Thanks BR, will look into that.

There is one other "insert obnoxious ad" outfit that does not seem to have a solution, however. It is ABB YYY or something like that; it puts big yellow images in every corner of a page with a clickable "Buy Now" link.

While most of the image markup can be removed automatically with regex code like the above, they somehow embed the closing TAGS of the last section of the image within normal text, and the placement is random, not predictable. So if you look for a simple string with opening and closing tags and remove everything between them, you remove some amount of normal text as well. Since the placement is random, the only way to get it out is a manual edit of every page.