Old 03-02-2010, 05:42 AM   #1
DarkKipper
Junior Member
DarkKipper began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Mar 2010
Location: London
Device: iPhone
Structure Detection - Remove Header (or Footer) Regex

Is there any good way of referencing variables like the title of the book in the regular expression?

I've noticed a lot of books, particularly if converted from PDF, have the book title in the header of every page, interfering with the flow of the text, like

title</p><p>

I have quite a good regex set up to remove the common file path footer, page numbers alone on a line, and traces of the abbyy and amber abc converters, and it would be nice to automatically remove a repeated title. I know I can always manually add the actual string for a specific conversion, but it'd be great to do it automatically.

Any thoughts?
Old 03-02-2010, 12:16 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
No, I'm afraid there isn't.
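Since the title cannot be referenced from inside the expression, the manual route DarkKipper mentions (adding the actual string for one specific conversion) can at least be made safer by escaping the title first. A minimal sketch, assuming Python's `re` module; the helper name `title_header_pattern`, the title, and the sample text are all hypothetical:

```python
import re

def title_header_pattern(title: str) -> str:
    """Build a regex for a literal book title followed by a </p><p> break."""
    # re.escape makes characters such as '.', '(' and ')' in the title literal.
    return re.escape(title) + r"\s*</p>\s*<p>"

# Hypothetical title and text, for illustration only.
pattern = title_header_pattern("The Scarlet Pimpernel (Vol. 1)")
sample = "she turned away The Scarlet Pimpernel (Vol. 1)</p><p>from the window"

cleaned = re.sub(pattern, "", sample)
print(pattern)  # paste this into the header regex box for that one conversion
print(cleaned)
```

The printed pattern still has to be pasted in by hand for each book; this only avoids mistakes when the title contains regex metacharacters.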
Old 03-02-2010, 12:55 PM   #3
Starson17
Wizard
Starson17 can program the VCR without an owner's manual.
 
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
Quote:
Originally Posted by DarkKipper View Post
I have quite a good regex set up to remove the common file path footer, page numbers alone on a line, and traces of the abbyy and amber abc converters
Will you share it here?
Old 03-10-2010, 02:46 AM   #4
TheBard
Bifocal Wearer
TheBard juggles running chainsaws for a bit of light exercise
 
Posts: 49
Karma: 38902
Join Date: Jan 2010
Location: USA
Device: Kobo Touch, Aura, Clara ...
Well, I'm not DarkKipper, but here are a few regular expressions I use. They have worked on my test files, but could probably be improved or modified:

Delete a header/footer that starts with "file:///" and ends with ".txt", ".htm", or ".html"
file:///.+\.(txt|html|htm)

Delete a line that starts with "file:///" and ends with a digit
file:///.+\d

Combine the two above
file:///.+(\d|(txt|html|htm))

Delete a segment of a line in which the segment ends with a specific string
.* - Baroness Orczy
(the " - Baroness Orczy" is in the line)


Here is one that seems to work, but might need a bit of tweaking. It looks for EITHER a line that starts with "file:///" and ends with a digit, OR a line that starts with a specified string, and deletes whatever it finds. Quite handy when looking for headers/footers that may vary somewhat across a subdirectory.
(file:///.+\d|Baroness Orczy.*)


Header with "Generated by ABC ... etc .html" (the ABC Amber header)
Generated by.+html

Google "The Regex Coach" for a very nice freeware that is extremely helpful in designing regexes.

Hope these help!
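For anyone who wants to try the expressions above outside calibre first, here is a minimal sketch using Python's `re` module. The patterns are copied from this post; the file paths in the sample lines and the helper `matched_text` are made up for illustration, and calibre's own handling of the text may differ.

```python
import re

# Patterns copied from the post above.
FILE_EXT = r"file:///.+\.(txt|html|htm)"
FILE_NUM = r"file:///.+\d"
COMBINED = r"file:///.+(\d|(txt|html|htm))"
AMBER = r"Generated by.+html"

# Hypothetical sample lines, for illustration only.
samples = [
    "file:///C:/books/orczy/pimpernel.txt",
    "file:///C:/books/orczy/pimpernel.htm (3 of 129) 12",
    "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html",
    "An ordinary line of story text.",
]

def matched_text(pattern, line):
    """Return the part of the line the pattern would remove, or None."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

for line in samples:
    print(repr(line))
    for name, pat in [("ext", FILE_EXT), ("num", FILE_NUM),
                      ("combined", COMBINED), ("amber", AMBER)]:
        print("   ", name, "->", repr(matched_text(pat, line)))
```

Note that `.+` is greedy, so each pattern runs to the last possible ending on the line rather than the first.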
Old 08-19-2010, 06:57 PM   #5
Wreybies
M.P.
Wreybies began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Aug 2009
Location: Puerto Rico
Device: iPad 64GB WiFi
Bard, those were excellent! Thank you because I was as clueless as Alicia Silverstone after her career nosedived.

Now, one footer that still bugbears me is when there is this on the end:

file:/// blah blah blah.txt (1 of 129) [2/4/03 9:31:57 PM]

When I run the regex to make that footer go away and test it before the actual conversion, the whole line of offending footer goes yellow as if it is going to go bye-bye, but in the end result everything from the "(1 of" part to the "PM]" remains in the final conversion. Even when I run it again, EPUB to EPUB this time to debug, I still get it, even though the test makes it look as if it will be deleted.

What am I doing wrong? Anyone?

I used: file:///.+.PM]
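The thread does not settle why the tail of the footer survived. As a sanity check on the expression alone, this sketch runs it through Python's `re` module against the footer quoted above. The stricter variant `STRICT` is an assumption added for comparison, not something from the thread, and calibre's conversion pipeline is not reproduced here.

```python
import re

# Footer line as quoted in the post.
footer = "file:/// blah blah blah.txt (1 of 129) [2/4/03 9:31:57 PM]"

ORIGINAL = r"file:///.+.PM]"        # as posted; the bare ']' is a literal here
STRICT = r"file:///.+\s[AP]M\]"     # assumed variant with the bracket escaped

for name, pat in [("original", ORIGINAL), ("strict", STRICT)]:
    m = re.search(pat, footer)
    print(name, "matches whole footer:", m is not None and m.group(0) == footer)

# Removing the match from a line of text that ends with the footer.
print(repr(re.sub(ORIGINAL, "", "end of page text " + footer)))
```

In plain Python both expressions cover the footer up to and including "PM]", which is consistent with the preview highlighting the whole line.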

Last edited by Wreybies; 08-19-2010 at 06:59 PM.
Old 09-11-2010, 08:27 PM   #6
PCreighton
Enthusiast
PCreighton began at the beginning.
 
Posts: 27
Karma: 10
Join Date: Aug 2010
Location: Ontario Canada
Device: Kindle 2; Kindle WIFI 6";IPAD 2
Sorry, I'm really new to this.

So I understand you use a regex, but where do you place this line to exclude the header and footer?

Last edited by PCreighton; 09-11-2010 at 08:30 PM.
Old 09-11-2010, 09:04 PM   #7
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by PCreighton View Post
So I understand you use a regex, but where do you place this line to exclude the header and footer?
In the conversion options, go to Structure detection. There is a text box for header/footer regexes, as well as a preview function, so you can write the regex and see where it matches in the file itself.
Old 09-17-2010, 11:12 AM   #8
Wreybies
M.P.
Wreybies began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Aug 2009
Location: Puerto Rico
Device: iPad 64GB WiFi
Quote:
Originally Posted by ldolse View Post
In the conversion options, go to Structure detection. There is a text box for header/footer regexes, as well as a preview function, so you can write the regex and see where it matches in the file itself.
The correct area of Calibre looks like this: [screenshot not preserved]

Make sure to check the box for Remove Header. You needn't bother with Remove Footer; I have found that it doesn't really work. The Remove Header area can be used to remove both headers and footers: when the program searches for strings that match your regex, it makes no distinction about where a string is physically located. If you click the magic wand lookin' thingie to the right, you get a preview of the text with all the HTML tags in place, and you can put your regex in the area provided at the top of the preview window to test whether it flags for removal the items you really want to remove. Don't get frustrated. This is a trial-and-error process when the strings vary, and you may need to repeat it if there are different kinds of strings that you want gone.

EDIT ~ And on a side note: Quite often the removal of a header or footer will cause inappropriate paragraph breaks. The string of the header or footer has been removed, but if you don't also remove the HTML tags that surround it, those leftover tags may well cause paragraph breaks or extra carriage returns. If you are a picky bugger like me, you will want to take those tags into account when you write the regex for the things you want gone.
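To illustrate the side note about surrounding tags, here is a minimal sketch in Python's `re` module. The HTML fragment and both patterns are hypothetical; `[^<]+` is used instead of `.+` in this sketch so the footer match cannot run across a tag.

```python
import re

# Hypothetical fragment: a footer stranded in its own paragraph mid-sentence.
html = "<p>He opened the</p><p>file:///C:/books/novel.txt 12</p><p>door slowly.</p>"

# Matches only the footer text itself.
FOOTER_ONLY = r"file:///[^<]+\d"
# Also consumes the paragraph tags on either side of the footer.
FOOTER_WITH_TAGS = r"</p>\s*<p>" + FOOTER_ONLY + r"</p>\s*<p>"

footer_removed = re.sub(FOOTER_ONLY, "", html)
tags_removed = re.sub(FOOTER_WITH_TAGS, " ", html)

print(footer_removed)  # an empty <p></p> is left behind
print(tags_removed)    # the sentence is rejoined in one paragraph
```

Replacing with a single space rather than nothing keeps the two halves of the sentence from running together. Whether calibre's header field lets you choose a replacement string is not covered in this thread, so treat that part as specific to this sketch.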

Last edited by Wreybies; 09-17-2010 at 02:08 PM.
Old 09-19-2010, 04:46 AM   #9
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Thanks - I've been searching a while for how to remove the "generated by....".

So I get that I should put
file:///.+(\d|(txt|html|htm))
into the structure detection preferences, but what happens to the default expression that's already in there (which I don't really understand)?

Do I overwrite it with the above, and if so, what do I lose, i.e. what was the default expression doing that yours may not do?

Last edited by cybmole; 09-19-2010 at 05:09 AM.
Old 09-19-2010, 05:00 AM   #10
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
I can't get this to work:
I have a book in EPUB with lots of instances of Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html

So I changed header detection to file:///.+(\d|(txt|html|htm)) and ticked the boxes as per the above instructions, then forced a conversion from EPUB to EPUB - the offending spam is still there???

Also, I've screwed up - I copied the default regex to Notepad so that I could put it back again, but I did not grab the entire line. How do I restore the default expression, please? Is it the same as the footer expression default?

Last edited by cybmole; 09-19-2010 at 05:06 AM.
Old 09-19-2010, 05:12 AM   #11
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You overwrite the default expression. As long as you do it from the convert dialog and not the preferences dialog, you don't lose anything; the default is only replaced for that book. Anyway, the default is just an example, and it generally needs to be edited in order for anything to match.

file:///.+(\d|(txt|html|htm))

won't get rid of the Amber LIT converter message; it gets rid of the headers/footers that browsers insert when printing HTML to PDF. You'll need a different regex for Amber LIT. Try this:
Code:
(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?

Last edited by ldolse; 09-19-2010 at 06:31 AM.
Old 09-19-2010, 05:20 AM   #12
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Originally Posted by ldolse View Post
You overwrite the default expression. As long as you do it from the convert dialog and not the preferences dialog, you don't lose anything; the default is only replaced for that book. Anyway, the default is just an example, and it generally needs to be edited in order for anything to match.

file:///.+(\d|(txt|html|htm))

won't get rid of the Amber LIT converter message; it gets rid of the headers/footers that browsers insert when printing HTML to PDF. You'll need a different regex for Amber LIT. Try this:
Code:
<A name=\d+>\s*</a>\s*(<[biu]>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?</a>)?\s*(</[ibu]>)?\s*<br>
Thanks - but it did not work. I pasted in everything inside the [code]...[/code] block above on the convert - structure detection page and forced a reconvert, but the Amber LIT stuff is still there.

as I'm a total noob with regex, could you please look at the attached .epub book & tell me what will work - thanks.

Last edited by cybmole; 09-19-2010 at 06:00 AM.
Old 09-19-2010, 05:33 AM   #13
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
You can't put copyrighted books on MobileRead; I suggest you edit your post and delete it. All you need to do is open the structure detection options and click the magic wand. Find one instance of the 'generated by' message, copy that text and a few surrounding lines, and paste it into a phpBB code block. The EPUB is worthless for analysis, as a fair amount of processing happens between the footer removal stage and the EPUB output stage.
Old 09-19-2010, 05:45 AM   #14
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Well I do own the paper copy! but OK -
I never knew what the magic wand was for ...

So I followed your instructions as far as locating an instance of the offending spam, and now I'm stuck.

is this what you need to see:
Code:
"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was

NB: I only have the book in this EPUB format - that's the format that I found it in. The whole series has this spam throughout.

Last edited by cybmole; 09-19-2010 at 05:47 AM.
Old 09-19-2010, 05:56 AM   #15
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
I thought you were converting from pdf to epub?

Code:
(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?
That should take care of more variants of the spam, including yours.
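As a rough check outside calibre, the sketch below runs the expressions from this thread against the snippet cybmole posted, using Python's `re` module. The variable names are made up for illustration, and calibre's own processing between input and output is not reproduced.

```python
import re

# Snippet as posted by cybmole in post #14.
sample = '''"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was'''

# Browser header/footer expression from earlier in the thread.
FILE_RE = r"file:///.+(\d|(txt|html|htm))"
# Earlier Amber expression, as quoted in post #12.
EARLIER = r"<A name=\d+>\s*</a>\s*(<[biu]>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?</a>)?\s*(</[ibu]>)?\s*<br>"
# Revised Amber expression from the post above.
REVISED = r"(<A name=\d+>\s*</a>)?\s*(<[biu][^>]*>)?\s*Generated\s+by\s+(ABC)?\s+Amber[^<]*(<a\shref=.*?processtext.*?>)?\s*(.*?processtext.*?</a>)?(</[ibu]>)?\s*(<br>\s*)?"

print("file regex matches:   ", re.search(FILE_RE, sample) is not None)
print("earlier regex matches:", re.search(EARLIER, sample) is not None)

cleaned = re.sub(REVISED, "", sample)
print(cleaned)
```

In this sketch only the revised expression matches. The earlier one requires a leading `<A name=...>` anchor and a trailing `<br>`, neither of which appears in the snippet. The cleaned text still contains an empty `<p class="calibre4"></p>`, which is the leftover-tag effect Wreybies describes in post #8.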

Please remove the book from your previous post - it doesn't matter whether you own it, the problem is posting it to a public bulletin board that doesn't condone piracy.

Last edited by ldolse; 09-19-2010 at 06:32 AM.