Regex help to remove HTML footer

neonbible · 09-09-2010, 05:38 AM

This is the HTML code:

Code:

<br clear="all"/><hr/><div class="center"><small><a href="slide19.html">previous</a> |
<a href="toc.html">Table of Contents</a> |
<a href="slide21.html">next</a></small></div>
</body>
</html>

Now for each page the anchor tags are going to change to point to different links. What expression do I need to use to match it for each page?

neonbible · 09-09-2010, 05:51 AM

Ok I managed to find the answer.

Used .+\

When I test the expression, it highlights the sections correctly. However, after the conversion they are still there, even though I ticked Remove Footer.

ldolse · 09-09-2010, 08:29 AM

Not sure what you're trying to remove from that html code. Are you saying every page has 'previous', 'table of contents', and 'next' links?

.+\ should just be .+, but try [^>]* instead: it stops at the first > and so can't run past the end of the tag the way a greedy .+ can. You also need to account for variable spacing across line breaks and between tags; \s* helps with that. If some of the parts don't occur every time, surround them with parentheses, e.g. "(<br[^>]*>)", and add a question mark to make the group optional: "(<br[^>]*>)?"

Try something like this:

Code:

<br[^>]*>\s*<hr/>\s*<div[^>]*>\s*<small>\s*<a\shref[^>]*>\s*previous\s*</a>\s*\|\s*<a\shref[^>]*>\s*Table\sof\sContents\s*</a>\s*\|\s*<a\shref[^>]*>\s*next\s*</a>\s*</small>\s*</div>
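If you want to check an expression like the one above outside of Calibre, a quick Python test can help. This is only a sketch: it assumes Python's re module behaves like the regex engine Calibre uses for this pattern, and the names FOOTER, SAMPLE and strip_footer are made up for illustration. The "Slide text" paragraph is invented filler; the footer itself is the HTML from the first post.

Code:

```python
import re

# The pattern from the post above, split over three lines for readability only.
FOOTER = re.compile(
    r'<br[^>]*>\s*<hr/>\s*<div[^>]*>\s*<small>\s*<a\shref[^>]*>\s*previous\s*</a>'
    r'\s*\|\s*<a\shref[^>]*>\s*Table\sof\sContents\s*</a>'
    r'\s*\|\s*<a\shref[^>]*>\s*next\s*</a>\s*</small>\s*</div>'
)

# Footer HTML copied from the first post; the <p> line is placeholder content.
SAMPLE = '''<p>Slide text</p>
<br clear="all"/><hr/><div class="center"><small><a href="slide19.html">previous</a> |
<a href="toc.html">Table of Contents</a> |
<a href="slide21.html">next</a></small></div>
</body>
</html>
'''


def strip_footer(html):
    """Return html with every occurrence of the navigation footer removed."""
    return FOOTER.sub('', html)


cleaned = strip_footer(SAMPLE)
print(cleaned)
```

Because the link targets are matched with [^>]*, the same pattern should fit every page regardless of which slide numbers appear in the href values.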

jackie_w · 09-09-2010, 08:38 AM

As your source is HTML, if all else fails, you could always try editing the HTML in a text editor before importing to Calibre.

For example, Notepad++ is a very good free text editor. It supports regex and lets you find/replace across multiple open files in one go.
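If you would rather script the clean-up than do it in an editor, something along these lines might work. It is a rough sketch, not a tested tool: it assumes the files are UTF-8, that they all sit in one folder with an .html extension, and it reuses the pattern posted earlier in this thread. The function name strip_footers is made up. It overwrites files in place, so run it on a copy of the folder.

Code:

```python
import re
from pathlib import Path

# Footer pattern taken from the earlier reply in this thread.
FOOTER = re.compile(
    r'<br[^>]*>\s*<hr/>\s*<div[^>]*>\s*<small>\s*<a\shref[^>]*>\s*previous\s*</a>'
    r'\s*\|\s*<a\shref[^>]*>\s*Table\sof\sContents\s*</a>'
    r'\s*\|\s*<a\shref[^>]*>\s*next\s*</a>\s*</small>\s*</div>'
)


def strip_footers(folder):
    """Remove the footer from every *.html file in folder.

    Returns the names of the files that were changed.
    Assumes UTF-8 files; work on a copy, since files are overwritten.
    """
    changed = []
    for path in sorted(Path(folder).glob('*.html')):
        text = path.read_text(encoding='utf-8')
        cleaned, count = FOOTER.subn('', text)
        if count:  # only rewrite files where the footer was actually found
            path.write_text(cleaned, encoding='utf-8')
            changed.append(path.name)
    return changed
```

Calling strip_footers on the folder of slides and then importing the cleaned files into Calibre would sidestep the conversion-time footer option altogether.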

theducks · 09-09-2010, 09:42 AM

Quote:

Originally Posted by neonbible

This is the HTML code:

Code:

<br clear="all"/><hr/><div class="center"><small><a href="slide19.html">previous</a> |
<a href="toc.html">Table of Contents</a> |
<a href="slide21.html">next</a></small></div>
</body>
</html>

Now for each page the anchor tags are going to change to point to different links. What expression do I need to use to match it for each page?

You may need to replace the digits in the "slide##" strings with wildcards, as they are unique for each "Next" and "Previous" link.
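To illustrate the wildcard idea, \d+ stands in for "one or more digits", so slide\d+\.html would cover slide19.html, slide21.html and so on. The snippet below is just a sketch in Python, assuming the links always have the exact form shown in the first post; SLIDE_LINK is a made-up name and the slide7 link is an invented extra test case.

Code:

```python
import re

# \d+ replaces the page-specific number; \. matches the literal dot.
SLIDE_LINK = re.compile(
    r'<a\s+href="slide\d+\.html">\s*(?:previous|next)\s*</a>'
)

links = [
    '<a href="slide19.html">previous</a>',
    '<a href="slide21.html">next</a>',
    '<a href="slide7.html">previous</a>',   # invented example, different number
]

# Every slide link matches, whatever its number.
matches = [bool(SLIDE_LINK.search(link)) for link in links]

# The table-of-contents link has no digits, so this pattern skips it.
toc_match = bool(SLIDE_LINK.search('<a href="toc.html">Table of Contents</a>'))

print(matches, toc_match)
```

This is stricter than matching the whole attribute with [^>]*, which may be useful if the book contains other links you want to keep.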



