[Regex] How to remove a whole final section from a blog post?

HenryHutton · 02-25-2023, 06:47 AM

I "news fetched" some posts from an author's blog, which I would like to have on my e-reader as an ebook.

Now I am trying to use the Editor to cut the last section of each post, which contains links and information I don't need in the final ebook.

All posts end with a signature, a motto, which is:

Code:

<p class="calibre10"><span>[Il mondo è bello, siamo noi ad esser ciechi]</span></p>

So my aim is an expression that matches from this signature (as a (group) to feed the "replace" field) down to

Code:

</body>

</html>

(ideally as a second (group)), so as to trim out all the links and unneeded information.

Well, so far I haven't achieved much.

My best guess was...

Code:

(\[Il mondo è bello, siamo noi ad esser ciechi\])*</body>\w+</html>

but of course it doesn't work.

Any other functions/tricks that would achieve the same output are welcome!

I am running short of time, which is why I am asking for hints instead of reading and learning more (or editing them all out manually).

I attach one of the fetched HTML files.

The blog is reachable here, for the record:
http://www.salvatorebrizzi.com/

theducks · 02-25-2023, 07:29 AM

try replacing the \w+ with \s+ after </body>
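A minimal sketch of why this swap matters, written with Python's `re` module (the calibre editor's regex flavour is similar but not identical, so treat this as an illustration only). The `tail` string is an assumed stand-in for the end of a fetched file, not the actual attachment.

```python
import re

# Assumed stand-in for the end of a fetched post: only whitespace
# (blank lines) sits between </body> and </html>.
tail = "</body>\n\n</html>"

# \w matches letters, digits and underscore, so it cannot span the newlines.
word_match = re.search(r"</body>\w+</html>", tail)

# \s matches whitespace characters, including newlines.
space_match = re.search(r"</body>\s+</html>", tail)

print("with \\w+:", word_match)
print("with \\s+:", space_match is not None)
```

This only covers the `</body>` ... `</html>` part of the expression; the `(...)*` in front of it still does not skip over the unwanted links.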

lomkiri · 02-25-2023, 08:30 AM

Quote:

Originally Posted by theducks

try replacing the \w+ with \s+ after </body>

To strip out the whole footer, I would rather do this search/replace, don't you think?

Code:

search:
\s<p class="calibre10"><span>\[Il mondo è bello, siamo noi ad esser ciechi\].*</body>

replace:
</div>\n\n  </div>\n\n</body>

@Henry:
"Dot all" must be checked.
(The cursor must be at the top of the file, or at least before the part that will be removed.)

No group is necessary (unless you want to put </body> in a group), since you're not reusing anything from the matched text.
The two </div> in the replace field are necessary; without them the code would be unbalanced and the book check (F7) would fail.
* alone is not enough to "select everything"; it is only a quantifier. You need .* or .*? to select everything (greedy or non-greedy, respectively).
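For anyone who wants to try the idea outside the editor, here is a rough Python sketch of the same search/replace. The `sample` markup is hypothetical: its two wrapping `<div>` elements are only inferred from the replace string above, not copied from the attached file. `re.DOTALL` plays the role of the "dot all" checkbox.

```python
import re

# Hypothetical post; the real fetched files may be structured differently.
sample = (
    "<html>\n<body>\n  <div>\n    <div>\n"
    '<p class="calibre10">Post text.</p>\n'
    '<p class="calibre10"><span>[Il mondo è bello, siamo noi ad esser ciechi]</span></p>\n'
    '<p class="calibre10"><a href="#">unwanted link</a></p>\n'
    "</div>\n\n  </div>\n\n</body>\n\n</html>\n"
)

search = (
    r'\s<p class="calibre10"><span>'
    r"\[Il mondo è bello, siamo noi ad esser ciechi\].*</body>"
)
# Same text as the replace field, with \n written as real newlines.
replace = "</div>\n\n  </div>\n\n</body>"

# With DOTALL the dot also matches newlines, so .* can run down to </body>.
cleaned = re.sub(search, replace, sample, count=1, flags=re.DOTALL)

# Without DOTALL the dot stops at the first newline and nothing matches.
unchanged = re.sub(search, replace, sample, count=1)

print(cleaned)
```

Note that the match starts at the signature paragraph, so the motto line itself is removed along with the links; the two re-added `</div>` keep the sample balanced.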
