MobileRead Forums - View Single Post - REGEX to Remove Embedded Header/Footer in the Text from a PDF?

enuddleyarbl · 03-25-2022, 06:53 PM

Thanks for the help. I had already used Notepad++ to remove the header/footer stuff from the plain text. Since your regex was searching for HTML tags, I imported my text into a Calibre book, converted it to EPUB and edited it from within Calibre. Easy peasy.

Your replacement code should be \1 \2 instead of /1 /2 (backreferences take a backslash, not a forward slash). Other than that, it seems to have worked very nicely. I'm sure there's an instance or two where a "paragraph" ended with something like a comma instead of a lowercase letter (so it wouldn't have been caught), but I'll have to read through to find them. I'd say those edits got me to a nicely readable text.
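For anyone following along, here's a rough Python sketch of the kind of replacement being described. The actual regex from the earlier post isn't quoted here, so the pattern below (BROKEN_PARAGRAPH) and the helper name join_broken_paragraphs are my own guesses at its general shape, not the original. It joins a paragraph that ends in a lowercase letter with a following paragraph that starts with one, using \1 \2 to put the two captured letters back with a space between them.

```python
import re

# Hypothetical pattern (an assumption, not the regex from the earlier post):
# group 1 = lowercase letter that ends one <p>, group 2 = lowercase letter
# that starts the next <p>.
BROKEN_PARAGRAPH = re.compile(r"([a-z])</p>\s*<p[^>]*>([a-z])")


def join_broken_paragraphs(html: str) -> str:
    """Merge paragraphs that were split mid-sentence.

    The replacement uses backslash backreferences (\\1 and \\2); writing
    /1 /2 would insert those characters literally instead.
    """
    return BROKEN_PARAGRAPH.sub(r"\1 \2", html)


sample = "<p>The quick brown</p>\n<p>fox jumped,</p>\n<p>then it slept.</p>"
print(join_broken_paragraphs(sample))
```

With this sample, the first two paragraphs get merged, but the break after "jumped," is left alone because the paragraph ends with a comma rather than a lowercase letter. That's the leftover case that would still need a read-through (or a looser pattern).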

Thanks, again.