MobileRead Forums - View Single Post - REGEX to Remove Embedded Header/Footer in the Text from a PDF?

retiredbiker · 03-25-2022, 03:55 PM

You are on the right track. As you have seen, sometimes the header or footer will be on its own, and sometimes it is mixed in with other text. Also, maybe not in your current book but very often, you will find varied spellings in the headers/footers: III (Roman numeral 3) often appears as Ill (capital I followed by two lowercase Ls), and so on. Titles can be mangled in just about any way. It all depends on the OCR that created the text, if the text was done with OCR.

It gets bad enough that, for example, if I want an epub of something from Internet Archive, I usually find doing my own OCR is quicker and easier than trying to correct their horrible epub, where rotten headers and footers abound, along with other stuff.

The result is that there is no magic expression, regex or plain, that will do the job completely. So do it in the editor, rather than at conversion. Usually with one of these books I go through with several expressions, and still find junk when scrolling slowly through afterwards. Several simple expressions are much better than trying to find one magical one to do everything. And depending on how consistent, or not, the book is, "replace all" may never be the best choice; stepping through the matches one at a time is safer.
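To illustrate the "several simple expressions" idea outside the editor, here is a rough Python sketch. The header text ("CHAPTER" plus a Roman numeral and a page number) and the names HEADER_PATTERNS, find_header_candidates and remove_headers are made up for the example; your book's headers will look different. The character class allows l and 1 alongside I, V and X to catch the usual OCR misreads.

```python
import re

# Hypothetical running header: "CHAPTER III" with a page number before or after.
# Each pattern is deliberately simple and covers a single variant.
HEADER_PATTERNS = [
    re.compile(r'<p[^>]*>\s*\d+\s+CHAPTER\s+[IVXl1]+\s*</p>\s*'),  # page number first
    re.compile(r'<p[^>]*>\s*CHAPTER\s+[IVXl1]+\s+\d+\s*</p>\s*'),  # page number last
]

def find_header_candidates(html):
    """List (position, matched text) pairs so every hit can be checked by eye."""
    hits = []
    for pattern in HEADER_PATTERNS:
        for match in pattern.finditer(html):
            hits.append((match.start(), match.group(0)))
    return sorted(hits)

def remove_headers(html):
    """Delete every match of every pattern (the "replace all" case)."""
    for pattern in HEADER_PATTERNS:
        html = pattern.sub('', html)
    return html
```

Running find_header_candidates first and reading the list is the scripted equivalent of stepping through matches by hand. Headers that are mixed into a paragraph of real text will not match these patterns, so scrolling through afterwards is still needed.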

As to the badly split paragraphs (once the headers are out of the way), some simple regex can help there. Try searching for

Code:

([a-z])</p>\s+<p class="indent">([a-z])

and replace with

Code:

\1 \2

Change "indent" to whatever class your main paragraph style uses. This, and some small variations of it, can quickly do a lot of fixing.
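If you want to try the paragraph-joining expression outside the editor first, a minimal Python sketch follows. MAIN_CLASS and join_split_paragraphs are names invented for this example, and "indent" is only an assumed class name. Note that Python's re module writes the backreferences with backslashes, \1 and \2.

```python
import re

# Assumption: the book's main paragraph class is "indent"; change to match yours.
MAIN_CLASS = "indent"

# A lowercase letter ending one paragraph, then a lowercase letter starting the
# next, almost always means a paragraph that was split in the middle of a sentence.
SPLIT_PARAGRAPH = re.compile(
    r'([a-z])</p>\s+<p class="' + re.escape(MAIN_CLASS) + r'">([a-z])'
)

def join_split_paragraphs(html):
    """Join the two halves, leaving a single space between the words."""
    return SPLIT_PARAGRAPH.sub(r'\1 \2', html)
```

Paragraphs that end with a full stop or start with a capital letter are left alone, which is why the pattern insists on [a-z] on both sides of the break.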

Edit: You don't need CR/LF; just add a space character to your replace string so it appears between the words. (And if you did want a line break here, you would only need a newline character, \n, anyway.)

Last edited by retiredbiker; 03-25-2022 at 04:02 PM.