Old 03-24-2022, 10:02 PM   #1
enuddleyarbl
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
REGEX to Remove Embedded Header/Footer in the Text from a PDF?

Well, that title probably makes no sense. I've got the text from a PDF file that has the header and footer information embedded within it (so the header/footer looks just like actual text). Something like:

Quote:
blah blah blah

[page number] <== ex Footer

[title] <== ex Header

blah blah blah.
I'd like to convert that text to something passably readable as an EPUB via Calibre. But, before conversion, I need to get rid of those header/footer combinations. My REGEX knowledge is only microscopically above the zero point, and the best I could figure out as a way to find those headers/footers is:

Code:
\s+\d+\s+TITLE\s+
For my own future knowledge, I'll put what I think those codes mean in here:

\s matches a whitespace character
+ matches one or more of the preceding item
\d matches a digit
TITLE is the title of the document, which got stuck in what used to be a header

So, it looks like that REGEX should grab from the start of the whitespace before the page number and run through the title to the end of the whitespace where the actual text picks up again. Probably not the best bit of REGEX, but it seems to work.
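To check that reading of the pattern, here is a minimal Python sketch (Calibre's regex engine is Python-based, but this is only a stand-alone illustration, not Calibre's own code). The sample string and the variable names are made up for the example; TITLE stands in for the real running title, as in the post.

```python
import re

# TITLE is a placeholder for the literal running title of the book.
HEADER_FOOTER = re.compile(r"\s+\d+\s+TITLE\s+")

# Hypothetical sample: page number (old footer) and title (old header)
# sitting between two stretches of real text.
sample = "end of one page.\n\n\n11\n\n\n\nTITLE\n\nStart of the next page."

match = HEADER_FOOTER.search(sample)
matched_text = match.group(0)       # whitespace + number + whitespace + TITLE + whitespace
before = sample[:match.start()]     # real text before the footer
after = sample[match.end():]        # real text after the header

print(repr(matched_text))
print(repr(before))
print(repr(after))
```

The match starts at the first whitespace after the real text and ends just before the text resumes, so the leading and trailing blank lines go away together with the number and the title.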

If the text before that header/footer combination is the end of a paragraph, that's fine. But, if the header/footer combination occurs right in the middle of a sentence, then removing it will result in the continuation "paragraph" being smashed right up against the paragraph that was before the header/footer.

For instance:

Quote:
Lit lognued in one of the gseut criahs in N’kcis ofcife, his lnog lges spilwarng
far asorcs the rgu. He was attauneted rehtar tahn bgi. Too mcuh of his chohdliod


11



TITLE

had been snept in fere flla. Now he cluod not fit itno a stadnard prussere siut
or sparcecaft cniba; and whvereer he sta, he lekood lkie he was tniyrg to tkae orev.
would be transmogrified to:

Quote:
Lit lognued in one of the gseut criahs in N’kcis ofcife, his lnog lges spilwarng
far asorcs the rgu. He was attauneted rehtar tahn bgi. Too mcuh of his chohdliodhad been snept in fere flla. Now he cluod not fit itno a stadnard prussere siut
or sparcecaft cniba; and whvereer he sta, he lekood lkie he was tniyrg to tkae orev.
Can anyone come up with a better way to strip out all those headers/footers?

EDIT: I guess if I replace the selection with a CR LF (\r\n), that would work reasonably well. It doesn't look like it would be any worse than all the other lines ending with CR LF. I'll have to check and see if Calibre's conversion routine gets rid of those.
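A small Python sketch of the difference, assuming the same pattern as above and an invented sample sentence: an empty replacement glues the two halves together, while a CR LF replacement keeps them apart.

```python
import re

HEADER_FOOTER = re.compile(r"\s+\d+\s+TITLE\s+")

# Hypothetical sentence that is interrupted by a footer/header pair.
text = "the sentence starts here\n\n\n11\n\n\n\nTITLE\n\nand ends here."

# Replacing with nothing also removes the whitespace between the words.
smashed = HEADER_FOOTER.sub("", text)

# Replacing with CR LF leaves an ordinary line break in its place.
with_crlf = HEADER_FOOTER.sub("\r\n", text)

print(repr(smashed))
print(repr(with_crlf))
```

In Notepad++ the equivalent is simply putting \r\n in the "Replace with" box with regular-expression mode enabled.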

Last edited by enuddleyarbl; 03-24-2022 at 11:00 PM.
Old 03-25-2022, 03:55 PM   #2
retiredbiker
Evangelist
Posts: 450
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
You are on the right track. As you have seen, sometimes the header or footer will be on its own, sometimes mixed in with other text. Also, maybe not in your current book but very often, you will find varied spellings in the headers/footers: III (Roman numeral 3) often appears as "Ill", and so on. Titles can be just anything. It all depends on the OCR that created the text, if the text was done with OCR.

To the point where, for example, if I want an epub of something from Internet Archive, I usually find doing my own OCR is quicker and easier than trying to correct their horrible epub, where rotten headers and footers abound, along with other stuff.

The result is, there is no magic expression, regex or plain, that will do the job completely. So do it in the editor, rather than at conversion. Usually with one of these I go through with several expressions, and still find the junk when scrolling slowly through. Several simple expressions are much better than trying to find one magical one to do everything. And depending on how consistent, or not, the book is, "replace all" may never be the best choice.

As to the badly split paragraphs (once the headers are out of the way), some simple regex can help there. Try searching for
Code:
([a-z])</p>\s+<p class="indent">([a-z])
and replace with
Code:
/1 /2
Adjust the "indent" bit with whatever your main paragraph style is. This, and some small variations of it, can quickly do a lot of fixing.

Edit: You don't need CR/LF, just add a space character to your replace string so it appears between the words. (And here you would only need a newline character, \n, anyway.)

Last edited by retiredbiker; 03-25-2022 at 04:02 PM.
Old 03-25-2022, 05:53 PM   #3
enuddleyarbl
Guru
Thanks for the help. I had used Notepad++ to remove the header/footer stuff from the pure text. Since your regex was searching for HTML tags, I imported my text into a Calibre book, converted it to EPUB and edited it from within Calibre. Easy peasy.

Your replacement code should be \1 \2 instead of /1 /2. But, other than that, it seems to have worked very nicely. I'm sure there might be an instance or two where a "paragraph" ended with something like a comma instead of a lower case letter (so it wouldn't be caught). But, I'll have to read through to find them. I'd say those edits got me to a nicely readable text.
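For reference, the corrected search and replace can be tried outside the editor with a short Python sketch. The HTML snippets are invented, and the LOOSER pattern (which also accepts a trailing comma or semicolon) is only an assumed variation for the missed cases mentioned above, not something taken from the thread.

```python
import re

# The search expression from the earlier reply, with class "indent"
# standing in for whatever the main paragraph style is.
PARA_BREAK = re.compile(r'([a-z])</p>\s+<p class="indent">([a-z])')

# Hypothetical variation: also join when the first half ends in , or ;
LOOSER = re.compile(r'([a-z,;])</p>\s+<p class="indent">([a-z])')

html = '<p class="indent">Too much of his</p>\n<p class="indent">time was spent here.</p>'
html_comma = '<p class="indent">After a pause,</p>\n<p class="indent">he went on.</p>'

# \1 and \2 put back the captured letters, with a space between them.
joined = PARA_BREAK.sub(r"\1 \2", html)
missed = PARA_BREAK.sub(r"\1 \2", html_comma)   # left unchanged by the strict pattern
joined_comma = LOOSER.sub(r"\1 \2", html_comma)

print(joined)
print(missed)
print(joined_comma)
```

The looser pattern will also join some breaks that really are paragraph ends, so stepping through matches one at a time is safer than "replace all".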

Thanks, again.
Old 03-25-2022, 08:01 PM   #4
retiredbiker
Evangelist
Yeah, typing off the cuff, too fast. Sorry for the /1 /2
Old 06-24-2023, 03:30 AM   #5
Shohreh
Addict
Posts: 207
Karma: 304158
Join Date: Jan 2016
Location: France
Device: none
Before converting it to, e.g., EPUB, a simpler solution than regexes for getting rid of headers and footers from a PDF is to… remove them from the PDF before calling Calibre :-)

Here's a script to mark those sections as "redaction annotations" and remove them.