MobileRead Forums - View Single Post - pdf to epub regex unicode character match not working

retiredbiker · 09-11-2021, 04:16 PM

I would try something simpler:

chapter.*?\d+
and
report.*?•.*?discuss
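
A quick way to see what these two patterns catch is to run them over a few lines outside the conversion. This is only a sketch: the sample lines are made up, the helper name `matched_text` is hypothetical, and the case-insensitive flag is an assumption added here, not something stated in the post.

```python
import re

# The two patterns from the post. IGNORECASE is an assumption for this sketch.
CHAPTER = re.compile(r"chapter.*?\d+", re.IGNORECASE)
REPORT = re.compile(r"report.*?•.*?discuss", re.IGNORECASE)

# Hypothetical header/footer lines, not taken from any real book.
samples = [
    "Chapter 12",
    "CHAPTER   3",
    "Annual Report • 2021 • Discussion Draft",
    "An ordinary sentence of body text.",
]


def matched_text(pattern, line):
    """Return the part of the line the pattern matched, or None."""
    m = pattern.search(line)
    return m.group(0) if m else None


for line in samples:
    print(repr(line), "->", matched_text(CHAPTER, line), "|", matched_text(REPORT, line))
```

Note that the lazy `.*?` only stretches as far as it must, so the report pattern stops at "Discuss" rather than swallowing the rest of the line.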

But I have found that headers and footers in OCR'd PDFs often come across with strange spacing, scannos, and all sorts of cruft. A pattern will often catch some of them but miss others. So I do this in the Editor, after conversion, where I have a chance to find the exceptions. Doing it that way also gives you a chance to reconnect text that was separated by the header or footer at the page break.
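
To illustrate the reconnecting step, here is a rough sketch in plain Python. It is not how the Editor works internally; the function name `strip_and_rejoin`, the whole-line header pattern, and the "previous line does not end a sentence" rule are all assumptions made for the example.

```python
import re

# Assumed: a running header sits on a line of its own and ends in a number.
HEADER = re.compile(r"^\s*chapter.*?\d+\s*$", re.IGNORECASE)

SENTENCE_END = (".", "!", "?")


def strip_and_rejoin(lines):
    """Drop header-like lines and rejoin text that a header split in two.

    If the line before a header does not end a sentence, the line after
    the header is treated as its continuation and appended to it.
    """
    out = []
    pending_join = False
    for line in lines:
        if HEADER.match(line):
            # Only rejoin when the text before the header was cut mid-sentence.
            pending_join = bool(out) and not out[-1].rstrip().endswith(SENTENCE_END)
            continue
        if pending_join and out:
            out[-1] = out[-1].rstrip() + " " + line.lstrip()
        else:
            out.append(line)
        pending_join = False
    return out


page = ["The rider crossed the", "Chapter 4", "bridge at dawn.", "Next paragraph."]
print(strip_and_rejoin(page))
```

A rule this simple will miss the odd cases mentioned above (scannos, stray spacing), which is the reason for checking the matches by hand rather than trusting one pass.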
