pdf to epub regex unicode character match not working

marcio_oliveira · 09-11-2021, 06:17 AM

Hello, I'm trying to convert a pdf book to epub that has a header and a footer I'd like to remove. The header has the chapter name, the symbol • and the page number, for example “Chapter 3. Interfacing with Humans • 41” and the footer is "report erratum • discuss".

I have tried a few ways to match this header and footer:
/.+ • [0-9]+$/g
report erratum • discuss

/.+ \u2022 [0-9]+$/g
report erratum \u2022 discuss

/.+ \W [0-9]+$/g
report erratum \W discuss

but none of these work. I would be glad if someone could help, thanks!

I'm passing these through the sr1-search and sr2-search options of the ebook-convert CLI.

theducks · 09-11-2021, 06:30 AM

Are your slashes wrong for Calibre PCRE?

Code:

\.+
\g

And have you tried using \s (for any kind of space)?
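To make the two points above concrete, here is a minimal sketch using Python's standard re module. It assumes calibre's search and replace behaves like Python regular expressions, so the surrounding /.../g is not delimiter syntax and gets read as literal characters. The sample strings are taken from the first post; the variable names and the non-breaking-space variant are made up for illustration.

```python
import re

# Sample header from the first post (the bullet is U+2022).
header = "Chapter 3. Interfacing with Humans \u2022 41"

# JavaScript-style delimiters and flag: read as literal "/" and "/g".
js_style = r"/.+ \u2022 [0-9]+$/g"
# The same expression with the delimiters and flag removed.
plain = r".+ \u2022 [0-9]+$"
# \s+ instead of a literal space, to tolerate other kinds of whitespace.
loose = r".+\s+\u2022\s+[0-9]+$"

js_hit = re.search(js_style, header)
plain_hit = re.search(plain, header)

# Hypothetical variant: PDF extraction used non-breaking spaces around the bullet.
nbsp_header = header.replace(" \u2022 ", "\u00a0\u2022\u00a0")
plain_nbsp_hit = re.search(plain, nbsp_header)
loose_hit = re.search(loose, nbsp_header)

print("with /.../g:", js_hit)
print("without delimiters:", plain_hit is not None)
print("literal spaces vs NBSP:", plain_nbsp_hit)
print("\\s+ vs NBSP:", loose_hit is not None)
```

If the pattern still fails inside a real conversion, the spacing or the markup around the header in the intermediate HTML is the next thing I would check, since $ only matches at the very end of the input unless multiline mode is on.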

retiredbiker · 09-11-2021, 03:16 PM

I would try something simpler:

chapter.*?\d+
and
report.*?•.*?discuss

But I have found that headers and footers in OCR'd pdfs often come across with strange spacing, text scannos, and all sorts of cruft. Often you get some, but not others. So I do this in the Editor, after conversion, where I have a chance to find the exceptions. Doing it that way also gives you a chance to re-connect text that was separated by the header or footer at the page break.
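One thing worth checking before relying on the simpler patterns above: a lazy .*? stops at the first digit it can reach, so on the sample header it would consume only "Chapter 3" and leave the rest behind. The sketch below shows that with Python's re module, then tries a tighter line-anchored pair. The page text, the names header_re and footer_re, and the tightened patterns are my own assumptions for illustration, not something tested against calibre or a real PDF.

```python
import re

# Hypothetical extracted page: body text with the header and footer mixed in.
page = (
    "some body text\n"
    "Chapter 3. Interfacing with Humans \u2022 41\n"
    "more body text\n"
    "report erratum \u2022 discuss\n"
)

# Lazy match: ends at the first run of digits after "chapter".
lazy = re.search(r"chapter.*?\d+", page, flags=re.IGNORECASE)

# Tighter sketch: anchor to whole lines and require the bullet before the page number.
header_re = re.compile(
    r"^Chapter\s+\d+\..*?\s\u2022\s+\d+[ \t]*$\n?", flags=re.MULTILINE
)
footer_re = re.compile(
    r"^report\s+erratum\s+\u2022\s+discuss[ \t]*$\n?", flags=re.MULTILINE
)

# Remove header lines first, then footer lines.
cleaned = footer_re.sub("", header_re.sub("", page))

print(repr(lazy.group(0)))
print(repr(cleaned))
```

This only covers the clean case. Headers mangled by OCR, as described above, would still need the manual pass in the Editor.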
