09-11-2021, 06:17 AM | #1 |
Junior Member
Posts: 1
Karma: 10
Join Date: Sep 2021
Device: none
|
pdf to epub regex unicode character match not working
Hello, I'm trying to convert a pdf book to epub that has a header and a footer I'd like to remove. The header has the chapter name, the symbol • and the page number, for example “Chapter 3. Interfacing with Humans • 41” and the footer is "report erratum • discuss".
I have tried a few ways to match this header and footer:
/.+ • [0-9]+$/g
report erratum • discuss
/.+ \u2022 [0-9]+$/g
report erratum \u2022 discuss
/.+ \W [0-9]+$/g
report erratum \W discuss
but none of these work. I would be glad if someone could help, thanks! I'm using sr1-search and sr2-search with the ebook-convert CLI. |
09-11-2021, 06:30 AM | #2 |
Well trained by Cats
Posts: 29,818
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Are your slashes wrong for Calibre PCRE?
Code:
\.+ \g |
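A minimal sketch of the delimiter issue, assuming the pattern ends up in a Python-style regex engine (calibre is written in Python; exactly how ebook-convert compiles the sr1-search value is not shown in this thread). In that case the leading `/` and trailing `/g` from the first post are matched as literal characters rather than acting as delimiters and a flag. The `header` string is the example from the first post.

```python
import re

# Example header line quoted in the first post.
header = "Chapter 3. Interfacing with Humans \u2022 41"

# JavaScript-style /.../g: Python's re treats the slashes and the "g"
# as literal characters, so this cannot match the header text.
with_delims = re.search(r"/.+ \u2022 [0-9]+$/g", header)

# The same pattern with the delimiters removed.
bare = re.search(r".+ \u2022 [0-9]+$", header)

print(with_delims)
print(bare.group(0) if bare else None)
```

If the bare pattern still fails inside a conversion, the cause is probably elsewhere, for example markup or odd spacing around the text.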
09-11-2021, 03:16 PM | #3 |
Addict
Posts: 387
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Jutoh, Kobo Forma
|
I would try something more simple:
chapter.*?\d+
and
report.*?•.*?discuss
But I have found that headers and footers in OCR'd PDFs often come across with strange spacing, text scannos, and all sorts of cruft. Often you get some, but not others. So I do this in the Editor, after conversion, where I have a chance to find the exceptions. Doing it that way also gives you a chance to re-connect text that was separated by the header or footer at the page break. |
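For trying patterns like these outside of a conversion, here is a rough Python sketch run against plain text. It is an illustration under assumptions, not how calibre itself applies sr1-search: the names `HEADER`, `FOOTER`, and `strip_running_lines` are made up for this example, and it assumes each header or footer sits on its own line. In a real conversion the text is HTML, so tags or entities may surround the bullet and the line anchors may not apply.

```python
import re

# Hypothetical patterns: tolerant of stray spaces and tabs, anchored to whole
# lines so that body text mentioning "chapter" is less likely to be hit.
HEADER = re.compile(
    r"^[ \t]*chapter\b.*\u2022[ \t]*\d+[ \t]*$\n?",
    re.MULTILINE | re.IGNORECASE,
)
FOOTER = re.compile(
    r"^[ \t]*report\s+erratum[ \t]*\u2022[ \t]*discuss[ \t]*$\n?",
    re.MULTILINE | re.IGNORECASE,
)


def strip_running_lines(text):
    """Remove header and footer lines, including their trailing newline."""
    text = HEADER.sub("", text)
    return FOOTER.sub("", text)


sample = (
    "some body text that\n"
    "Chapter 3. Interfacing with Humans \u2022 41\n"
    "report erratum \u2022 discuss\n"
    "continues here"
)
cleaned = strip_running_lines(sample)
print(cleaned)
```

Removing the whole line together with its newline leaves the text from either side of the page break on adjacent lines, which makes the re-connecting step mentioned above easier to do by hand.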
Tags |
caliber, ebook-convert, pdf-to-epub, regex, unicode |
|