Old 09-11-2021, 06:17 AM   #1
marcio_oliveira
Question pdf to epub regex unicode character match not working

Hello, I'm trying to convert a pdf book to epub that has a header and a footer I'd like to remove. The header has the chapter name, the symbol • and the page number, for example “Chapter 3. Interfacing with Humans • 41” and the footer is "report erratum • discuss".

I've tried a few ways to match this header and footer:
/.+ • [0-9]+$/g
report erratum • discuss

/.+ \u2022 [0-9]+$/g
report erratum \u2022 discuss

/.+ \W [0-9]+$/g
report erratum \W discuss

but none of these work. I would be glad if someone could help, thanks!

I'm using sr1-search and sr2-search using the ebook-convert cli.
Old 09-11-2021, 06:30 AM   #2
theducks
Are your slashes wrong for Calibre PCRE?
Code:
\.+
\g
And have you tried using \s (for any kind of space)?
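
One way to check the delimiter and spacing questions outside calibre is with Python's re module. This is only a sketch: it assumes calibre's search and replace follows Python regex syntax, and the variable names and the non-breaking-space variant of the header are made up for illustration. In that syntax there are no /.../g delimiters, so the slashes and the trailing g are treated as literal characters to match.

```python
import re

# Sample header and footer taken from the first post (\u2022 is the bullet).
header = "Chapter 3. Interfacing with Humans \u2022 41"
footer = "report erratum \u2022 discuss"

# JavaScript-style delimiters: in Python syntax "/" and "g" are literal text.
js_style = r"/.+ \u2022 [0-9]+$/g"
# The same expression without delimiters.
plain = r".+ \u2022 [0-9]+$"
# \s instead of a literal space, so non-breaking spaces are accepted too.
flexible = r".+\s\u2022\s[0-9]+$"

js_matches = re.search(js_style, header) is not None
plain_matches = re.search(plain, header) is not None

# Hypothetical variant: the PDF extraction produced non-breaking spaces.
nbsp_header = "Chapter 3. Interfacing with Humans\u00a0\u2022\u00a041"
nbsp_plain = re.search(plain, nbsp_header) is not None
nbsp_flexible = re.search(flexible, nbsp_header) is not None

footer_matches = re.search(r"report erratum\s\u2022\sdiscuss", footer) is not None

print("with /.../g delimiters:", js_matches)
print("without delimiters:", plain_matches)
print("NBSP header, literal space:", nbsp_plain)
print("NBSP header, \\s:", nbsp_flexible)
print("footer:", footer_matches)
```

If the delimiter-free pattern matches here but not during conversion, the text calibre extracts from the PDF probably differs from what is visible on the page (different spacing or a different bullet character).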
Old 09-11-2021, 03:16 PM   #3
retiredbiker
I would try something simpler:

chapter.*?\d+
and
report.*?•.*?discuss

But I have found that headers and footers in OCR'd pdfs often come across with strange spacing, text scannos, and all sorts of cruft. Often you get some, but not others. So I do this in the Editor, after conversion, where I have a chance to find the exceptions. Doing it that way also gives you a chance to re-connect text that was separated by the header or footer at the page break.
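
The patterns above can be tried against a few sample lines before running them on the whole book. The sketch below uses Python's re module and text made up from the examples in the first post; header_re and footer_re are illustrative variants, not the exact expressions suggested above. Two things to watch for: a lowercase chapter only matches "Chapter" when the search is case-insensitive, and a lazy .*?\d+ stops at the first number it finds (the chapter number), so tying the match to the bullet and the page number removes the whole header line.

```python
import re

# Made-up page boundary: body text interrupted by the footer and the next header.
page_text = "\n".join([
    "text before the page break",
    "report erratum \u2022 discuss",
    "Chapter 3. Interfacing with Humans \u2022 41",
    "text after the page break",
])

# Lazy pattern, case-insensitive so "chapter" also matches "Chapter".
lazy = re.compile(r"chapter.*?\d+", re.IGNORECASE)
lazy_match = lazy.search(page_text).group(0)

# Illustrative variants that consume the whole line, including its newline.
header_re = re.compile(
    r"^chapter.*?\u2022\s*\d+[ \t]*\n?", re.IGNORECASE | re.MULTILINE
)
footer_re = re.compile(r"^report.*?\u2022.*?discuss[ \t]*\n?", re.MULTILINE)

# Remove the footer first, then the header, leaving the body text adjacent.
cleaned = header_re.sub("", footer_re.sub("", page_text))

print(repr(lazy_match))
print(repr(cleaned))
```

With the header and footer lines gone, the text on either side of the page break sits on adjacent lines, which makes it easier to re-connect by hand in the Editor.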