07-04-2019, 04:45 AM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2019
Device: Android
|
PDF -> ePUB: deleting <BR>s Best Practices
Dear All,
I'm new to Calibre, but those of you who are not surely know the problem of broken lines when converting PDF to ePUB: <BR> tags appear seemingly at random and split the text into thousands of fragments, which looks weird. This article (https://dearauthor.com/ebooks/calibr...nversion-tips/) suggests using Heuristic Processing during conversion to get rid of the <BR>s, but it didn't work for me: I tried values in the range from 0.4 to 0.6 with absolutely no result. The same article proposes using the Search & Replace function, and that was the solution in my case! I used the following expression:
\. +<br>(*SKIP)(*FAIL)|\<br>|\d +<br>
I assumed that <BR>s after a full stop (".") were author-defined paragraph breaks, so I didn't touch them (\. +<br>(*SKIP)), while standalone <BR>s (\<br>) and <BR>s that follow any word (\d +<br>) were replaced with nothing (i.e. deleted), as they almost always broke a sentence into useless fragments.
Everything would have been perfectly fine except for one thing: this expression also deletes the "useful" <BR>s after headlines, which are usually marked with a <b> tag (<b>THIS IS HEADLINE </b><br>), and after paragraph (chapter?) anchors, which are marked with an <a id> tag (<a id="p8"></a> <br>). So what I need is to add an exception to my expression so that <BR>s are not deleted when they follow </a> and </b> tags.
I have played around with quite a number of variants, but still can't find my Holy Grail. Possibly the (*SKIP)(*FAIL) construct does not support multiple skip rules: I skip one case from the very beginning and want to add two more, so three in total. Any thoughts?
Last edited by ogassav; 07-04-2019 at 04:49 AM. |
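For readers following along, the idea of having several "keep" rules at once can be sketched in plain Python. This is only an illustration under assumptions, not a tested calibre recipe. Python's standard re module has no (*SKIP)(*FAIL), so the sketch emulates it: every <br> that should survive is captured in group 1 and written back, and only a bare <br> is dropped. The name strip_breaks and the sample text are made up for the example. The \d +<br> alternative from the post is left out on purpose, because \d matches a digit rather than a word, so it would delete that digit together with the tag.

```python
import re

# Alternatives inside group 1 are the breaks to KEEP: after a full stop,
# after </a>, and after </b>. The bare <br> outside the group is everything else.
BREAKS = re.compile(r"(\. +<br>|</a> *<br>|</b> *<br>)|<br>", re.IGNORECASE)


def strip_breaks(html: str) -> str:
    """Delete <br> tags except those covered by a keep rule (hypothetical helper)."""
    # If a keep rule matched, group 1 holds the text and is written back unchanged.
    # Otherwise group 1 is None and the bare <br> is replaced with nothing.
    return BREAKS.sub(lambda m: m.group(1) or "", html)


sample = 'one <br>two. <br><b>HEAD </b><br><a id="p8"></a> <br>three <br>four'
print(strip_breaks(sample))
```

In a search-and-replace dialog the same idea would mean putting \1 in the replace field, assuming the engine substitutes an empty string for a group that did not take part in the match. More keep rules can be added as further alternatives inside the group.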
07-04-2019, 09:55 PM | #2 |
Well trained by Cats
Posts: 30,370
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
My advice is NOT to try to clean up complex issues during conversion. Convert to EPUB or AZW3 and use the editor's Search & Replace to SELECTIVELY remove <BR>s (some are wanted, like those in the headings). There may also be cases of BR BR, which may be a scene break and need different treatment (do these first, then the singles).
07-05-2019, 02:26 AM | #3 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2019
Device: Android
|
Dear theducks,
While I totally agree with you about the flaws of "bulk" removal of BRs with the Search & Replace function, I'm fine with some mistakes being left in the text, as it is intended for my personal use only. Do you have an idea how to add further skip logic to the expression I mentioned above? Last edited by ogassav; 07-05-2019 at 02:28 AM. |
07-05-2019, 09:02 AM | #4 | |
Well trained by Cats
Posts: 30,370
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
I had no reason to develop automated tools. I have a library of saved searches (in Sigil) that I draw from (past efforts), since it seems every book needs something slightly different anyway. |
|
07-05-2019, 09:49 AM | #5 |
creator of calibre
Posts: 44,336
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
What you need for this kind of thing are lookbehind assertions in the regular expression.
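A minimal sketch of what lookbehind assertions could look like for this case, written in Python. It assumes calibre's search and replace accepts the same (?<!...) syntax, which is not confirmed in this thread; the name remove_stray_breaks and the sample text are hypothetical. Python's re module only allows fixed-width lookbehinds, so the optional spaces before <br> are consumed by the match itself rather than placed inside the assertion.

```python
import re

# Each (?<!...) is a negative lookbehind: the match may not start right after
# a full stop or whitespace, right after </a>, or right after </b>.
# Listing \s in the first lookbehind stops the match from starting in the
# middle of a run of spaces that follows one of the protected contexts.
STRAY_BR = re.compile(r"(?<![.\s])(?<!</a>)(?<!</b>)\s*<br>", re.IGNORECASE)


def remove_stray_breaks(html: str) -> str:
    """Drop mid-sentence <br> tags, keeping those after '.', </a> and </b>."""
    # Replace with a single space so the two halves of a sentence do not fuse.
    return STRAY_BR.sub(" ", html)


sample = ('The quick brown <br>fox jumps. <br><b>THIS IS HEADLINE </b><br>'
          '<a id="p8"></a> <br>over the lazy<br>dog.')
print(remove_stray_breaks(sample))
```

Compared with a (*SKIP)(*FAIL) pattern, each exception here is one more lookbehind in front of the tag, so adding a new protected context does not change the rest of the expression.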
07-05-2019, 12:47 PM | #6 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2019
Device: Android
|
|
07-05-2019, 02:01 PM | #7 | |
Well trained by Cats
Posts: 30,370
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
There is an app called RegexBuddy (for Windows). It ain't free ($40), but if you are short on hair
|
07-05-2019, 10:12 PM | #8 |
creator of calibre
Posts: 44,336
Karma: 23661992
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
07-06-2019, 03:28 AM | #9 |
Wizard
Posts: 1,165
Karma: 1410083
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
|
In addition, this is helpful as well:
https://www.regular-expressions.info/lookaround.html |
07-06-2019, 03:31 AM | #10 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2019
Device: Android
|
OK guys, it looks like there's a misunderstanding here. I know perfectly well what I need to implement in my expression: logic that excludes two types of <BR>s. Call it skip logic, lookbehind assertions, ignore rules, whatever.
The problem is that I don't know how to translate this logic into Calibre's regular expression language. So the message of my post is: "Is anyone here familiar with this kind of programming? I've worked on an expression and got stuck at a certain stage, and I badly need your help." And believe me, I had already studied the Calibre regular expression help and tried several variants with no result, as I wrote in my very first post. I tried to do something myself before asking for help, so just pointing me toward the User Manual is not what I expect from the community in cases like this. Last edited by ogassav; 07-06-2019 at 03:34 AM. |
07-06-2019, 10:27 AM | #11 |
Well trained by Cats
Posts: 30,370
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Calibre uses the PCRE dialect of REGEX
|
07-13-2019, 08:51 AM | #12 | |
Book E d i t o r
Posts: 432
Karma: 288184
Join Date: May 2015
Device: Laptop
|
Quote:
|
|
|