MobileRead Forums

PDF -> ePUB: deleting <BR>s Best Practices (https://www.mobileread.com/forums/showthread.php?t=321260)

ogassav 07-04-2019 05:45 AM

PDF -> ePUB: deleting <BR>s Best Practices
 
Dear All,

I'm new to Calibre, but those of you who are not surely know the problem of broken lines when converting PDF to ePUB: <BR> tags appear seemingly at random and split the text into thousands of fragments, which looks weird.

This article (https://dearauthor.com/ebooks/calibr...nversion-tips/) suggests using Heuristic Processing during conversion to get rid of <BR>s, but it didn't work for me - I used the range from 0.4 to 0.6 with absolutely no result.

The same article proposes using the Search & Replace function, and that was the solution in my case! I used the following expression: \. +<br>(*SKIP)(*FAIL)|\<br>|\d +<br>

I assumed that a <BR> after a dot (".") was an author-defined start of a new paragraph, so I didn't touch those (\. +<br>(*SKIP)(*FAIL)), while standalone <BR>s (\<br>) and <BR>s which follow any word (\d +<br>) were replaced with nothing (= deleted), as they almost always broke a sentence into useless fragments.

Everything would have been perfectly fine, except for one thing: the above-mentioned algorithm also deletes "useful" <BR>s after headlines, which are usually marked with a <b> tag (<b>THIS IS HEADLINE </b><br>), and after paragraph (chapter?) anchors, which are marked with an <a id> tag (<a id="p8"></a> <br>).

So what I need is to add an exception to my algorithm so that <BR>s are not deleted when they follow </a> or </b> tags. I played around with quite a number of variants, but still can't find my Grail. Possibly the (*SKIP)(*FAIL) construct does not allow multiple skip rules: I already skip one pattern and want to add two more, so three in total.

Any thoughts?
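
One way the extra exceptions asked for above could be expressed without (*SKIP)(*FAIL) is a "capture and restore" pattern: the breaks to keep are captured in a group and written back, and every other <br> is replaced with nothing. The sketch below uses Python's standard `re` module to show the idea. It assumes, without having verified it, that Calibre's Search & Replace accepts the same syntax and treats an unmatched group in the replacement as empty; the function name and sample text are made up for illustration. Note that `\d` matches a digit rather than a word, so the third alternative of the original expression is left out here; the bare `<br>` alternative already covers those breaks.

```python
import re

# Alternative 1 (captured): a <br> that follows ".", "</a>" or "</b>",
# optionally separated by spaces. These are the breaks to keep.
# Alternative 2 (not captured): any other <br>.
PATTERN = re.compile(r'((?:\.|</a>|</b>) *<br>)|<br>', re.IGNORECASE)


def strip_stray_breaks(html):
    # Replacing with \1 writes a protected break back unchanged. For a
    # bare <br>, group 1 did not take part in the match and expands to
    # an empty string (Python 3.5 and later), so the break is deleted.
    return PATTERN.sub(r'\1', html)


# Hypothetical sample built from the two cases quoted in the post.
sample = (
    '<b>THIS IS HEADLINE </b><br>'
    '<a id="p8"></a> <br>'
    'A sentence split <br>in the middle. <br>'
    'Next sentence<br> continues.'
)

print(strip_stray_breaks(sample))
# <b>THIS IS HEADLINE </b><br><a id="p8"></a> <br>A sentence split in the middle. <br>Next sentence continues.
```

In a search-and-replace dialog this would correspond to the search expression `((?:\.|</a>|</b>) *<br>)|<br>` with `\1` as the replacement, again assuming Python-style replacement syntax.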

theducks 07-04-2019 10:55 PM

My opinion is NOT to try to clean up complex issues during conversion. Convert to EPUB or AZW3 and use the editor's Search & Replace to SELECTIVELY remove BRs (some are wanted, like those in headings). Then there may also be the case of BR BR, which may be a scene break and need a different treatment (do this first, then the singles).
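
The ordering suggested above (doubled BRs first, then singles) can be sketched as two passes. This is only an illustration in Python's `re` module: the scene-break markup and the function name are hypothetical choices, and the second pass removes every remaining <br>, whereas in practice some singles would be kept selectively.

```python
import re

# Hypothetical replacement markup for a scene break; any marker that
# does not itself contain "<br>" would work for this sketch.
SCENE_BREAK = '<hr class="scene-break"/>'


def two_pass_cleanup(html):
    # Pass 1: a run of two or more <br> tags (optionally separated by
    # whitespace) is treated as a scene break and replaced by the marker.
    html = re.sub(r'<br>(?:\s*<br>)+', SCENE_BREAK, html, flags=re.IGNORECASE)
    # Pass 2: the remaining single <br> tags are removed.
    html = re.sub(r'<br>', '', html, flags=re.IGNORECASE)
    return html


print(two_pass_cleanup('End of scene.<br><br>New scene starts <br>here.'))
```

Running the passes in the opposite order would delete the doubled breaks along with the singles, which is why the pairs are handled first.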

ogassav 07-05-2019 03:26 AM

Dear theducks,

While I totally agree with you about the flaws of "bulk" removal of BRs with the Search & Replace function, I'm fine with certain mistakes being left in the text, as it is for my personal use only.

Do you have any idea how to add extra skip logic to the formula I mentioned above?

theducks 07-05-2019 10:02 AM

Quote:

Originally Posted by ogassav (Post 3863933)
Dear theducks,


Do you have any idea how to add extra skip logic to the formula I mentioned above?

Nope.
I have had no reason to develop automated tools. I have a library of saved searches (in Sigil) that I draw from (past efforts :D ), since it seems every book needs something slightly different anyway.

kovidgoyal 07-05-2019 10:49 AM

What you need for this kind of thing are lookbehind assertions in the regular expression.
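
A minimal sketch of what negative lookbehind assertions could look like for this case, written with Python's standard `re` module. It assumes that Calibre's Search & Replace accepts the same syntax, which has not been verified here. Python's `re` only allows fixed-width lookbehinds, so each protected prefix gets one guard with no space and one with a single space; two or more spaces before the <br> are not covered. The function name and sample text are made up for illustration.

```python
import re

# Each (?<!...) guard says "the text just before <br> must NOT be this".
PATTERN = re.compile(
    r'(?<!\.)(?<!\. )'        # keep <br> after a full stop
    r'(?<!</a>)(?<!</a> )'    # keep <br> after a closing anchor tag
    r'(?<!</b>)(?<!</b> )'    # keep <br> after a closing bold tag
    r'<br>',
    re.IGNORECASE,
)


def remove_unprotected_breaks(html):
    # Only <br> tags that pass all six guards are matched and deleted.
    return PATTERN.sub('', html)


# Hypothetical sample built from the cases described in the thread.
sample = (
    '<b>THIS IS HEADLINE </b><br>'
    '<a id="p8"></a> <br>'
    'A sentence split <br>in the middle. <br>'
    'Next sentence<br> continues.'
)

print(remove_unprotected_breaks(sample))
```

Written on one line for a search field, the expression is `(?<!\.)(?<!\. )(?<!</a>)(?<!</a> )(?<!</b>)(?<!</b> )<br>`, with an empty replacement.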

ogassav 07-05-2019 01:47 PM

Quote:

Originally Posted by kovidgoyal (Post 3864039)
What you need for this kind of thing are lookbehind assertions in the regular expression.

Mmm, are they described in the Calibre help somewhere? I couldn't find them. :bookworm: Google says these assertions are used in Java and Python, and I'm not a programmer at all... :help:

theducks 07-05-2019 03:01 PM

Quote:

Originally Posted by ogassav (Post 3864113)
Mmm, are they described in the Calibre help somewhere? I couldn't find them. :bookworm: Google says these assertions are used in Java and Python, and I'm not a programmer at all... :help:

They are part of the PCRE flavor of REGEX. That is where you look.
:bulb2: There is an app called RegexBuddy (for Windows). It ain't free ($40), but if you are short on hair :D

kovidgoyal 07-05-2019 11:12 PM

https://manual.calibre-ebook.com/regexp.html

Divingduck 07-06-2019 04:28 AM

In addition, this is also helpful:
https://www.regular-expressions.info/lookaround.html

ogassav 07-06-2019 04:31 AM

OK guys, it looks like there's a misunderstanding here. I know perfectly well what I need to implement in my formula: logic that excludes two types of <BR>s. Call it skip logic, lookbehind assertions, ignore rules, whatever.

The problem is that I don't know how to translate this logic into Calibre's regular expression language. So the message of my post is: "Is anyone here familiar with this kind of programming? I've worked out a formula and got stuck at a certain stage, and I badly need your help." Believe me, I have already studied the Calibre regular expression help and tried several variants with no result, as I wrote in my very first post. I tried to do something myself before asking for help, so just pointing me to the User Manual is not what I expect from the community in a case like this.

theducks 07-06-2019 11:27 AM

Calibre uses the PCRE dialect of REGEX

deback 07-13-2019 09:51 AM

Quote:

This article (https://dearauthor.com/ebooks/calibr...nversion-tips/) suggests using Heuristic Processing during conversion to get rid of <BR>s, but it didn't work for me - I used the range from 0.4 to 0.6 with absolutely no result.
Try using 0.22 as the factor under Heuristic Processing. You should see a difference in the result regarding split paragraphs.


All times are GMT -4.
