MobileRead Forums - View Single Post - PDF -> ePUB: deleting <BR>s Best Practices

ogassav · 07-04-2019, 05:45 AM

Dear All,

I'm new to Calibre, but those of you who are not surely know about the problem of broken lines when converting PDF to ePUB. <BR> codes appear wherever they want to and split the text into thousands of passages, which looks weird.

This article (https://dearauthor.com/ebooks/calibr...nversion-tips/) suggests using Heuristic Processing during conversion to get rid of <BR>s, but it didn't work for me: I used the range from 0.4 to 0.6 with absolutely no result.

The same article proposes using the Search & Replace function, and that was the solution in my case! I used the following logic: \. +<br>(*SKIP)(*FAIL)|\<br>|\d +<br>

I assumed that <BR>s after a dot (".") were an author-defined start of a new passage, so I didn't touch them (\. +<br>(*SKIP)), while standalone <BR>s (\<br>) and <BR>s which follow any word (\d +<br>) were replaced with nothing (= deleted), as they were almost always breaking a sentence into useless passages.
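For readers who want to try this logic outside Calibre, here is a minimal sketch in Python. It uses the standard `re` module, which has no (*SKIP)(*FAIL) verbs, so it swaps in a different technique: the case to keep is captured in a group and written back, and every other <br> is replaced with nothing. The function name and the sample string are made up for illustration, and the \d +<br> branch is left out for brevity.

```python
import re

# Group 1 captures the case to keep: a period, one or more spaces, then <br>.
# A <br> anywhere else matches the second alternative, with group 1 unset.
PATTERN = re.compile(r"(\. +<br>)|<br>", re.IGNORECASE)


def drop_stray_breaks(html: str) -> str:
    """Delete <br> tags, except those that follow a period and spaces."""
    # Write the kept case back unchanged; replace everything else with "".
    return PATTERN.sub(lambda m: m.group(1) or "", html)


# Hypothetical sample text, not taken from a real conversion.
sample = "The line was <br>broken here. <br>New passage starts."
print(drop_stray_breaks(sample))
```

The alternation order matters: the "keep" alternative has to come first so that it consumes its <br> before the bare <br> alternative can see it.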

Everything would have been perfectly fine, except for one thing: the above-mentioned algorithm deletes "useful" <BR>s after headlines, which are usually highlighted with <b> code (<b>THIS IS HEADLINE </b><br>), and after paragraphs (chapters???), which are highlighted with <a id> code (<a id="p8"></a> <br>).

So, what I need is to add an exception to my algorithm so that <BR>s are not deleted when they follow </a> and </b> codes. I played around with quite a number of different variants, but still can't find my Grail. Possibly the (*SKIP)(*FAIL) architecture does not support multiple skip logic: I ignore 1 pattern from the very beginning and want to add 2 more, so 3 in total.
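To make the three-exception goal concrete, here is a hedged sketch in Python's standard `re` module (again a capture-and-restore substitute for (*SKIP)(*FAIL), not Calibre's own Search & Replace). All three "keep" cases are grouped into one alternation in front of <br>. The function name is hypothetical, and the sample string is assembled from the snippets quoted above.

```python
import re

# The three cases where <br> should survive: after a period and spaces,
# after </b>, and after </a> (optional spaces allowed before the <br>).
KEEP = r"(?:\. +|</b> *|</a> *)<br>"

# Group 1 holds a kept case; a bare <br> leaves group 1 unset.
PATTERN = re.compile(rf"({KEEP})|<br>", re.IGNORECASE)


def drop_breaks_with_exceptions(html: str) -> str:
    """Delete <br> tags unless they follow '. ', </b> or </a>."""
    return PATTERN.sub(lambda m: m.group(1) or "", html)


# Sample built from the fragments quoted in this post.
sample = (
    '<a id="p8"></a> <br><b>THIS IS HEADLINE </b><br>'
    "Some text <br>continues here. <br>Next."
)
print(drop_breaks_with_exceptions(sample))
```

In an engine that does support the backtracking verbs, the same grouping could presumably be written as (?:\. +|</b> *|</a> *)<br>(*SKIP)(*FAIL)|<br>, but that form is an assumption and has not been tested in Calibre.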

Any thoughts?

Last edited by ogassav; 07-04-2019 at 05:49 AM.