07-04-2019, 05:45 AM   #1
ogassav
Junior Member
Posts: 5
Karma: 10
Join Date: Jul 2019
Device: Android
PDF -> ePUB: deleting <BR>s Best Practices

Dear All,

I'm new to Calibre, but those of you who aren't surely know the problem of broken lines when converting PDF to ePUB: <BR> tags appear seemingly at random and split the text into thousands of fragments, which looks weird.

This article (https://dearauthor.com/ebooks/calibr...nversion-tips/) suggests using Heuristic Processing during conversion to get rid of <BR>s, but it didn't work for me - I used the range from 0.4 to 0.6 with absolutely no result.

The same article proposes using the Search & Replace function, and that was the solution in my case! I used the following expression: \. +<br>(*SKIP)(*FAIL)|\<br>|\d +<br>

I assumed that <BR>s after a full stop (".") were author-intended starts of a new paragraph, so I didn't touch them (\. +<br>(*SKIP)), while standalone <BR>s (\<br>) and <BR>s which follow any word (\d +<br>) were replaced with nothing (i.e. deleted), as they almost always broke a sentence into useless fragments.

Everything would have been perfectly fine except for one thing: the above algorithm also deletes "useful" <BR>s after headlines, which are usually marked with a <b> tag (<b>THIS IS HEADLINE </b><br>), and after paragraph (chapter???) anchors, which are marked with an <a id> tag (<a id="p8"></a> <br>).

So what I need is to add an exception to my algorithm so that <BR>s are not deleted when they follow </a> and </b> tags. I played around with quite a number of variants but still can't find my Holy Grail. Possibly the (*SKIP)(*FAIL) construct does not support multiple skip rules: I already skip one pattern and want to add two more, for three in total.

Any thoughts?
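
One way to get several skip rules at once can be sketched in plain Python with the standard `re` module, which has no (*SKIP)(*FAIL) verbs: list every context to keep as an alternative, capture the deletable <br> in a group, and let a replacement function put the kept matches back unchanged. The name `strip_stray_br` and the sample markup are made up for illustration, and the sketch has not been run inside calibre. In an engine that does support the verbs, the same idea would presumably be written by grouping the skip alternatives, e.g. `(?:\. +<br>|</b> *<br>|</a> *<br>)(*SKIP)(*FAIL)|<br>`. Note also that `\d` in the expression above matches a digit, not a word character, so the `\d +<br>` branch deletes the digit it matches along with the <br>.

```python
import re

# The first two alternatives are contexts to KEEP; only group 1 marks a <br> to delete.
PATTERN = re.compile(
    r"\.\s+<br>"            # <br> after a full stop: treated as a real paragraph end
    r"|</[ab]>\s*<br>"      # <br> right after a closing </a> or </b> tag
    r"|(<br>)",             # any other <br>: captured for deletion
    re.IGNORECASE,
)


def strip_stray_br(html: str) -> str:
    """Delete <br> tags except those following '.', </a> or </b>."""
    def repl(m: re.Match) -> str:
        # Group 1 is set only when the deletable alternative matched;
        # otherwise return the kept context unchanged.
        return "" if m.group(1) else m.group(0)
    return PATTERN.sub(repl, html)


sample = (
    '<b>THIS IS HEADLINE </b><br>'
    '<a id="p8"></a> <br>'
    'A sentence broken <br>in the middle. <br>'
    'Next paragraph<br>'
)
print(strip_stray_br(sample))
```

Adding a further exception is then just one more alternative in front of the capturing group.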

Last edited by ogassav; 07-04-2019 at 05:49 AM.
07-04-2019, 10:55 PM   #2
theducks
Well trained by Cats
Posts: 23,992
Karma: 27923385
Join Date: Aug 2009
Location: The Central Coast of California
Device: K4NT, Galaxy Tab A, Kobo Aura2
My advice is NOT to try to clean up complex issues during conversion. Convert to EPUB or AZW3 and use the editor's Search & Replace to SELECTIVELY remove BRs (some are wanted, as in the headings). There may also be cases of BR BR, which may mark a scene break and need different treatment (handle these first, then the singles).
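
The order suggested above (doubles first, then singles) could look like the following sketch in plain Python. The `<hr/>` scene-break marker and the helper name `clean_breaks` are assumptions for illustration only. The second pass here removes every remaining single <br>, so a selective version would still need rules for the ones worth keeping, and in the editor each step would be a separate, reviewed search rather than a blind bulk run.

```python
import re

# Two or more <br> in a row, optionally separated by whitespace.
DOUBLE_BR = re.compile(r"(?:<br\s*/?>\s*){2,}", re.IGNORECASE)
# A single <br> together with the whitespace around it.
SINGLE_BR = re.compile(r"\s*<br\s*/?>\s*", re.IGNORECASE)


def clean_breaks(html: str, scene_break: str = "<hr/>") -> str:
    # Pass 1: treat runs of <br> as a scene break (an assumed interpretation).
    html = DOUBLE_BR.sub(scene_break, html)
    # Pass 2: remaining single <br> are taken to be stray line breaks; join with a space.
    return SINGLE_BR.sub(" ", html)


print(clean_breaks("end of scene.<br><br>New scene starts <br>here."))
```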
07-05-2019, 03:26 AM   #3
ogassav
Dear theducks,

While I totally agree with you about the flaws of "bulk" removal of BRs with the Search & Replace function, I'm fine with some mistakes left in the text, as it is for my personal use only.

Do you have an idea how to add extra skip logic to the formula I mentioned above?

Last edited by ogassav; 07-05-2019 at 03:28 AM.
07-05-2019, 10:02 AM   #4
theducks
Quote:
Originally Posted by ogassav
Dear theducks,

Do you have an idea how to add extra skip logic to the formula I mentioned above?

Nope.
I had no reason to develop automated tools. I have a library of saved searches (in Sigil) from past efforts that I draw on, since it seems every book needs something slightly different anyway.
07-05-2019, 10:49 AM   #5
kovidgoyal
creator of calibre
Posts: 35,456
Karma: 12734961
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
What you need for this kind of thing are lookbehind assertions in the regular expression.
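
As a rough illustration of that suggestion: a negative lookbehind `(?<!...)` lets a <br> match only when it is not preceded by a given string. The sketch below uses Python's standard `re` module, where each lookbehind must have a fixed width, so the optional spaces are handled by forcing the match to start at the beginning of the whitespace run. The pattern, the name `unwrap_lines`, and the sample text are illustrative assumptions, not something tested in calibre's conversion dialog.

```python
import re

STRAY_BR = re.compile(
    r"(?<![.\s])"    # not right after a full stop, and not in mid-whitespace
    r"(?<!</b>)"     # not right after a closing bold tag (headline)
    r"(?<!</a>)"     # not right after a closing anchor tag
    r"\s*<br>",      # optional whitespace, then the <br> itself
    re.IGNORECASE,
)


def unwrap_lines(html: str) -> str:
    """Replace each stray <br> (and the whitespace before it) with one space."""
    return STRAY_BR.sub(" ", html)


sample = '<b>HEADLINE </b><br><a id="p8"></a> <br>broken <br>line. <br>'
print(unwrap_lines(sample))
```

Unlike the expression in the first post, this version also keeps a <br> that follows a full stop with no space in between.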
07-05-2019, 01:47 PM   #6
ogassav
Quote:
Originally Posted by kovidgoyal
What you need for this kind of thing are lookbehind assertions in the regular expression.

Mmm, are they described in the Calibre help somewhere? I couldn't find them. Google says these assertions are used in Java and Python, and I'm not a programmer at all...
07-05-2019, 03:01 PM   #7
theducks
Quote:
Originally Posted by ogassav
Mmm, are they described in the Calibre help somewhere? I couldn't find them. Google says these assertions are used in Java and Python, and I'm not a programmer at all...

They are part of the PCRE flavor of regex. That is where to look.
There is an app called RegexBuddy (for Windows). It ain't free ($40), but if you are short on hair...
07-05-2019, 11:12 PM   #8
kovidgoyal
https://manual.calibre-ebook.com/regexp.html
07-06-2019, 04:28 AM   #9
Divingduck
Wizard
Posts: 1,143
Karma: 1404167
Join Date: Nov 2010
Location: Germany
Device: Sony PRS-650
In addition, this is helpful as well:
https://www.regular-expressions.info/lookaround.html
07-06-2019, 04:31 AM   #10
ogassav
OK guys, it looks like there's a misunderstanding here. I know perfectly well what I need to implement in my formula: logic that excludes two types of <BR>s. Call it skip logic, lookbehind assertions, ignore rules, whatever.

The problem is that I don't know how to translate this logic into Calibre's regular-expression language. So the message of my post is: "Is anyone here familiar with this kind of programming? I've worked out a formula and got stuck at a certain stage, and I badly need your help." Believe me, I've already studied the Calibre regex help and tried several variants with no result, as I wrote in my very first post. I tried to do something myself before asking for help, so just pointing me toward the User Manual is not what I expect from the community in cases like this.

Last edited by ogassav; 07-06-2019 at 04:34 AM.
07-06-2019, 11:27 AM   #11
theducks
Calibre uses the PCRE dialect of REGEX
07-13-2019, 09:51 AM   #12
deback
Book E d i t o r
Posts: 341
Karma: 31930
Join Date: May 2015
Device: Laptop
Quote:
This article (https://dearauthor.com/ebooks/calibr...nversion-tips/) suggests using Heuristic Processing during conversion to get rid of <BR>s, but it didn't work for me - I used the range from 0.4 to 0.6 with absolutely no result.
Try using 0.22 as the factor under Heuristic Processing. You should see a difference in the result regarding split paragraphs.
All times are GMT -4.