05-06-2022, 11:31 AM   #82
jackie_w
Grand Sorcerer
Posts: 6,252
Karma: 16544692
Join Date: Sep 2009
Location: UK
Device: ClaraHD, Forma, Libra2, Clara2E, LibraCol, PBTouchHD3
Quote:
Originally Posted by CyberPaul
@jackie_w
This is the KEPUB code:

I think it is pretty much as you described, right?
Yes.

Quote:
Originally Posted by CyberPaul
What I do not understand is why the algorithm insists on that specific sequence (Ma...). Why not add spaces within other words? I think it may be related to periods being interpreted as single words, because a period is usually a separator of words.
If you're asking how the kepub reading app decides where to "inject unwanted spaces", the simple answer is that I don't know, other than that it must consider them necessary to get the neatly justified right edge. Why it thinks it's OK to create spaces within a word, rather than adding more space to the existing gaps between words, is anyone's guess.

If you're asking why the kepub creation algorithm fragments paragraphs the way it does, it's for koboSpan purposes. It tries to create (at least) one koboSpan per sentence. However, a koboSpan must contain only text, not other tags, so the algorithm has to end the current one and start a new one whenever it encounters an inline tag such as <i>, <em> or <span> in the middle of a sentence. On top of that, 'end of sentence' is determined by a regex search for a somewhat simplistic list of punctuation characters (period, colon, ellipsis, etc.).
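To make that concrete, here is a rough Python sketch of the idea. It is not the code any real kepub tool uses: the punctuation list, the regexes, the function name add_kobo_spans and the kobo.<paragraph>.<segment> id numbering are all my own assumptions for illustration. It only shows the two rules described above: break at inline tags, and break at anything that looks like end-of-sentence punctuation.

```python
import re

# ASSUMPTION: a simplistic "end of sentence" rule, where any single
# . : ! ? or ellipsis character (plus trailing whitespace) closes a segment.
# The second alternative picks up trailing text with no closing punctuation.
SEGMENT = re.compile(r'[^.:!?…]*[.:!?…]\s*|[^.:!?…]+$')

# Any tag at all; the capturing group makes re.split keep the tags.
INLINE_TAG = re.compile(r'(<[^>]+>)')


def add_kobo_spans(paragraph_html, para_no=1):
    """Wrap the text of one paragraph in koboSpan-style spans (illustrative only)."""
    out = []
    seg_no = 0
    for piece in INLINE_TAG.split(paragraph_html):
        if not piece:
            continue
        if INLINE_TAG.fullmatch(piece):
            # A span may only hold text, so tags are copied through unwrapped.
            # Reaching a tag has already ended the previous span.
            out.append(piece)
            continue
        for match in SEGMENT.finditer(piece):
            seg_no += 1
            out.append(
                f'<span class="koboSpan" id="kobo.{para_no}.{seg_no}">'
                f'{match.group(0)}</span>'
            )
    return ''.join(out)


print(add_kobo_spans('He said <i>Mr. Darcy</i> was late.'))
```

With these assumed rules, that one short sentence ends up as four spans: one before the <i>, two inside it (because of the period in "Mr."), and one after the </i>.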

With current tools, books that make heavy use of any of the following will get a lot of unnecessary (IMO) koboSpan fragmentation during kepub creation:
  • 3 consecutive periods (...) instead of a single ellipsis character (…)
  • abbreviations: Mr. Mrs. Dr. U.S.A. U.K.
  • time-related text: A.M. P.M. 12:30
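
For anyone who wants to see the effect of the cases listed above, here is a small Python sketch that counts the fragments a naive punctuation-based splitter would produce. The regex is a hypothetical stand-in of my own (every . : ! ? or … closes a fragment), not the pattern any actual kepub tool uses, and the sample sentences are made up.

```python
import re

# ASSUMPTION: every single . : ! ? or ellipsis character ends a fragment,
# together with any whitespace that follows it. The second alternative
# catches leftover text that has no closing punctuation.
FRAGMENT = re.compile(r'[^.:!?…]*[.:!?…]\s*|[^.:!?…]+$')


def fragments(text):
    """Return the list of fragments the naive rule would produce."""
    return FRAGMENT.findall(text)


samples = [
    "Wait... what happened?",                     # 3 consecutive periods
    "Wait… what happened?",                       # single ellipsis character
    "Mr. and Mrs. Smith live in the U.K. now.",   # abbreviations
    "We met at 12:30 P.M. today.",                # time-related
]

for sample in samples:
    parts = fragments(sample)
    print(len(parts), parts)
```

Under this assumed rule the three-period version splits into 4 fragments where the single-ellipsis version gives 2, the abbreviation sentence gives 5, and the time sentence gives 4, even though each sample is at most two real sentences.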

Last edited by jackie_w; 05-06-2022 at 11:37 AM. Reason: typo