Old 03-26-2021, 07:05 PM   #1
phossler
Wizard
Posts: 1,071
Karma: 412718
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
RegEx: Insert nbs between initials, etc.

Looking to see if there is a RegEx that will help to insert non-breaking spaces to join names, etc. so as to avoid a line wrap in the middle

For example, in the fragment below, I'd like to insert an NBS after the 's' and before the 'B' so that 'Thomas J. Beale' doesn't line wrap

Quote:
by the name of Thomas J. Beale rode into Lynchburg and checked into the Washington Hotel.
I think I'll need two passes

#1 - A. B. Charles -- > A.nbsB.nbsC

Find: ([A-Z])\. ([A-Z])\. ([A-Z])
Replace: \1.\xA0\2.\3

#2 - A. Charles --> A.nbsC

Find: ([A-Z])\. ([A-Z])
Replace: \1.\xA0\2

The problem -- If a sentence ends with an upper case letter, it finds it

blah blah end of sentence T. Next sentence

Is there a better way to do the RE?
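A minimal Python sketch of the two passes, for trying them on a scratch copy outside the calibre editor. It assumes pass #1 is meant to put a non-breaking space after both periods, and it writes U+00A0 directly; `join_initials` is a made-up name. It also reproduces the sentence-end false hit described above.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

def join_initials(text):
    # Pass 1: "A. B. Charles" (assumed intent: nbsp after both periods)
    text = re.sub(r"([A-Z])\. ([A-Z])\. ([A-Z])", rf"\1.{NBSP}\2.{NBSP}\3", text)
    # Pass 2: "A. Charles"
    text = re.sub(r"([A-Z])\. ([A-Z])", rf"\1.{NBSP}\2", text)
    return text

# A sentence ending in a capital letter is matched as well (unwanted):
print(repr(join_initials("blah blah end of sentence T. Next sentence")))
```

Nothing in either pattern can tell "T. Next" from "J. Beale", which is the limitation being asked about.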

Eventually I'd like to extend this to joining dates. Maybe some others

March 22, 2021 --> Marchnbs22,nbs2021
Old 03-27-2021, 01:01 PM   #2
retiredbiker
Addict
Posts: 371
Karma: 1638210
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Ubuntu, Jutoh,Kobo Forma
You are on the right track, but in this case I can't imagine a search where you would want to run a "replace all". The possibilities of non-name text fitting the search are pretty large no matter what. No regex string can tell if a word is a person's name.

For the "Thomas J. Beale" sort of case, ([a-z])\. ([A-Z])\. ([A-Z]) with match case checked should work OK. Your other searches look OK, but I'd do "find and replace" throughout the book for any of them rather than "replace all". In your "A. B. Charles" case, I think you want another nbsp to go between the B and C in the replace string.
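A small Python sketch of the "Thomas J. Beale" case. As posted, the pattern expects a period right after the lowercase letter, which "Thomas J. Beale" does not have, so the sketch assumes the intended pattern drops that first `\.`. The function name is hypothetical, and the replacement (nbsp on both sides of the initial) is an assumption based on the first post.

```python
import re

NBSP = "\u00a0"

# Assumed intent: lowercase letter, space, initial, period, space, capital.
# Python's re is case-sensitive by default, like "match case" checked.
NAME_WITH_MIDDLE_INITIAL = re.compile(r"([a-z]) ([A-Z])\. ([A-Z])")

def join_first_middle_last(text):
    return NAME_WITH_MIDDLE_INITIAL.sub(rf"\1{NBSP}\2.{NBSP}\3", text)

print(repr(join_first_middle_last("by the name of Thomas J. Beale rode into Lynchburg")))
```

This still cannot tell a name from any other lowercase word followed by a capital and a period, so stepping through the hits remains the safer route.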
Old 03-27-2021, 02:46 PM   #3
theducks
Well trained by Cats
Posts: 29,579
Karma: 54344444
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
How about A.M. or P.M.?
Really, do not use replace all. You can step through every hit with Find (to skip) or Replace & Find (to accept) pretty fast. I keep Preview up on a monitor to validate 'context' issues.
Old 03-27-2021, 04:57 PM   #4
phossler
Wizard
Quote:
Originally Posted by retiredbiker View Post
In your "A. B. Charles" case, I think you want another nbsp to go between the B and C in the replace string.
Yea, I was testing, copy, pasting all at once (you'd think I'd know better) and messed up

@retiredbiker
@theducks

While I'd like to be able to do a [Replace All], I didn't trust my RE, so I figured I'd ask more experienced users if there was a more advanced way

A [Replace and find] takes longer but is safer
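Outside the editor, some of the safety of [Replace and find] can be had by listing every hit with a bit of context before changing anything. A rough Python sketch; `list_candidates` and the 20-character context window are assumptions for illustration, not a calibre feature.

```python
import re

def list_candidates(text, pattern, context=20):
    """Return (position, snippet) for each hit so it can be eyeballed first."""
    hits = []
    for m in re.finditer(pattern, text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        hits.append((m.start(), text[start:end]))
    return hits

sample = "He met T. Beale. End of sentence T. Next one."
for pos, snippet in list_candidates(sample, r"([A-Z])\. ([A-Z])"):
    print(pos, repr(snippet))
```

Both the name and the sentence end show up in the list, which is the point: the decision stays with the person reading the context.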
Old 03-28-2021, 07:44 PM   #5
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by phossler View Post
I think I'll need two passes
Yep, that sounds about right.

Quote:
Originally Posted by retiredbiker View Post
For the "Thomas J. Beale" sort of case, ([a-z])\. ([A-Z])\. ([A-Z]) with match case checked should work OK. Your other searches look OK, but I'd do "find and replace" throughout the book for any of them rather than "replace all". In your "A. B. Charles" case, I think you want another nbsp to go between the B and C in the replace string.
Agreed.

That's ~ the regex I use as well... except I use regex to normalize "First. Middle." into a single chunk:

F. A. Hayek -> F.A. Hayek
W. E. B. Du Bois -> W.E.B. Du Bois

or normalizing states/acronyms/times:

C. A. -> C.A.
N. Y. C. -> N.Y.C.
A. M. -> A.M.
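The normalizations listed here can be approximated in Python with a single lookahead substitution. This is a sketch under assumptions, not necessarily the regex used for the examples above; `merge_initials` is a hypothetical name.

```python
import re

def merge_initials(text):
    # Drop the space after an initial only when another "X." follows,
    # so the space before the surname is left alone.
    return re.sub(r"\b([A-Z]\.) (?=[A-Z]\.)", r"\1", text)

for sample in ["F. A. Hayek", "W. E. B. Du Bois", "N. Y. C.", "A. M."]:
    print(sample, "->", merge_initials(sample))
```

It merges any run of single capital letters followed by periods, so the hits still need a look before accepting them.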

Quote:
Originally Posted by retiredbiker View Post
You are on the right track, but in this case I can't imagine a search where you would want to run a "replace all". The possibilities of non-name text fitting the search are pretty large no matter what. No regex string can tell if a word is a person's name.
Exactly. Needs to be looked at on a case-by-case basis.*

Regex alone is "too dumb". To lower the errors, you'd need something that can actually parse the sentence structure.

Antidote is a grammar checker, and is the only one I know of that can detect/combine First + Middle + Last Name (along with units + dates/times + [...]).

See their list of space detections:

https://documentation.antidote.info/...s/spaces-panel

Antidote was designed for French first, where "non-breaking thin spaces" are used all over the place around punctuation.

Side Note: I wrote a detailed analysis of Antidote in:

I also discussed a few similar regexes over the years (like ALL CAPS->Smallcaps or Roman Numerals):

Side Note #2: You may also be able to hackishly use Spellcheck Lists:

I explained multiple methods to combine "e m p h a s i s" into "emphasis".

* Note: What I wrote in Post #12 in the topic above still applies:

Quote:
Originally Posted by Tex2002ans View Post
You can use Regex to do the vast bulk of the corrections, then manually fix the edge cases.

Better/faster to do:
  • 95% correct with a 2-step regex.
  • 5% manually find/correct/fix.

than:
  • 100% manually fix.
So it's up to you where you want to spend your time and do your fixing.

Quote:
Originally Posted by phossler View Post
Eventually I'd like to extend this to joining dates. Maybe some others

March 22, 2021 --> Marchnbs22,nbs2021
To detect dates, I use these:

Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)

Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),

They can be adjusted as needed.
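As a sketch of how the two searches could be turned into the "March 22, 2021" join asked about earlier, here is a Python version. Requiring a four-digit year after the comma, and the replacement strings themselves, are assumptions added for illustration; `join_dates` is a made-up name.

```python
import re

NBSP = "\u00a0"

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
FULL_DATE = re.compile(rf"\b({MONTHS}) (\d{{1,2}}), (\d{{4}})\b")
ABBREVIATED = re.compile(r"\b(Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)")

def join_dates(text):
    text = FULL_DATE.sub(rf"\1{NBSP}\2,{NBSP}\3", text)  # March 22, 2021
    text = ABBREVIATED.sub(rf"\1.{NBSP}\2", text)        # Sept. 5
    return text

print(repr(join_dates("It happened on March 22, 2021 and again on Sept. 5.")))
```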

Last edited by Tex2002ans; 03-28-2021 at 08:32 PM.
Old 03-29-2021, 09:12 AM   #6
phossler
Wizard
Quote:
Originally Posted by Tex2002ans View Post
That's ~ the regex I use as well... except I use regex to normalize "First. Middle." into a single chunk:

F. A. Hayek -> F.A. Hayek
W. E. B. Du Bois -> W.E.B. Du Bois

or normalizing states/acronyms/times:

C. A. -> C.A.
N. Y. C. -> N.Y.C.
A. M. -> A.M.
Some good ideas there - thanks

Q1: Are there rules or is it personal choice that F. A. Hayek -> F.A. Hayek is correct? I typically use F. A. Hayek with spaces

Q2: I'd think that CA and NYC would be the correct acronym?

Q3: Related to Q1. I'm on the fence about A.M. vs A. M.. I think I prefer the no-space version, but are there any rules?
Old 03-29-2021, 12:39 PM   #7
theducks
Well trained by Cats
I have 3 names: Fn Mn Ln not FnMnLn or Fi.Mi.Ln
so I take it Personal when you get it wrong
(My real initials form another common nickname if not separated)

I use the Quality Check PI to Fix authors initials to all have spaces
Old 03-29-2021, 12:44 PM   #8
phossler
Wizard
So would you be/prefer

Fi. Mi. Ln

or

Fi.Mi. Ln

?
Old 03-29-2021, 02:36 PM   #9
theducks
Well trained by Cats
Quote:
Originally Posted by phossler View Post
So would you be/prefer

Fi. Mi. Ln

or

Fi.Mi. Ln

?
With spaces (the first), but regular spaces if in text, as NBS would make crummy justification on many devices. Trade-off (I really hate huge gaps in text)
Old 03-29-2021, 06:57 PM   #10
Tex2002ans
Wizard
Quote:
Originally Posted by theducks View Post
With spaces (the first), but regular spaces if in text, as NBS would make crummy justification on many devices. Trade-off (I really hate huge gaps in text)


Personally, in ebooks, I would avoid using these non-breaking spaces. They're going to cause more headaches than they're worth.

But for Print, see some of the notes below.

Side Note: In ebooks, the only non-breaking spaces I use are in 3- and 4+-dot ellipses.

Quote:
Originally Posted by phossler View Post
Q1: Are there rules or is it personal choice that F. A. Hayek -> F.A. Hayek is correct? I typically use F. A. Hayek with spaces
The most important thing is consistency throughout.

But this all comes down to preferences (Style Guides / House Styles).

For example, the publishers I do a lot of work for prefer the "F.M. Last" form.

If I run across a book where 90% are one way, and 10% are the other way, better to choose one and make it consistent (and I personally lean towards no space).

If the book is 100% correct though, then I don't mess with it. (This is very rare though, these types of inconsistencies sneak in all the time, just like mismatching hyphenation.)

- - - - -

Pass #1: Take care of the "easy".

Search: (\b[A-Z]\.) ([A-Z]\.) ([A-Z])
Replace: \1\2 \3

This catches "F. M. Last" -> "F.M. Last".

Pass #2: Take care of the "hard" + all the rest.

Search: (\b[A-Z]\.) ([A-Z])
Replace: \1&nbsp;\2

This will catch "First M. Last" -> "First M.&nbsp;Last".
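For experimenting outside the editor, the two passes translate to Python roughly as below. The sketch assumes the space in the Pass #2 replacement is meant to be a no-break space (U+00A0); the function names are made up.

```python
import re

NBSP = "\u00a0"

def pass_one(text):
    # "F. M. Last" -> "F.M. Last"
    return re.sub(r"(\b[A-Z]\.) ([A-Z]\.) ([A-Z])", r"\1\2 \3", text)

def pass_two(text):
    # "First M. Last" -> "First M.<nbsp>Last" (assumed intent)
    return re.sub(r"(\b[A-Z]\.) ([A-Z])", rf"\1{NBSP}\2", text)

print(repr(pass_two(pass_one("F. M. Last met First M. Last"))))
```

Pass #1 will just as happily merge an abbreviated journal title such as "L. Q. Rev.", so the output still needs checking.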

Exceptions

Depending on how the rest of the book deals with acronyms, there may be very few errors, or a ton. Here are a few examples you may run across:

I ran this on a ~2 million word journal.

Pass #1:

Many styles shorten journal/newspaper names:

Quote:
I. Maurice Wormser, “The True Conception of Unilateral Contracts”, <i>Selected Readings</i>, pp. 307, 308–309. Compare Frederick Pollock, review of Clarence D. Ashley, <i>The Law of Contracts</i>, 28 <i>L. Q. Rev.</i> 100.
It would be wrong to merge this into "L.Q. Rev".

Pass #2:

U.S. (1 of these is a sentence-ender):

Quote:
This distinction was originally continued in America, as evidenced by the placement of the bankruptcy clause in the commerce section of the U.S. Constitution.

“A single, ultimate arbiter of conflicts (e.g. the U.S. Supreme Court) is considered non-essential.”

The book was a landmark for the development of sociology in the U.S. Spencer emphasized a science of sociology which would teach men to think of social causation in a scientific way.
(If you're writing a book, don't end the sentence on an acronym like that... It's very poor form. Rewrite sentence instead. [Or go with the superior "US"!])

B.C. (sentence-ender?):

Quote:
[...] the jurist-king of Babylonia who reigned in the 18th century B.C. Hammurabi provided for liquidation of the assets of the insolvent debtor and their distribution among creditors [...]
C.I.A. + F.B.I. (sentence-ender):

Quote:
[...] as is also the case with para-military organizations like the C.I.A. and F.B.I. This often surreptitious attrition of popular control runs counter [...]
As you can see, this is where Natural Language Processing would cut down on the false hits. You need something smarter than plain ol' Regex IF you want to do mass search/replaces.
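Short of real NLP, a crude stoplist can at least route the known troublemakers to manual review. A Python sketch; the abbreviation list and the `triage` helper are hypothetical, and it cannot decide whether "U.S." ends a sentence. It only flags the hit.

```python
import re

# Hypothetical stoplist; extend it per book.
KNOWN_ABBREVIATIONS = {"U.S.", "B.C.", "A.D.", "A.M.", "P.M.", "C.I.A.", "F.B.I."}

# One or more "X." chunks, a space, then a capital letter.
CANDIDATE = re.compile(r"\b((?:[A-Z]\.)+) ([A-Z])")

def triage(text):
    """Return (token, needs_review) for every initials-like hit."""
    return [(m.group(1), m.group(1) in KNOWN_ABBREVIATIONS)
            for m in CANDIDATE.finditer(text)]

print(triage("the commerce section of the U.S. Constitution"))
print(triage("review of Clarence D. Ashley"))
```

Hits flagged for review get skipped during a mass replace and handled by hand; everything else is still only "probably a name".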

Side Note: LanguageTool detects some of this stuff in the backend while parsing sentences so it can recommend proper capitalization, but doesn't (yet) recommend non-breaking spaces like Antidote:

https://languagetool.org/

And tools like NLTK (Natural Language Toolkit) may detect even more of these cases:

https://www.nltk.org/index.html

Quote:
Originally Posted by phossler View Post
Q2: I'd think that CA and NYC would be the correct acronym?
Depends on the source. Again, consistency within a document/book matters... but, you could have:

Book titles:

Quote:
Duncan Campbell, <i>War Plan UK: The Truth About Civil Defence in Britain</i> (London: Burnett, 1982).
Publisher locations:

Quote:
Bruno Leoni, <i>Freedom and the Law</i> (Princeton, N. J.: D. Van Nostrand, 1961)
Quite often, when discussing voting, the party+state is combined:

Quote:
Thomas Massie (R-KY)
Not smart to just Replace All or add/remove periods haphazardly. :P

Quote:
Originally Posted by phossler View Post
Q3: Related to Q1. I'm on the fence about A.M. vs A. M.. I think I prefer the no-space version, but are there any rules?
There are ~5 or 6 different styles for am/pm. See Posts #35–38:

https://www.mobileread.com/forums/sh...38#post3978438

(And I linked to a Wikipedia post describing some styles/conventions around the world.)

Side Note: Of course, just use the proper 24-hour clock and you won't have to deal with the am/pm nonsense!

Quote:
Originally Posted by phossler View Post
[...] but are there any rules?
If you're typesetting a physical print book though, there are some typographical guidelines you may want to follow. For example:

Titles:
  • Mr. Smith -> Mr.~Smith
  • Ms. Jones -> Ms.~Jones
  • Dr. John Ioannidis -> Dr.~John Ioannidis

Saint/Street:
  • St. Martin's Church -> St.~Martin's Church
  • Meet me at Main St. and Example Blvd.

Roman Numerals:
  • Charles V -> Charles~V
  • King Louis XIV -> King Louis~XIV

Chapter/Part/Appendix:
  • See Chapter 1 -> See Chapter~1
  • Appendix A -> Appendix~A

Pages:
  • See page 123 -> See page~123
  • See pp. 123–150 -> See pp.~123–150

Volume:
  • See vol. 1 -> See vol.~1

Editions:
  • <i>Title of Book</i>, 2nd ed. -> <i>Title of Book</i>, 2nd~ed.

Ordinals:
  • I scored 2nd place -> I scored 2nd~place

Variables:
  • Look at line <i>y</i> -> Look at line~<i>y</i>

Equation/Figure Numbers:
  • The solution is in Equation 1.2 -> The solution is in Equation~1.2
  • Fig. 1.2 shows you the location -> Fig.~1.2 shows you the location

Units:
  • Drive for 3.5 km -> Drive for 3.5~km
  • Run for 2000 m -> Run for 2000~m

Maths/Equations*:
  • a + b = c -> a~+~b~=~c

et al.
  • This genius example (Tex et al., 2021) -> This genius example (Tex et~al., 2021)
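A handful of the guidelines above can be expressed as a small rule table. This Python sketch covers only a few of them, uses U+00A0 where the list writes "~", and is an illustration under those assumptions rather than a complete or print-ready rule set; `tie` and `RULES` are made-up names.

```python
import re

NBSP = "\u00a0"  # written as "~" in the list above

RULES = [
    # Titles and Saint, only when a capitalized word follows
    (re.compile(r"\b(Mrs?|Ms|Dr|St)\. (?=[A-Z])"), rf"\1.{NBSP}"),
    # Chapter 1, Appendix A, Equation 1.2, page 123
    (re.compile(r"\b(Chapter|Appendix|Equation|page) (?=[0-9A-Z])"), rf"\1{NBSP}"),
    # pp. 123, vol. 1, Fig. 1.2
    (re.compile(r"\b(pp|vol|Fig)\. (?=\d)"), rf"\1.{NBSP}"),
    # 3.5 km, 2000 m
    (re.compile(r"(\d) (km|m)\b"), rf"\1{NBSP}\2"),
]

def tie(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(repr(tie("Dr. John Ioannidis, see Chapter 1, pp. 123–150, drive for 3.5 km")))
```

"Main St. and Example Blvd." is left untouched because a lowercase word follows "St.", matching the street example in the list.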

For a few more examples, see:

TeX Stack Exchange: "When should I use non-breaking space?"

* Note: Proper maths typography is a whole other can of worms...

- - - - -

You could also skim Robert Bringhurst's "The Elements of Typographic Style" or relevant Style Guides.

For example, Bringhurst's Chapter 2.1.5 recommends:

Quote:
2.1.5 Add little or no space within strings of initials.

Names such as W. B. Yeats and J. C. L. Prillwitz need hair spaces, thin spaces or no spaces at all after the intermediary periods. A normal word space follows the last period in the string.
Chapter 2.4.6:

Quote:
2.4.6 Link short numerical and mathematical expressions with hard spaces.

All you may see on the keyboard is a space bar, but typographers use several invisible characters: the word space, fixed spaces of various sizes (em space, en space, thin space, figure space, etc) and a hard space or no-break space. The hard space will stretch, like a normal word space, when the line is justified, but it will not convert to a linebreak. Hard spaces are useful for preventing linebreaks within phrases such as 6.2 mm, 3 in., 4 × 4, or in phrases like page 3 and chapter 5.

When it is necessary to break longer algebraic or numerical expressions, such as a + b = c, the break should come at the equal sign or another clear logical pause.
Chicago Manual of Style:

Quote:
9.20 Decimal places-European practice.

In European countries, except for Great Britain, the decimal point is represented by a comma. A thin, fixed space, not a comma, separates groups of three digits, whether to the left or to the right of the decimal point. (In electronic publications, a nonbreaking space may be used.) This practice reflects European-style SI usage (see 9.56). Canadians increasingly follow SI usage, retaining the decimal point (or, in French-language contexts, the comma) but using a thin space to separate groups of three digits. In US publications, US style should be followed, except in direct quotations. See also 10.61.

36 333,333 (European style)
36 333.333 (Canadian style)
36,333.333 (US and British style)
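The three number styles can be reproduced with a few lines of Python. In this sketch the thin fixed space is assumed to be U+202F (NARROW NO-BREAK SPACE), only the integer part is grouped, and `group_digits` is a hypothetical helper.

```python
THIN_NBSP = "\u202f"  # assumption: NARROW NO-BREAK SPACE as the thin space

def group_digits(number, decimal_mark=",", sep=THIN_NBSP):
    """Group the integer part of a plain decimal string in threes."""
    int_part, _, frac = number.partition(".")
    grouped = f"{int(int_part):,}".replace(",", sep)
    return grouped + (decimal_mark + frac if frac else "")

print(repr(group_digits("36333.333")))                             # European style
print(repr(group_digits("36333.333", decimal_mark=".")))           # Canadian style
print(repr(group_digits("36333.333", decimal_mark=".", sep=",")))  # US and British style
```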
- - - - -

Speaking of other languages... languages like Polish/Czech have typographical rules on "single letters" at the end of lines:

TeX Stack Exchange: "one-letter word at the end of line"

So if dealing with those types of books, you may need to use non-breaking spaces. (Although this stuff really should be dealt with at the device/layout-algorithm level.)

Last edited by Tex2002ans; 03-30-2021 at 04:01 PM.
Old 03-30-2021, 12:23 PM   #11
Brett Merkey
Not Quite Dead
Posts: 189
Karma: 654170
Join Date: Jul 2015
Device: Paperwhite 4; Galaxy Tab
It is clear that the last post of Tex2002ans went beyond the call of duty and should be enshrined as the authoritative source on the usage of non-breaking spaces in e-book construction. The Internet Gods will be duly notified.
Old 03-30-2021, 04:13 PM   #12
phossler
Wizard
Quote:
Originally Posted by Brett Merkey View Post
It is clear that the last post of Tex2002ans went beyond the call of duty and should be enshrined as the authoritative source on the usage of non-breaking spaces in e-book construction. The Internet Gods will be duly notified.
Agreed

I'm still reading and digesting the last one

His other post and the references also need to be enshrined

------------------------------------------------------------------------------

Now what the Calibre world needs is for a smart PI developer to capture the logic and options since RegEx would seem to be limited

Last edited by phossler; 03-30-2021 at 04:17 PM.
Old 03-30-2021, 05:35 PM   #13
Tex2002ans
Wizard
Quote:
Originally Posted by Brett Merkey View Post
It is clear that the last post of Tex2002ans went beyond the call of duty and should be enshrined as the authoritative source on the usage of non-breaking spaces in e-book construction. The Internet Gods will be duly notified.


Quote:
Originally Posted by phossler View Post
Agreed

I'm still reading and digesting the last one

His other post and the references also need to be enshrined
All my ebook knowledge will be compiled/enshrined on my blog soon* (see my signature).

- - -

* ... eventually... (grumble grumble, two+ years later... where the hell is this time flying? )

Quote:
Originally Posted by phossler View Post
Now what the Calibre world needs is for a smart PI developer to capture the logic and options since RegEx would seem to be limited
Bits and pieces already exist...

But the reason why non-breaking spaces work better in Print is because they have access to:
  • Justification at the paragraph level
  • Microtypography
  • Hyphenation

and you know exactly where text will land, so you can manually adjust if needed.

Ebooks do:

This is why it's easier to just use simple normal spaces in nearly all cases.

Adding &nbsp; willy-nilly all over the place, while it might keep some things together, would lead to horrible tradeoffs elsewhere. (Even making it horrible to read/search/spellcheck your code!)

Side Note: Yes, if the language rules demand certain spacing (like French with their thin spaced « guillemets »), a &nbsp; in your ebook would be "acceptable"... but tread lightly.

Last edited by Tex2002ans; 03-30-2021 at 05:41 PM.
Old 03-31-2021, 10:09 AM   #14
phossler
Wizard
Quote:
Originally Posted by Tex2002ans View Post


All my ebook knowledge will be compiled/enshrined on my blog soon* (see my signature).

- - -

* ... eventually... (grumble grumble, two+ years later... where the hell is this time flying? )


Adding &nbsp; willy-nilly all over the place, while it might keep some things together, would lead to horrible tradeoffs elsewhere. (Even making it horrible to read/search/spellcheck your code!)
1. Looking forward to the blog

2. While I mostly agree that &nbsp; everywhere would most likely mess up the reflow on an ereader, there are still some constructs that IMHO really do need a &nbsp; so that you don't get REALLY bad line breaks:

Quote:
blah blah blah and gave it to Dr.<nl>
Smith to blah blah
Since I don't do this stuff for a living or for others (just me and my Kindle [sounds like a country western song]) 'house style' = 'my style', and if I decide I don't like it, it's easy 'nuff to undo it. I've changed my mind many times to remove what seemed like a good thing at the time

3. Hope you get the blog online soon (Digital Slug??)
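The "easy 'nuff to undo" part of point 2 can be sketched in Python as a single substitution. It assumes the no-break spaces were written either as the raw U+00A0 character or as one of the usual entity spellings, and it reverts all of them, including any that were there on purpose; `undo_nbsp` is a made-up name.

```python
import re

# Entity spellings plus the raw U+00A0 character.
ANY_NBSP = re.compile("&nbsp;|&#160;|&#[xX][aA]0;|\u00a0")

def undo_nbsp(html):
    """Turn every no-break space back into a normal space."""
    return ANY_NBSP.sub(" ", html)

print(undo_nbsp("gave it to Dr.&nbsp;Smith and Mr.\u00a0Jones"))
```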
Old 03-31-2021, 05:50 PM   #15
Tex2002ans
Wizard
Quote:
Originally Posted by phossler View Post
2. While I mostly agree that &nbsp; everywhere would most likely mess up the reflow on an ereader, there are still some constructs that IMHO really do need a &nbsp; so that you don't get REALLY bad line breaks:

Code:
blah blah blah and gave it to Dr.<nl>
Smith to blah blah
Yep, but I only apply it in Print.

Mr./Mrs./Dr. are my biggest ones, so I sometimes just use a simple regex:

Search: (Mrs?|Drs?)\. ([A-Z])
Replace: \1.&nbsp;\2

I was digging through LanguageTool, and here's a more comprehensive one they use:

Search: \b(Atty|Sg?t|[SG]en|Ft|Gov|Hon|Prof|Mr?s|Mt|[DMJS]r|Col|Maj|L(ieu)?t|Brig|Capt|Cmdr|Cmnd|Revd?|Rep)\.\s[A-Z]

That should also cover Prof./Col./Capt./Gov./Rev. and others.
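
The longer pattern is a search only, so it needs capture groups before it can drive a replacement. A rough Python sketch, with some assumptions: the separator after the abbreviation is read as an escaped period (`\.`), `L(ieu)?t` is made non-capturing, and the capital letter is captured. The name `join_abbreviations` is hypothetical.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

# Alternation copied from the post. Assumed changes: "(?:ieu)" is
# non-capturing and the capital is captured, so the groups are
# 1 = abbreviation, 2 = first letter of the following word.
ABBREV_RE = re.compile(
    r"\b(Atty|Sg?t|[SG]en|Ft|Gov|Hon|Prof|Mr?s|Mt|[DMJS]r|Col|Maj|"
    r"L(?:ieu)?t|Brig|Capt|Cmdr|Cmnd|Revd?|Rep)\.\s([A-Z])"
)

def join_abbreviations(text):
    """Return (new_text, count); the whitespace after each abbreviation becomes NBSP."""
    return ABBREV_RE.subn(lambda m: f"{m.group(1)}.{NBSP}{m.group(2)}", text)

new_text, count = join_abbreviations("Prof. Plum gave it to Dr.\nSmith at Ft. Knox.")
print(count)  # 3
```

Since `\s` also matches a newline, a "Dr." that was split from "Smith" in the source gets joined too. The returned count gives a quick idea of how many no-break spaces a pass would add.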

* * *

But you have to think:

What are the actual chances of "Mr." or "p." landing at the end of a line?
  • 95%+ exist within the line
  • Only a handful are going to land at the very end of a line.
    • In ebooks, probably more likely, since it'll be read on skinnier devices + larger fonts

but we're still talking about a very small percentage.

So, here's a real-life example from a history book I typeset last year (189k words, 595 pages).

Left = Normal Spaces throughout
Right = No-Break Spaces

1 "p." + 1 "F.M. Last":

[Attached images: p.087[Default].png | p.087[NoBreakSpaces].png]

1 "p." + 1 "pp.":

[Attached images: p.270[Default].png | p.270[NoBreakSpaces].png]

1 "No.":

[Attached images: p.490[Default].png | p.490[NoBreakSpaces].png]

And see if you can spot the no-break space here:

[Attached images: p.218[Default].png | p.218[NoBreakSpace].png]

There were ~6000 no-break spaces added throughout the book:
  • Most land in the middle of text, barely perceptible nudging.
  • Only a small % of those actually ever land at the end of a line.

And as you can see in the 218 example above, paragraph-level justification + hyphenation automagically takes care of the vast majority of THOSE, so you see even LESS spacing/line-breaking problems. (A percentage of a percentage.)

Side Note: I searched the book for "p." + "pp.":
  • 1425 total.
  • 21 fell at the end of a line.

~1.5%.

The % is probably similar for all the other categories I listed earlier too.

So out of 6000 cases, ~90 might've made a readable difference.
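
The back-of-the-envelope numbers above can be checked in a few lines of Python. The variable names are made up, and it assumes the "p."/"pp." rate carries over to all ~6000 no-break spaces, as the post guesses.

```python
# Figures quoted in the post.
total_page_refs = 1425   # "p." + "pp." occurrences in the book
at_line_end = 21         # how many of those fell at the end of a line
total_nbsp = 6000        # approximate no-break spaces added overall

rate = at_line_end / total_page_refs
estimate = total_nbsp * rate

print(f"{rate:.1%}")     # 1.5%
print(round(estimate))   # 88
```

Using the unrounded rate gives about 88, which is in line with the ~90 you get from the rounded 1.5%.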

(There was actually 1 case of "Sen." ending a page. Now that was an egregious issue.)

Quote:
Originally Posted by phossler
Since I don't do this stuff for a living or for others (just me and my Kindle [sounds like a country western song]), 'house style' = 'my style', and if I decide I don't like it, it's easy 'nuff to undo it. I've changed my mind many times to remove what seemed like a good thing at the time.
In ebooks, I kind of relate it to Soft Hyphens.

Sure, you can run the HyphenateThis plugin to try to correct for no/bad hyphenation on some devices... but definitely don't use it in a book you want to publish.

Quote:
Originally Posted by phossler
3. Hope you get the blog online soon (Digital Slug??)
Yep, that's the title.

Quote:
Originally Posted by phossler
1. Looking forward to the blog
Me too.

I've been prepping for quite a while (hence compiling/referencing all these links to old topics).

Plus I've been re-pumping myself up the past few weeks. I need to kick everything back into gear.

Last edited by Tex2002ans; 03-31-2021 at 06:41 PM.