#1 | |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
RegEx: Insert nbs between initials, etc.
Looking to see if there is a RegEx that will help insert non-breaking spaces to join names, etc., so as to avoid a line wrap in the middle.
For example, in the fragment below, I'd like to insert an NBS after the 's' and before the 'B' so that 'Thomas J. Beale' doesn't line wrap.
Quote:

#1 - A. B. Charles --> A.nbsB.nbsC
Find: ([A-Z])\. ([A-Z])\. ([A-Z])
Replace: \1.\xA0\2.\3

#2 - A. Charles --> A.nbsC
Find: ([A-Z])\. ([A-Z])
Replace: \1.\xA0\2

The problem: if a sentence ends with an upper-case letter, the RE finds it:

blah blah end of sentence T. Next sentence

Is there a better way to do the RE?

Eventually I'd like to extend this to joining dates, and maybe some others:

March 22, 2021 --> Marchnbs22,nbs2021
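To make the false positive concrete, here is a minimal Python sketch of Find/Replace #2 from this post. The function name `join_initials` and the use of Python's `re` module are illustrative assumptions; the pattern itself is the one given above, and the same expression should behave similarly in an editor's regex search.

```python
import re

NBSP = "\u00a0"  # the no-break space written as \xA0 in the post

# Find #2 from the post: a capital, a period, a space, then another capital.
initial_then_name = re.compile(r"([A-Z])\. ([A-Z])")


def join_initials(text: str) -> str:
    """Swap the space after a single-letter initial for a no-break space."""
    return initial_then_name.sub(
        lambda m: f"{m.group(1)}.{NBSP}{m.group(2)}", text
    )


# Intended hit: the space between "J." and "Beale" becomes a no-break space.
name = join_initials("Thomas J. Beale")

# Unintended hit: a sentence ending in a capital letter matches as well,
# so "T." gets glued to the first word of the next sentence.
sentence = join_initials("blah blah end of sentence T. Next sentence")
```

Both inputs match because the pattern only sees letters and punctuation, not whether the capital is an initial.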
#2 |
Evangelist
Posts: 450
Karma: 3886916
Join Date: May 2013
Location: Ontario, Canada
Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma
|
You are on the right track, but in this case I can't imagine a search where you would want to run a "replace all". The possibilities of non-name text fitting the search are pretty large no matter what. No regex string can tell if a word is a person's name.
For the "Thomas J. Beale" sort of case, ([a-z]) ([A-Z])\. ([A-Z]) with match case checked should work OK. Your other searches look OK, but I'd do "find and replace" one match at a time throughout the book for any of them rather than "replace all". In your "A. B. Charles" case, I think you want another nbsp to go between the B and C in the replace string.
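A rough Python sketch of the "Thomas J. Beale" case, assuming the intent is: a lowercase letter, a space, a single capital initial with its period, a space, then a capital. The function name `join_middle_initial` is made up for illustration, and the replacement puts a no-break space on both sides of the initial, as the original question asked.

```python
import re

NBSP = "\u00a0"

# Assumed reading of the suggestion: lowercase letter, space, one capital
# initial, period, space, capital. Python's re is case-sensitive by default,
# which corresponds to having "match case" checked.
name_with_middle_initial = re.compile(r"([a-z]) ([A-Z])\. ([A-Z])")


def join_middle_initial(text: str) -> str:
    """Put no-break spaces on both sides of a middle initial."""
    return name_with_middle_initial.sub(rf"\1{NBSP}\2.{NBSP}\3", text)


joined = join_middle_initial("Thomas J. Beale")

# A sentence that ends in a single capital still matches, which is why
# stepping through the hits one at a time is safer than "replace all".
still_wrong = join_middle_initial("blah blah end of sentence T. Next sentence")
```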
#3 |
Well trained by Cats
Posts: 31,046
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
How about A.M. or P.M.?
Really, do not use replace all. You can verify every hit with Find (to skip) or Replace & Find pretty fast. I keep Preview up on a monitor to validate 'context' issues.
#4 | |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Quote:
@retiredbiker @theducks While I'd like to be able to do a [Replace All], I didn't trust my RE, so I figured I'd ask more experienced users if there was a more advanced way.

A [Replace and find] takes longer but is safer.
#5 | ||||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Yep, that sounds about right.
Quote:
That's ~ the regex I use as well... except I use regex to normalize "First. Middle." into a single chunk:

F. A. Hayek -> F.A. Hayek
W. E. B. Du Bois -> W.E.B. Du Bois

or normalizing states/acronyms/times:

C. A. -> C.A.
N. Y. C. -> N.Y.C.
A. M. -> A.M.

Quote:
Regex alone is "too dumb". To lower the errors, you'd need something that can actually parse the sentence structure.

Antidote is a grammarchecker, and is the only one I know of that can detect/combine First + Middle + Last Name (along with units + dates/times + [...]). See their list of space detections:

https://documentation.antidote.info/...s/spaces-panel

Antidote was designed for French first, where "non-breaking thin spaces" are used all over the place around punctuation.

Side Note: I wrote a detailed analysis of Antidote in:

I also discussed a few similar regexes over the years (like ALL CAPS->Smallcaps or Roman Numerals):
Side Note #2: You may also be able to hackishly use Spellcheck Lists:
I explained multiple methods to combine "e m p h a s i s" into "emphasis". * Note: What I wrote in Post #12 in the topic above still applies: Quote:
Quote:
Search: (Jan|Feb|Mar|Apr|Aug|Sept|Oct|Nov|Dec)\. (\d)
Search: (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}),

They can be adjusted as needed.

Last edited by Tex2002ans; 03-28-2021 at 07:32 PM.
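As one possible adjustment, the full-month search can be extended to the "March 22, 2021" case from the first post. This is only a sketch: the four-digit year group, the replacement string, and the name `join_date` are assumptions added for illustration, not part of the searches above.

```python
import re

NBSP = "\u00a0"

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# The full-month search from the post, extended (assumption) with a space
# and a four-digit year so both gaps in the date can be joined.
full_date = re.compile(rf"({MONTHS}) (\d{{1,2}}), (\d{{4}})")


def join_date(text: str) -> str:
    """Replace both spaces inside 'Month D, YYYY' with no-break spaces."""
    return full_date.sub(rf"\1{NBSP}\2,{NBSP}\3", text)


joined_date = join_date("Published on March 22, 2021 in print.")
```

Dates without a year, or with abbreviated months, are left untouched by this version and would need their own pattern.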
#6 | |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Quote:
Q1: Are there rules, or is it personal choice, that F. A. Hayek -> F.A. Hayek is correct? I typically use F. A. Hayek with spaces.

Q2: I'd think that CA and NYC would be the correct acronyms?

Q3: Related to Q1. I'm on the fence about A.M. vs A. M. I think I prefer the no-space version, but are there any rules?
#7 |
Well trained by Cats
Posts: 31,046
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
I have 3 names: Fn Mn Ln, not FnMnLn or Fi.Mi.Ln,
so I take it personally when you get it wrong. (My real initials form another common nickname if not separated.) I use the Quality Check PI to fix authors' initials so they all have spaces.
#8 |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
So would you be/prefer
Fi. Mi. Ln or Fi.Mi. Ln ? |
#9 |
Well trained by Cats
Posts: 31,046
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
|
#10 | |||||||||||||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Personally, in ebooks, I would avoid using these non-breaking spaces. They're going to cause more headaches than they're worth.

But for Print, see some of the notes below.

Side Note: In ebooks, the only non-breaking spaces I use are in 3- and 4+-dot ellipses.

Quote:
But this all comes down to preferences (Style Guides / House Styles).

For example, the publishers I do a lot of work for prefer the "F.M. Last" form.

If I run across a book where 90% are one way, and 10% are the other way, better to choose one and make it consistent (and I personally lean towards no space).

If the book is 100% correct though, then I don't mess with it.

- - - - -

Pass #1: Take care of the "easy".

Search: (\b[A-Z]\.) ([A-Z]\.) ([A-Z])
Replace: \1\2 \3

This catches "F. M. Last" -> "F.M. Last".

Pass #2: Take care of the "hard" + all the rest.

Search: (\b[A-Z]\.) ([A-Z])
Replace: \1 \2

This will catch "First M. Last" -> "First M. Last".

Exceptions

Depending on how the rest of the book deals with acronyms, there may be very few errors, or a ton. Here are a few examples you may run across:

I ran this on a ~2 million word journal.

Pass #1:

Many styles shorten journal/newspaper names:

Quote:
Pass #2: U.S. (1 of these is a sentence-ender): Quote:
B.C. (sentence-ender?): Quote:
Quote:
Side Note: LanguageTool detects some of this stuff in the backend while parsing sentences so it can recommend proper capitalization, but doesn't (yet) recommend non-breaking spaces like Antidote:

https://languagetool.org/

And tools like NLTK (Natural Language Toolkit) may detect even more of these cases:

https://www.nltk.org/index.html

Depends on the source. Again, consistency within a document/book matters... but, you could have:

Book titles:

Quote:
Quote:
Quote:
Quote:
https://www.mobileread.com/forums/sh...38#post3978438

(And I linked to a Wikipedia post describing some styles/conventions around the world.)

Side Note: Of course, just use the proper 24-hour clock and you won't have to deal with the am/pm nonsense!

If you're typesetting a physical print book though, there are some typographical guidelines you may want to follow. For example:

Titles:
Saint/Street:
Roman Numerals:
Chapter/Part/Appendix:
Pages:
Volume:
Editions:
Ordinals:
Variables:
Equation/Figure Numbers:
Units:
Maths/Equations*:
et al.
For a few more examples, see:

TeX Stack Exchange: "When should I use non-breaking space?"

* Note: Proper maths typography is a whole other can of worms...

- - - - -

You could also skim Robert Bringhurst's "The Elements of Typographic Style" or relevant Style Guides.

For example, Bringhurst's Chapter 2.1.5 recommends:

Quote:
Quote:
Quote:
Speaking of other languages... languages like Polish/Czech have typographical rules on "single letters" at the end of lines:

TeX Stack Exchange: "one-letter word at the end of line"

So if dealing with those types of books, you may need to use non-breaking spaces. (Although this stuff really should be dealt with at the device/layout-algorithm level.)

Last edited by Tex2002ans; 03-30-2021 at 03:01 PM.
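The Pass #1 / Pass #2 searches earlier in this post can be tried out with a short Python sketch. It assumes the space in each posted replacement string stood for a no-break space (the rendered post shows an ordinary space, so this is a guess); the `joiner` parameter and the name `normalize_initials` are hypothetical.

```python
import re

NBSP = "\u00a0"

# Search patterns as given for Pass #1 and Pass #2.
pass1 = re.compile(r"(\b[A-Z]\.) ([A-Z]\.) ([A-Z])")
pass2 = re.compile(r"(\b[A-Z]\.) ([A-Z])")


def normalize_initials(text: str, joiner: str = NBSP) -> str:
    """Run both passes; `joiner` is the character placed before the surname.

    Assumption: the posted replacements used a no-break space here.
    Pass joiner=" " to only close up the initials.
    """
    text = pass1.sub(rf"\1\2{joiner}\3", text)  # "F. M. Last" -> "F.M. Last"
    return pass2.sub(rf"\1{joiner}\2", text)    # "First M. Last"


two_initials = normalize_initials("F. M. Last")
one_initial = normalize_initials("First M. Last")

# The "U.S." exception: the sentence-ending "S." is indistinguishable from
# an initial, so it gets joined to the start of the next sentence.
sentence_end = normalize_initials("in the U.S. The next")
```

The last call shows why each pass still needs a manual review of its hits.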
#11 |
Not Quite Dead
Posts: 195
Karma: 654170
Join Date: Jul 2015
Device: Paperwhite 4; Galaxy Tab
|
It is clear that the last post of Tex2002ans went beyond the call of duty and should be enshrined as the authoritative source on the usage of non-breaking spaces in e-book construction. The Internet Gods will be duly notified.
|
#12 | |
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Quote:
I'm still reading and digesting the last one.

His other post and the references also need to be enshrined.

- - - - -

Now what the Calibre world needs is for a smart PI developer to capture the logic and options, since RegEx would seem to be limited.

Last edited by phossler; 03-30-2021 at 03:17 PM.
#13 | |||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
- - - * ... eventually... (grumble grumble, two+ years later... where the hell is this time flying?) Quote:
But the reason why non-breaking spaces work better in Print is because they have access to:
and you know exactly where text will land, so you can manually adjust if needed. Ebooks do:
This is why it's easier to just use simple normal spaces in nearly all cases.

Adding &nbsp; willy-nilly all over the place, while it might keep some things together, would lead to horrible tradeoffs elsewhere. (Even making it horrible to read/search/spellcheck your code!)

Side Note: Yes, if the language rules demand certain spacing (like French with their thin spaced « guillemets »), a &nbsp; in your ebook would be "acceptable"... but tread lightly.

Last edited by Tex2002ans; 03-30-2021 at 04:41 PM.
#14 | ||
Wizard
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
Quote:
2. Mostly agree that &nbsp; everywhere would most likely mess up the reflow on an ereader, but there are still some constructs that IMHO really do need a &nbsp; so that you don't get REALLY bad line breaks: Quote:
3. Hope you get the blog online soon (Digital Slug??) |
#15 | ||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Mr./Mrs./Dr. are my biggest ones, so I sometimes just use a simple regex:

Search: (Mrs?|Drs?)\. ([A-Z])
Replace: \1. \2

I was digging through LanguageTool, and here's a more comprehensive one they use:

Search: \b(Atty|Sg?t|[SG]en|Ft|Gov|Hon|Prof|Mr?s|Mt|[DMJS]r|Col|Maj|L(ieu)?t|Brig|Capt|Cmdr|Cmnd|Revd?|Rep)\.\s[A-Z]

That should also cover Prof./Col./Capt./Gov./Rev. and others.

* * *

But you have to think: What are the actual chances of "Mr." or "p." landing at the end of a line?
but we're still talking about a very small percentage.

So, here's a real-life example from a history book I typeset last year (189k words, 595 pages).

Left = Normal Spaces throughout
Right = No-Break Spaces

1 "p." + 1 "F.M. Last":

1 "p." + 1 "pp.":

1 "No.":

And see if you can spot the no-break space here:

There were ~6000 no-break spaces added throughout the book:
And as you can see in the 218 example above, paragraph-level justification + hyphenation automagically takes care of the vast majority of THOSE, so you see even LESS spacing/line-breaking problems. (A percentage of a percentage.)

Side Note: I searched the book for "p." + "pp.":
~1.5%. The % is probably similar for all the other categories I listed earlier too. So out of 6000 cases, ~90 might've made a readable difference.

(There was actually 1 case of "Sen." ending a page. Now that was an egregious issue.)

Quote:
Sure, you can run the HyphenateThis plugin to try to correct for no/bad hyphenation on some devices... but definitely don't use it in a book you want to publish.

Yep, that's the title.

Me too. I've been prepping for quite a while (hence compiling/referencing all these links to old topics). Plus I've been re-pumping myself up the past few weeks. I need to kick everything back into gear.

Last edited by Tex2002ans; 03-31-2021 at 05:41 PM.
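For anyone who wants to try the simple Mr./Mrs./Dr. search from this post outside the editor, here is a small Python sketch. The leading `\b`, the no-break space in the replacement, and the name `join_titles` are assumptions added for illustration; the alternation `(Mrs?|Drs?)` is the one posted.

```python
import re

NBSP = "\u00a0"

# The simple title search from the post, with a leading \b added
# (assumption) so it cannot start in the middle of a longer word.
title_then_name = re.compile(r"\b(Mrs?|Drs?)\. ([A-Z])")


def join_titles(text: str) -> str:
    """Replace the space after Mr./Mrs./Dr./Drs. with a no-break space."""
    return title_then_name.sub(rf"\1.{NBSP}\2", text)


joined_titles = join_titles("Mr. Smith met Dr. Jones and Mrs. Brown.")
```

Extending the alternation with more abbreviations (Prof, Col, Capt, and so on) follows the same shape.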