#1 |
Connoisseur
Posts: 59
Karma: 10
Join Date: Apr 2012
Device: Kindle Fire
|
False paragraph breaks & RegEx
Wonder if someone can help an absolute beginner.
I have a number of books that contain many false paragraph breaks partway through a sentence. An example of the code that appears is </p> <p class="calibre7">, which is of course also the valid code for a correctly placed paragraph break in this book. It normally appears with no space after the last word and no space before the next word, and in most (I'm reluctant to say all) cases the last character before the error is lower case. However, the first character of the following word may well be upper case, for example when a name is used.
Would there be a way to check for that code appearing when it does not follow the common punctuation marks? The class name (calibre7 here) would presumably need to be a variable.
Thanks in hope, as this is way beyond my skills.
Colin
#2 |
Resident Curmudgeon
Posts: 78,930
Karma: 143098300
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Is this a Calibre conversion from a PDF? Can you share some of the code?
|
#3 | |
Connoisseur
Posts: 59
Karma: 10
Join Date: Apr 2012
Device: Kindle Fire
|
Quote:
I have just manually edited the few errors I spotted in the book I'm reading, so I don't have any live examples to show, and a search is obviously no help. I should be able to find an example in a few days, in my next book.
#4 | |
Wizard
Posts: 2,162
Karma: 8800000
Join Date: Jun 2010
Device: Kobo Clara HD,Hisence Sero 7 Pro RIP, Nook STR, jetbook lite
|
Look in the sticky thread, Saved Search/Regex Functions, at the top of this editor forum. I believe the first post will have the help you need.
bernie
#5 | |
Resident Curmudgeon
Posts: 78,930
Karma: 143098300
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
#6 |
Fanatic
Posts: 515
Karma: 2268308
Join Date: Nov 2015
Device: none
|
First, try this in replace-all mode:
Code:
(?<=[\p{Ll},;]) *</p>\s*<p class="[^"]*">(?=\p{Ll})
Then remove the (?=\p{Ll}) part and do manual replacements throughout the file. Then restore that part, remove the (?<=....) one, and continue replacing with manual confirmation.
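For anyone who wants to try the first (replace-all) pass outside the editor, here is a rough Python sketch. It rests on assumptions not stated in the thread: the replacement is taken to be a single space, `[a-z]` stands in for `\p{Ll}` because Python's built-in `re` module does not support Unicode property classes, and the function name and sample text are made up for illustration.

```python
import re

# ASCII-only approximation of the pattern above: Python's built-in `re`
# has no \p{Ll}, so [a-z] stands in for "any lower-case letter".
FALSE_BREAK = re.compile(r'(?<=[a-z,;]) *</p>\s*<p class="[^"]*">(?=[a-z])')


def join_false_breaks(html: str) -> str:
    """Replace each suspected false paragraph break with a single space.

    Assumption: a space is the intended replacement; the thread does not say.
    """
    return FALSE_BREAK.sub(" ", html)


# Hypothetical sample: the first break is false, the second is genuine.
sample = (
    '<p class="calibre7">He walked slowly</p>\n'
    '<p class="calibre7">down the road.</p>\n'
    '<p class="calibre7">Then he stopped.</p>'
)
print(join_false_breaks(sample))
```

Only the break between "slowly" and "down" is joined; the break after the full stop is left alone because the character before `</p>` is neither a lower-case letter, a comma, nor a semicolon. Accented lower-case letters would be missed by this ASCII stand-in.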
#7 | |
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Here are 4 such topics where I go step-by-step and break down the Regular Expressions:
I even discussed it way back in: (Of course, my newer methods are better and fix more things, but those regexes are still useful to see/learn from.)
Last edited by Tex2002ans; 10-21-2022 at 01:42 AM.
#8 | ||
Guru
Posts: 769
Karma: 1537886
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
Quote:
Also, a question about the first ("hyphen") rule. Most of the books I edit tend to end interrupted paragraphs with a dash:
Code:
<p>"Here I am wal-</p>
<p>The monster leaped out and ate my face.</p>
EDIT: And wouldn't the first rule be covered by the second? Are you just separating them to pull the hyphen issues out of the mass?
#9 |
Still reading
Posts: 13,632
Karma: 103503445
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
Cut-off dialogue should end with an em dash with no space before it, followed by the closing quote. The rest depends on the style guide. We use an en dash surrounded by spaces when it's parenthetical and, as with a comma, the second one is omitted at the end of a sentence (unlike actual brackets). Many USA style guides use an em dash without spaces for parenthesis inside a sentence. An actual hyphen shouldn't be an en or em dash, and should only have spaces either side if it's a range.
Actual hyphens come in at least three kinds:
1) A word broken at a suitable point to wrap at the margin.
2) A word that is commonly written with a hyphen. Unless the hyphen separates a repeated vowel, these tend to drop it over the years, so "today" used to be "to-day".
3) Suffixed words, as in "His face was fish-like".
Obviously, when removing excess hyphens, types 2 and 3 must be left alone, and the era of the writing considered to see whether hyphens of type 2 are valid.
#10 | ||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
In Calibre, make sure you have that "Case Sensitive" box checked. In Sigil, all you have to do is make sure you are in "Regex" mode. Quote:
I recommend reading this great article in Wikipedia showing different examples: and my many writings over the years:
- - - In the case where you have a wrong/bad interruption: Code:
<p>"Here I am wal-</p>
<p>The monster leaped out and ate my face.</p>
Find: -</p>
Replace: —</p>
(Replace a HYPHEN at the end of a paragraph with an EM DASH.)
That will get you the correct:
Code:
<p>"Here I am wal—</p>
<p>The monster leaped out and ate my face.</p>
- - - Side Note: Actually, your example is off. Just because your dialogue got interrupted... you'd still need the close quote: Code:
<p>“Here I am wal—”</p>
<p>The monster leaped out and ate my face.</p>
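The same Find/Replace can be sketched in Python for batch use. This is only an illustration: the function name and sample text are invented here, and it handles just the hyphen-before-`</p>` case, not the missing close quote, which still needs a manual check.

```python
import re

EM_DASH = "\u2014"

# A hyphen that sits immediately before a closing </p> tag.
TRAILING_HYPHEN = re.compile(r"-(?=</p>)")


def fix_interrupted_paragraphs(html: str) -> str:
    """Swap a paragraph-final hyphen for an em dash; other hyphens are untouched."""
    return TRAILING_HYPHEN.sub(EM_DASH, html)


sample = '<p>"Here I am wal-</p>\n<p>The monster leaped out and ate my face.</p>'
fixed = fix_interrupted_paragraphs(sample)
print(fixed)
```

Hyphens inside a paragraph, as in "fish-like", are not touched because the lookahead requires `</p>` directly after the hyphen.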
Last edited by Tex2002ans; 10-21-2022 at 03:13 PM.