Old 10-18-2022, 08:46 AM   #1
ColMac
Connoisseur
ColMac began at the beginning.
 
Posts: 59
Karma: 10
Join Date: Apr 2012
Device: Kindle Fire
False paragraph breaks & RegEx

Wonder if someone can help an absolute beginner.

I have a number of books that contain many false paragraph breaks part way through a sentence.

An example of the type of code that appears is

</p>

<p class="calibre7">

This of course is the valid code for a correctly placed paragraph break in this book.

It normally appears with no space after the last word, and no space before the next word, and in most (I'm reluctant to say all) cases the last character before the error will be lower case. However, the first character of the following word may well be upper case when a name is used for example.

Would there be a way to check for that code appearing (and Calibre 7 would presumably need to be a variable) when not following the common punctuation marks?

Thanks in hope, as this is way beyond my skills.

Colin
Old 10-18-2022, 08:52 AM   #2
JSWolf
Resident Curmudgeon
Posts: 74,349
Karma: 129333690
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Is this a Calibre conversion from a PDF? Can you share some of the code?
Old 10-18-2022, 09:10 AM   #3
ColMac
Quote:
Originally Posted by JSWolf View Post
Is this a Calibre conversion from a PDF? Can you share some of the code?
Unfortunately I have no idea. I am looking at them as epubs in Calibre.

I have just manually edited the few errors I spotted in the book I'm reading, so I don't have any live examples to show, and a search obviously is no help.

I should be able to find an example in a few days in my next book
Old 10-18-2022, 09:35 AM   #4
gbm
Wizard
Posts: 2,085
Karma: 8796704
Join Date: Jun 2010
Device: Kobo Clara HD,Hisence Sero 7 Pro RIP, Nook STR, jetbook lite
Look in the sticky thread "Saved Search/Regex Functions" at the top of this Editor forum; I believe the first post will have the help you need.



bernie
Old 10-18-2022, 10:03 AM   #5
JSWolf
Quote:
Originally Posted by ColMac View Post
Unfortunately I have no idea. I am looking at them as epubs in Calibre.

I have just manually edited the few errors I spotted in the book I'm reading, so I don't have any live examples to show, and a search obviously is no help.

I should be able to find an example in a few days in my next book
Where did this ePub come from?
Old 10-18-2022, 10:53 AM   #6
Sarmat89
Evangelist
Posts: 483
Karma: 2267928
Join Date: Nov 2015
Device: none
First, try this in replace all mode:
Code:
(?<=[\p{Ll},;]) *</p>\s*<p class="[^"]*">(?=\p{Ll})
Replace each match with a single space.
Then remove the (?=\p{Ll}) part and step through the file, confirming each replacement manually.
Finally, restore that part, remove the (?<=....) one, and continue replacing with manual confirmation.
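
For anyone who wants to try the first pass on a scrap of text outside the editor, here is a rough Python sketch of it. It is not from the post above. It assumes Python's built-in re module, which has no \p{Ll}, so [a-z] stands in as an ASCII-only approximation (accented lowercase letters would be missed). The function name and the sample text are made up for illustration.

```python
import re

# ASCII stand-in for the \p{Ll} classes in the pattern above.
# Lookbehind: the break must follow a lowercase letter, comma or semicolon.
# Lookahead: the next paragraph must start with a lowercase letter.
FALSE_BREAK = re.compile(r'(?<=[a-z,;]) *</p>\s*<p class="[^"]*">(?=[a-z])')

def join_false_breaks(html):
    """Replace every suspected false paragraph break with one space."""
    return FALSE_BREAK.sub(" ", html)

sample = ('<p class="calibre7">The cat sat on the</p>\n\n'
          '<p class="calibre7">mat. It purred.</p>\n\n'
          '<p class="calibre7">Next paragraph.</p>')

fixed = join_false_breaks(sample)
print(fixed)
```

Only the first break is joined here. The second one follows a full stop and is followed by a capital, so both the lookbehind and the lookahead reject it, which is why this strictest form is the one suggested for replace-all.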
Old 10-21-2022, 12:55 AM   #7
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
I have explained this exact "broken paragraphs" question many times over the years.

Here are 4 such topics where I go step-by-step and break down the Regular Expressions:

I even discussed it way back in:

(Of course, my newer methods are better and fix more things, but those regex are still useful to see/learn from.)

Last edited by Tex2002ans; 10-21-2022 at 01:42 AM.
Old 10-21-2022, 10:25 AM   #8
enuddleyarbl
Guru
Posts: 734
Karma: 1077122
Join Date: Sep 2013
Device: Kobo Forma
Quote:
Originally Posted by Tex2002ans View Post
I have explained this exact "broken paragraphs" question many times over the years.

Here are 4 such topics where I go step-by-step and break down the Regular Expressions:
  • 2021: "Regex examples"
    • (Especially my Post #689+.)
...
From that 2021 post:
Quote:
Search: -</p>\s+<p>
Replace: <--- (Completely blank)

and:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')

and:

Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)
I don't know if Sigil is different from Calibre in this regard (the post is in the Sigil forum), but if the last search (for lowercase starting paragraphs) is saved in Calibre, make sure to check the "Case Sensitive" box.
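
A rough Python sketch of how those three searches behave, in case it helps to test them on a scrap of text first. The function name and the sample are invented for illustration. Python's re is case-sensitive unless told otherwise, so re.IGNORECASE is used here only to imitate leaving the "Case Sensitive" box unchecked; that mapping is my assumption, not something taken from either editor.

```python
import re

# Search 1: a hyphen right before a paragraph break (word split in two).
HYPHEN_BREAK = re.compile(r'-</p>\s+<p>')
# Search 2: a break not preceded by a tag end, closing quote or end punctuation.
# \u201d is the right double quotation mark from the quoted search.
OPEN_BREAK = re.compile(r'([^>\u201d?!.])</p>\s+<p>')
# Search 3: find only, paragraphs that start with a lowercase letter.
LOWER_START = re.compile(r'<p>[a-z]')
LOWER_START_LOOSE = re.compile(r'<p>[a-z]', re.IGNORECASE)

def repair(html):
    """Apply search 1 (replace with nothing), then search 2 (keep char, add space)."""
    html = HYPHEN_BREAK.sub('', html)
    return OPEN_BREAK.sub(r'\1 ', html)

sample = ('<p>A bro-</p>\n<p>ken word and a broken</p>\n'
          '<p>sentence.</p>\n<p>A real paragraph.</p>')
repaired = repair(sample)
print(repaired)

# After the repair no paragraph starts in lowercase, so the strict
# search finds nothing, while the case-insensitive one flags both.
strict_hits = LOWER_START.findall(repaired)
loose_hits = LOWER_START_LOOSE.findall(repaired)
```

The order matters in this sketch: running the hyphen search first rejoins the split word without a space, whereas the second search on its own would keep the hyphen and add a space after it.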

Also, a question for the first ("hyphen") rule. Most of the books I edit have a tendency to end interrupted paragraphs with a dash:
Code:
<p>"Here I am wal-</p>
<p>The monster leaped out and ate my face.</p>
Stylistically, should the dash be replaced with something else (to signify an interruption)? Or, just leave it?

EDIT: And, wouldn't the first rule be covered by the second? Are you just separating them to pull the hyphen issues out of the mass?
Old 10-21-2022, 01:07 PM   #9
Quoth
the rook, bossing Never.
Posts: 11,325
Karma: 85874895
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
For cut-off dialogue it should be an em dash with no space before it, followed by the closing quote, though that depends on the style guide. We use an en dash surrounded by spaces when it's parenthetical; like a comma, the second one is omitted at the end of a sentence, unlike (actual brackets). Many USA style guides use an em dash without spaces for parenthesis inside a sentence. An actual hyphen shouldn't be an en or em dash, and should only have spaces either side if it's a range.
Actual hyphens come in at least three kinds:
1) A word broken at a suitable point to wrap at the margin.
2) A word that is commonly written with a hyphen. Unless the hyphen separates a repeated vowel, such words tend to drop it over the years, so today used to be to-day.
3) Suffixed words, as in "His face was fish-like".

Obviously, when removing excess hyphens, types 2 & 3 must be left alone, and the era of the writing considered to see whether hyphens of type 2 are valid.
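
As a rough illustration of telling type 1 apart from types 2 and 3, here is a small Python sketch. Everything in it is an assumption for the example: the tiny word list is hypothetical (a real check would need a proper dictionary), and the function name is made up. It only drops a hyphen when the joined word is in the list.

```python
import re

# Hypothetical mini word list, a stand-in for a real dictionary.
KNOWN_WORDS = frozenset({"walking", "example", "paragraph"})

# Two runs of letters joined by a single hyphen.
HYPHENATED = re.compile(r'\b([A-Za-z]+)-([A-Za-z]+)\b')

def dehyphenate(text, known_words=KNOWN_WORDS):
    """Drop the hyphen only when the joined form is a known word."""
    def fix(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in known_words:
            return joined          # type 1: a wrapped word, rejoin it
        return match.group(0)      # types 2 and 3: leave untouched
    return HYPHENATED.sub(fix, text)

result = dehyphenate("He was walk-ing; his face was fish-like.")
print(result)
```

A modern dictionary would also contain "today", so a sketch like this would rewrite to-day in an older text. That is the era problem mentioned above, and it still needs a human decision.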
Old 10-21-2022, 03:00 PM   #10
Tex2002ans
Quote:
Originally Posted by enuddleyarbl View Post
I don't know if Sigil is different from Calibre in this regard (the post is in the Sigil forum), but if the last search (for lowercase starting paragraphs) is saved in Calibre, make sure to check the "Case Sensitive" box.
Yes, by default, Calibre's Regex is case insensitive (why that is, I'm not sure).

In Calibre, make sure you have that "Case Sensitive" box checked.

In Sigil, all you have to do is make sure you are in "Regex" mode.

Quote:
Originally Posted by enuddleyarbl View Post
Also, a question for the first ("hyphen") rule. Most of the books I edit have a tendency to end interrupted paragraphs with a dash:
Code:
<p>"Here I am wal-</p>
<p>The monster leaped out and ate my face.</p>
Stylistically, should the dash be replaced with something else (to signify an interruption)? Or, just leave it?
When conversations get cut off, the proper character to use is an:
  • — = EM DASH
    • (It is about the size of an 'm'.)

I recommend reading this great article in Wikipedia showing different examples:

and my many writings over the years:

- - -

In the case where you have a wrong/bad interruption:

Code:
<p>"Here I am wal-</p>
<p>The monster leaped out and ate my face.</p>
you would use this Regex:

Find: -</p>
Replace: —</p>

(Replace a HYPHEN at the end of a paragraph with an EM DASH.)

That will get you the correct:

Code:
<p>"Here I am wal—</p>
<p>The monster leaped out and ate my face.</p>
Note: DO NOT ever do a "Replace All" here, though; you have to decide these on a case-by-case basis.

- - -

Side Note: Actually, your example is off. Even though your dialogue got interrupted, you'd still need the close quote:

Code:
<p>“Here I am wal—”</p>
<p>The monster leaped out and ate my face.</p>
I'll let you figure out the Replace needed for that.
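
One possible way to script the case-by-case approach, as a rough Python sketch rather than a search-and-replace recipe. The function names are invented for illustration, and the idea of listing hits first and fixing them one at a time is my assumption about how to respect the "no Replace All" warning. The em dash and closing quote are written as escapes so the code survives any encoding.

```python
import re

EM_DASH = "\u2014"      # the em dash character
CLOSE_QUOTE = "\u201d"  # right double quotation mark

# A hyphen that is the last character of a paragraph.
TRAILING_HYPHEN = re.compile(r"-</p>")

def find_trailing_hyphens(html):
    """Return (offset, context) pairs so each hit can be judged by eye."""
    hits = []
    for m in TRAILING_HYPHEN.finditer(html):
        context = html[max(0, m.start() - 30):m.end()]
        hits.append((m.start(), context))
    return hits

def mark_cut_off(html, offset, in_dialogue=True):
    """Swap the hyphen at `offset` for an em dash, closing the quote if asked."""
    if html[offset] != "-":
        raise ValueError("no hyphen at that offset")
    tail = EM_DASH + (CLOSE_QUOTE if in_dialogue else "")
    return html[:offset] + tail + html[offset + 1:]

sample = ("<p>\u201cHere I am wal-</p>\n"
          "<p>The monster leaped out and ate my face.</p>")

offset, context = find_trailing_hyphens(sample)[0]
fixed = mark_cut_off(sample, offset)
print(fixed)
```

Each hit is fixed only after someone has looked at its context, so a genuinely split word (which should be rejoined instead) is never turned into interrupted dialogue by accident.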

Last edited by Tex2002ans; 10-21-2022 at 03:13 PM.