Old 11-02-2024, 08:05 AM   #1
_gKorg_
Removing the paragraph tags if a paragraph starts with lower case

Hi everyone. I have an old book a friend gave me, and the paragraphs are all messed up. I'm trying to clean it up, and it would be amazing if there were a function like "merge with the paragraph above if the current paragraph starts with a lowercase word". Is there something like this?
Otherwise, is there a way to edit directly in the preview without going into the HTML editor? It's easier to remove the space between paragraphs than tags...
Thank you for your help.
Old 11-02-2024, 10:00 AM   #2
lomkiri
Code:
Find : </p>\s*<p>(\p{Ll}.*?</p>)
Replace : \x20\1
Mode : Regex
with "dot all" and "case sensitive" checked 
(\x20 is a space)
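
For anyone who wants to try the idea outside the editor first, here is a minimal Python sketch of the same find and replace. It rests on a few assumptions that are not from the post: it uses the standard library's re module, which has no \p{Ll}, so the LOWER character class is only a Latin-1 stand-in; a literal space replaces \x20 because re does not accept \x20 in a replacement string; and the names merge_once and sample are made up for illustration.

```python
import re

# Stand-in for \p{Ll}: the standard library's re has no Unicode property
# classes, so this sketch only recognises ASCII and Latin-1 lowercase letters.
LOWER = "[a-zß-öø-ÿ]"

# Same shape as the Find expression; re.DOTALL plays the role of "dot all".
FIND = re.compile(r"</p>\s*<p>(" + LOWER + r".*?</p>)", re.DOTALL)


def merge_once(html):
    """Join a paragraph to the previous one when it starts in lower case."""
    # A literal space is used here where the editor's Replace field has \x20.
    return FIND.sub(r" \1", html)


sample = "<p>The rain had</p>\n<p>stopped by noon.</p>\n<p>Then we left.</p>"
print(merge_once(sample))
```

The first two paragraphs are joined with a single space, and the third is left alone because it starts with an upper case letter.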
Note: this merges only if the paragraph starts with a lowercase letter. It won't merge if the paragraph starts with a space or punctuation. To allow an optional space before the letter, use:
Code:
</p>\s*<p>(\s?\p{Ll}.*?</p>)
but you may then end up with two spaces before the first word of the merged paragraph. That is not a problem, because consecutive spaces are collapsed when the HTML is rendered, so it won't be visible in the text. (To avoid it, we could use a very simple regex function instead of the "<space>\1" replacement.)
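
A possible shape for that regex function, as a sketch only. The replace signature is the one the calibre manual gives for function mode in the editor's Search & replace; the body is my own guess at what "very simple" could look like. The FIND pattern and the merge helper exist only so the function can be tried outside calibre, and [a-z] is an ASCII-only stand-in for \p{Ll}.

```python
import re


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # Group 1 holds the optional leading space, the paragraph text and its </p>.
    # Stripping leading whitespace before adding one space avoids the double space.
    return " " + match.group(1).lstrip()


# Stand-alone check outside calibre (hypothetical harness, not part of calibre).
FIND = re.compile(r"</p>\s*<p>(\s?[a-z].*?</p>)", re.DOTALL)


def merge(html):
    # The extra arguments are dummies; this function only looks at the match.
    return FIND.sub(lambda m: replace(m, 1, "", None, None, None, None), html)


print(merge("<p>It was</p>\n<p> late.</p>"))
```

Inside calibre, only the replace function would go into the function editor; the Find expression stays the one given above.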

Last edited by lomkiri; 11-02-2024 at 04:28 PM.
Old 11-02-2024, 12:31 PM   #3
mikapanja
Quote:
Originally Posted by lomkiri View Post
Code:
Find : </p>\s*<p>(\p{l}.*?</p>)
This works if the next paragraph starts with a simple <p>. What would be a relevant regex if it starts with a named class?
Old 11-02-2024, 12:47 PM   #4
lomkiri
Quote:
Originally Posted by mikapanja View Post
This works if the next paragraph starts with a simple <p>. What would be a relevant regex if it starts with a named class?
You're perfectly right. I only gave the general idea; it is then easy to adapt the regex.
This one also matches a <p> with a class (or any other attribute):
Code:
</p>\s*<p[^>]*>(\p{Ll}.*?</p>)
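
A quick Python sketch of how the attribute-aware pattern behaves, under the same assumptions as any test outside calibre: the stdlib re module, with [a-z] as an ASCII-only stand-in for \p{Ll}. One thing worth knowing is that <p[^>]*> also matches tags such as <pre>, because "re" counts as "anything but >". The STRICT variant with \b is my own suggestion, not something from the post.

```python
import re

# [a-z] is an ASCII-only stand-in for \p{Ll}, which the stdlib re lacks.
LOOSE = re.compile(r"</p>\s*<p[^>]*>([a-z].*?</p>)", re.DOTALL)
# Hypothetical stricter variant: \b stops "<p" from also matching "<pre ...>".
STRICT = re.compile(r"</p>\s*<p\b[^>]*>([a-z].*?</p>)", re.DOTALL)

with_class = '<p class="first">The rain had</p>\n<p class="body">stopped.</p>'
with_pre = "<p>See below</p>\n<pre>code here</pre><p>Next.</p>"

print(LOOSE.sub(r" \1", with_class))        # paragraphs with classes are merged
print(LOOSE.search(with_pre) is not None)   # the loose pattern also hits <pre>
print(STRICT.search(with_pre) is not None)  # the stricter one does not
```

If the book contains no <pre> or similar tags after a paragraph, the pattern as posted is enough.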

Last edited by lomkiri; 11-02-2024 at 01:27 PM.
Old 11-02-2024, 01:48 PM   #5
lomkiri
I realized that if several consecutive paragraphs all begin with a lowercase letter, my regex catches only every other one. The match consumes the closing </p>, so the search resumes after it; the regex can no longer match the paragraph that follows and only finds the one after that, leaving one unchanged. You would then need to run several passes to catch every paragraph in the sequence (not a big deal, but inelegant).
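
The effect is easy to reproduce in a small Python sketch. The assumptions: the stdlib re module, [a-z] as an ASCII-only stand-in for \p{Ll}, and made-up names (CONSUMING, passes_needed, sample).

```python
import re

# The pattern that captures the closing </p>; [a-z] stands in for \p{Ll}.
CONSUMING = re.compile(r"</p>\s*<p[^>]*>([a-z].*?</p>)", re.DOTALL)


def passes_needed(html):
    """Repeat the substitution until nothing changes; return text and pass count."""
    passes = 0
    while True:
        merged, hits = CONSUMING.subn(r" \1", html)
        if hits == 0:
            return html, passes
        html, passes = merged, passes + 1


sample = "<p>One</p><p>two</p><p>three</p><p>four</p>"
print(CONSUMING.sub(r" \1", sample))  # one pass leaves "three" unmerged
print(passes_needed(sample))          # text after repeating, and passes used
```

A single pass gives <p>One two</p><p>three four</p>; a second pass is needed to join everything into one paragraph.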

This is easily solved if we don't capture the last </p> but use a positive lookahead for it instead. The search position then stops before the </p>, and the regex is ready to match the next paragraph if it is a candidate.

With this regex, all paragraphs are caught in the first pass:
Code:
</p>\s*<p[^>]*>(\p{Ll}.*?)(?=</p>)
or, if we also want to target paragraphs starting with <space><lowercase>:
Code:
</p>\s*<p[^>]*>(\s?\p{Ll}.*?)(?=</p>)
Replace is still the same: \x20\1
(\x20 is a space)
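
Here is a short Python sketch of the lookahead version, to show that a run of lowercase paragraphs is joined in one pass. The assumptions: the stdlib re module, [a-z] as an ASCII-only stand-in for \p{Ll}, a literal space in place of \x20, and made-up names (LOOKAHEAD, sample).

```python
import re

# The closing </p> is only looked at, not consumed, so the next match can
# start from it. [a-z] stands in for \p{Ll}.
LOOKAHEAD = re.compile(r"</p>\s*<p[^>]*>([a-z].*?)(?=</p>)", re.DOTALL)

sample = "<p>One</p><p>two</p><p>three</p><p>four</p><p>Five</p>"

# subn returns the new text together with the number of replacements made.
merged, hits = LOOKAHEAD.subn(r" \1", sample)
print(merged)
print(hits)
```

All three lowercase paragraphs are merged into the first one in a single pass, and the paragraph starting with "Five" is left untouched.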

Last edited by lomkiri; 11-02-2024 at 04:26 PM.
Old 11-02-2024, 07:06 PM   #6
mikapanja
Thanks, lomkiri!
Old 11-07-2024, 12:14 PM   #7
Tex2002ans
Quote:
Originally Posted by _gKorg_ View Post
Hi everyone.
Hey, welcome to MobileRead!

Quote:
Originally Posted by _gKorg_ View Post
I have an old book a friend gave me, and the paragraphs are all messed up. I'm trying to clean it up, and it would be amazing if there were a function like "merge with the paragraph above if the current paragraph starts with a lowercase word". Is there something like this?
Yes:

and in this thread, specifically, I showed the exact 3 regex I've been using for 12+ years:

- - -

If you want a more "GUI-friendly way" of doing things—and you're more familiar with Word or LibreOffice—I described a lot of this stuff in:

Of course, using Sigil or Calibre and fixing it directly in the code is the best + SUPER quick (if you use those 3 regexes, that'll take care of 99% cases in a single shot!).

Trying to accomplish the same thing in Word/LibreOffice is clunky/limiting, and would take a lot more work.

Last edited by Tex2002ans; 11-07-2024 at 12:19 PM.