MobileRead Forums > E-Book Software > Sigil

Old 02-03-2020, 05:42 AM   #1
Pippo53s03
Member
Posts: 19
Karma: 10
Join Date: Apr 2015
Device: Kobo Aura H2O 2ed
Too long Paragraphs

Can you suggest a regular expression for truncating paragraphs longer than 1000 characters? It doesn't matter whether blank spaces are included in the count or not.
Thanks!
Old 02-03-2020, 08:56 AM   #2
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
If the text is wrapped in paragraph tags, the following quick & dirty regex should work:

Search for:
Code:
<p>(.{999}).*?</p>

Replace with:

Code:
<p>\1…</p>
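
For readers who want to try the truncation outside Sigil, here is a minimal Python sketch of the same search and replace. The function name `truncate_paragraphs` and the `keep` parameter are made up for illustration; only the pattern and the replacement come from the post above. Two caveats of the quick & dirty approach apply here too: `.` does not match a line break by default, so a paragraph that spans several lines is skipped, and if several paragraphs sit on one line the match can run across a `</p><p>` boundary.

```python
import re


def truncate_paragraphs(html: str, keep: int = 999) -> str:
    """Cut every single-line <p>...</p> with at least `keep` characters
    down to `keep` characters and append an ellipsis.

    `keep` is a hypothetical parameter; the post hard-codes 999.
    """
    # Same pattern as in the post, with the count made configurable.
    pattern = re.compile(r"<p>(.{%d}).*?</p>" % keep)
    # \1 is the kept text; the ellipsis marks the cut.
    return pattern.sub(r"<p>\1…</p>", html)


# Small demonstration with a tiny limit so the effect is visible.
short_par = "<p>" + "a" * 10 + "</p>"
long_par = "<p>" + "b" * 30 + "</p>"
print(truncate_paragraphs(short_par + "\n" + long_par, keep=20))
```

The short paragraph is left alone because the 20-character run cannot cross the line break that follows it.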
Old 02-03-2020, 09:19 AM   #3
RbnJrg
Wizard
Posts: 1,542
Karma: 6613969
Join Date: Mar 2013
Location: Rosario - Santa Fe - Argentina
Device: Kindle 4 NT
Quote:
Originally Posted by Pippo53s03 View Post
Can you suggest a regular expression for truncating paragraphs longer than 1000 characters? It doesn't matter whether blank spaces are included in the count or not.
Thanks!
Hmmm, I'm not sure but maybe this can help you:

Find: (.{999})([^ ]+)(\s)(.)
Replace: \1\2</p>\n\n<p>\4

The above matches any string of at least 1000 chars: 999 chars plus a few more, up to the next space (because I suppose you won't want to split the paragraph in the middle of a word). But first you must select the text where you want to do the S&R (otherwise the regex will also match in the header of the .xhtml file), or else leave the "wrap" option unselected, place the cursor after the body tag and do the S&R in the current file.

EDIT: If you want to match 1000 chars plus a few more, up to the next ". " (which would mean the end of a sentence), then use the following:

Find: (.{999})([^ ]+)(\.\s)(.)
Replace: \1\2.</p>\n\n<p>\4
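
The sentence-end variant can be tried in Python as well. This is only a sketch under stated assumptions: the function name `split_long_paragraphs` and the `min_len` parameter are hypothetical, and Python's `re` module is used instead of Sigil's own search dialog. The pattern and replacement are the ones given above. Note that the match does not have to begin at the start of the paragraph: the engine slides forward until the character that follows the 999-char run sits in a word that ends with ". ", which is why the split lands on the first sentence end at or past the limit.

```python
import re


def split_long_paragraphs(text: str, min_len: int = 999) -> str:
    """Insert a paragraph break at the first sentence end that lies
    at least `min_len` characters after the previous break.

    `min_len` is a hypothetical parameter; the post hard-codes 999.
    """
    # (.{n})   exactly n characters on one line
    # ([^ ]+)  the rest of the current word
    # (\.\s)   a full stop followed by whitespace
    # (.)      the first character of the next sentence
    pattern = re.compile(r"(.{%d})([^ ]+)(\.\s)(.)" % min_len)
    # Close the paragraph after the full stop and open a new one.
    return pattern.sub(r"\1\2.</p>\n\n<p>\4", text)


sample = ("<p>One two three four five six. Seven eight nine ten "
          "eleven twelve. Last one here.</p>")
print(split_long_paragraphs(sample, min_len=20))
```

With the tiny limit of 20 used here, each of the first two sentences becomes its own paragraph, and the short tail stays as it is.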

Last edited by RbnJrg; 02-03-2020 at 09:34 AM.
Old 02-03-2020, 09:56 AM   #4
Pippo53s03
Member
Posts: 19
Karma: 10
Join Date: Apr 2015
Device: Kobo Aura H2O 2ed
Quote:
Find: (.{999})([^ ]+)(\.\s)(.)
Replace: \1\2.</p>\n\n<p>\4
The last regular expression is MAGIC!
Thank you very much.
Old 02-03-2020, 09:59 AM   #5
JSWolf
Resident Curmudgeon
Posts: 73,970
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Why do you want to break up long paragraphs? That spoils the book. Makes it look like a bad PDF conversion.
Old 02-03-2020, 10:56 AM   #6
RbnJrg
Wizard
Posts: 1,542
Karma: 6613969
Join Date: Mar 2013
Location: Rosario - Santa Fe - Argentina
Device: Kindle 4 NT
Quote:
Originally Posted by Pippo53s03 View Post
Last Regular expression is MAGIC!
Thank you very much.
You are welcome
Old 02-03-2020, 10:57 AM   #7
DiapDealer
Grand Sorcerer
Posts: 27,549
Karma: 193191846
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by JSWolf View Post
Why do you want to break up long paragraphs? That spoils the book. Makes it look like a bad PDF conversion.
Not relevant, Jon. Please answer questions or stay out of the conversation.
Old 02-03-2020, 01:51 PM   #8
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by JSWolf View Post
Why do you want to break up long paragraphs?
I used something similar to find very long quotations within paragraphs.

The author didn't use many blockquotes, so I looked for an opening quote + X many characters (400-800+) until a closing quote:

Search: “([^”<]{800,})”
Replace: </p> <blockquote><p>\1</p></blockquote> <p>

Then I was able to easily replace:

Code:
<p>Paragraph with “a super duper [...] long quotation” in the middle.</p>
with:

Code:
<p>Paragraph with</p>
<blockquote><p>a super duper [...] long quotation</p></blockquote>
<p>in the middle.</p>
(Certain style guides, such as the CMOS, require quotations of X or more words/lines to be reformatted as blockquotes.)
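
The quotation search can be sketched in Python in the same way. The function name `quotes_to_blockquotes` and the `min_len` parameter are assumptions made for this example; the pattern and the replacement are taken from the post. Excluding `<` from the character class keeps a match from running across a tag, so a quotation that contains inline markup is left untouched.

```python
import re


def quotes_to_blockquotes(html: str, min_len: int = 800) -> str:
    """Turn curly-quoted passages of at least `min_len` characters
    into their own blockquote, closing and reopening the paragraph.

    `min_len` is a hypothetical parameter; the post uses 800.
    """
    # Opening quote, min_len or more chars that are neither a closing
    # quote nor the start of a tag, then the closing quote.
    pattern = re.compile(r"“([^”<]{%d,})”" % min_len)
    return pattern.sub(r"</p> <blockquote><p>\1</p></blockquote> <p>", html)


sample = "<p>Paragraph with “a super duper long quotation” in the middle.</p>"
print(quotes_to_blockquotes(sample, min_len=10))
```

The result keeps the stray spaces around the moved quotation, so a follow-up clean-up pass may still be wanted.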
Old 02-03-2020, 02:35 PM   #9
Notjohn
mostly an observer
Posts: 1,515
Karma: 987654
Join Date: Dec 2012
Device: Kindle
Soon after I began publishing through Amazon's DTP/KDP in November 2007, I realized that paragraphs in an ebook had to be much shorter than those in a print edition. So I began to use my right-hand pinkie much more industriously, ensuring that a paragraph break occurred at least once on every digital "page". Pippo won't have to use regex on me!
Old 02-03-2020, 07:23 PM   #10
Brett Merkey
Not Quite Dead
Posts: 194
Karma: 654170
Join Date: Jul 2015
Device: Paperwhite 4; Galaxy Tab
@RbnJrg

Very nice regex. A keeper. I tested it in Calibre with 300 chars instead of 999 and the result looked much nicer than I had expected. I expected all the paragraphs to appear too similar in length, which is not cool. However, (esp. with "Dot All") the number of screen lines per paragraph varied in a nice way—I am not sure why.

I encounter lots of books (history and the sciences) by learned people who do not believe in paragraphs, among their many e-book formatting crimes. Your regex will be added to my clean-up stack where breaking up a mass of text has benefits even tho some para breaks may not be precisely correct in terms of conventions.

Last edited by Brett Merkey; 02-03-2020 at 07:30 PM.
Old 02-04-2020, 06:17 AM   #11
RbnJrg
Wizard
Posts: 1,542
Karma: 6613969
Join Date: Mar 2013
Location: Rosario - Santa Fe - Argentina
Device: Kindle 4 NT
Quote:
Originally Posted by Brett Merkey View Post
@RbnJrg

Very nice regex. A keeper.
Thank you, glad that the code was able to help you too.


Quote:
I tested it in Calibre with 300 chars instead of 999 and the result looked much nicer than I had expected. I expected all the paragraphs to appear too similar in length, which is not cool. However, (esp. with "Dot All") the number of screen lines per paragraph varied in a nice way—I am not sure why.
Keep in mind that the regex does not split the paragraphs at exactly 300 chars (your case) but a "few" chars later (at the end of the sentence). So, in some cases, those "few" extra chars could be 30-50 letters.

The place where the previous split was made also affects the output. Suppose that, after a split, the amount of letters left before the next </p> is 150. Then the next paragraph will end up with 150 chars (more or less) instead of 300. For that reason the paragraphs do not all have the same length.
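
The effect described above can be checked with a small Python sketch. Everything except the pattern and replacement is an assumption for illustration: the helper names, the synthetic "lorem" sentences and the limit of 100 are made up. It splits one long single-line paragraph and reports the length of each resulting piece.

```python
import re


def split_at_sentence_end(text: str, min_len: int = 300) -> str:
    # Same find/replace pair as earlier in the thread; min_len is hypothetical.
    pattern = re.compile(r"(.{%d})([^ ]+)(\.\s)(.)" % min_len)
    return pattern.sub(r"\1\2.</p>\n\n<p>\4", text)


def piece_lengths(text: str, min_len: int = 300) -> list[int]:
    # Split on the inserted paragraph boundary and measure each piece.
    pieces = split_at_sentence_end(text, min_len).split("</p>\n\n<p>")
    return [len(piece) for piece in pieces]


# Made-up test data: 40 sentences of 3 to 7 words each, on one line.
sentences = [" ".join(["lorem"] * (3 + i % 5)) + "." for i in range(40)]
paragraph = " ".join(sentences)

print(piece_lengths(paragraph, min_len=100))
```

Every piece except the last one is longer than the limit, by however many characters it takes to reach the next sentence end, and the last piece is simply whatever is left over.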