Too long Paragraphs

Pippo53s03 · 02-03-2020, 05:42 AM

Can you suggest a regular expression for truncating paragraphs longer than 1000 characters? It doesn't matter whether blank spaces are included in the count or not.
Thanks!

Doitsu · 02-03-2020, 08:56 AM

If the text is wrapped in paragraph tags, the following quick & dirty regex should work:

Search for:

Code:

<p>(.{999}).*?</p>

Replace with:

Code:

<p>\1…</p>
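
For anyone scripting this outside an editor, a minimal Python sketch of the same search and replace might look like the following. The function name `truncate_paragraphs` and the `limit` parameter are illustrative assumptions, not part of the post above. It also assumes bare `<p>` tags without attributes and one paragraph per line, since `.` does not match a newline by default.

```python
import re

def truncate_paragraphs(html: str, limit: int = 1000) -> str:
    """Cut <p> paragraphs down to limit - 1 characters plus an ellipsis.

    Hypothetical helper: assumes bare <p> tags (no attributes) and one
    paragraph per line, because "." stops at a newline without re.DOTALL.
    """
    # Same shape as the posted pattern: <p>(.{999}).*?</p>
    pattern = re.compile(r"<p>(.{%d}).*?</p>" % (limit - 1))
    # \1 is the kept text; everything after it up to </p> is dropped.
    return pattern.sub(r"<p>\1…</p>", html)
```

Note that a paragraph of exactly `limit - 1` characters also matches and gets the ellipsis appended, and that inline tags count as characters, which is part of what makes this quick and dirty.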

RbnJrg · 02-03-2020, 09:19 AM

Quote:

Originally Posted by Pippo53s03

Can you suggest a regular expression for truncating paragraphs longer than 1000 characters? It doesn't matter whether blank spaces are included in the count or not.
Thanks!

Hmmm, I'm not sure but maybe this can help you:

Find: (.{999})([^ ]+)(\s)(.)
Replace: \1\2\n\n\4

The above will find any string of 1000 chars plus a few more chars until it finds a space (because I suppose you won't want to split the paragraph in the middle of a word). But first you must select the text where you want to do the S&R (otherwise the regex will also act on the header of the .xhtml file), or else don't select the "wrap" option, set the pointer after the body tag and do the S&R in the current file.

EDIT: If you want to find any string of 1000 chars plus a few more chars until it finds a ". " (which would mean the end of a sentence), then use the following:

Find: (.{999})([^ ]+)(\.\s)(.)
Replace: \1\2.\n\n\4
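
As a rough illustration, both Find/Replace pairs above can be tried in Python with a small limit. This is only a sketch: the function name `split_long_text` and its parameters are hypothetical, the input is treated as plain text (the replacement inserts a blank line, not `</p><p>` tags), and `re.DOTALL` is an assumed setting so that `.` also counts line breaks.

```python
import re

def split_long_text(text: str, limit: int = 1000, at_sentence_end: bool = True) -> str:
    """Insert a blank line roughly every `limit` characters (hypothetical helper)."""
    # (.{limit-1}) is a fixed run, ([^ ]+) finishes the current word,
    # then either ". " (sentence end) or any whitespace must follow.
    boundary = r"(\.\s)" if at_sentence_end else r"(\s)"
    pattern = re.compile(r"(.{%d})([^ ]+)%s(.)" % (limit - 1, boundary), re.DOTALL)
    # Group 3 is replaced by the break; group 4 starts the next block.
    replacement = r"\1\2.\n\n\4" if at_sentence_end else r"\1\2\n\n\4"
    return pattern.sub(replacement, text)
```

Calling `split_long_text(text, limit=300)` would correspond to swapping 999 for 299 in the patterns above.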

Pippo53s03 · 02-03-2020, 09:56 AM

Quote:

Find: (.{999})([^ ]+)(\.\s)(.)
Replace: \1\2.\n\n\4

Last Regular expression is MAGIC!

Thank you very much.

JSWolf · 02-03-2020, 09:59 AM

Why do you want to break up long paragraphs? That spoils the book. Makes it look like a bad PDF conversion.

RbnJrg · 02-03-2020, 10:56 AM

Quote:

Originally Posted by Pippo53s03

Last Regular expression is MAGIC!

Thank you very much.

You are welcome

DiapDealer · 02-03-2020, 10:57 AM

Quote:

Originally Posted by JSWolf

Why do you want to break up long paragraphs? That spoils the book. Makes it look like a bad PDF conversion.

Not relevant, Jon. Please answer questions or stay out of the conversation.

Tex2002ans · 02-03-2020, 01:51 PM

Quote:

Originally Posted by JSWolf

Why do you want to break up long paragraphs?

I used something similar to find very long quotations within paragraphs.

The author didn't use many blockquotes, so I looked for an opening quote + X many characters (400-800+) until a closing quote:

Search: “([^”<]{800,})”
Replace: <blockquote>\1</blockquote> 

Then I was able to easily replace:

Code:

<p>Paragraph with “a super duper [...] long quotation” in the middle.</p>

with:

Code:

<p>Paragraph with</p>
<blockquote><p>a super duper [...] long quotation</p></blockquote>
<p>in the middle.</p>

(Certain style guides, like the CMOS, require quotations of X or more words/lines to be reformatted as blockquotes.)
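
The same search can be sketched in Python, for example to list the long quotations before deciding what to change. The helper names `find_long_quotes` and `wrap_long_quotes` and the `min_len` parameter are assumptions made for illustration. The pattern is the one from the post, with 800 made adjustable.

```python
import re

def find_long_quotes(html: str, min_len: int = 800) -> list:
    """Return the text of every curly-quoted run of at least min_len characters."""
    # [^”<] stops at the closing quote or at the next tag, as in the post.
    return re.findall("“([^”<]{%d,})”" % min_len, html)

def wrap_long_quotes(html: str, min_len: int = 800) -> str:
    """Apply the posted replacement: wrap each long quotation in <blockquote>."""
    return re.sub("“([^”<]{%d,})”" % min_len, r"<blockquote>\1</blockquote>", html)
```

The wrapped result still sits inside the surrounding `<p>`, so the restructuring into separate paragraphs shown in the post remains a separate step.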

Notjohn · 02-03-2020, 02:35 PM

Soon after I began publishing through Amazon's DTP/KDP in November 2007, I realized that paragraphs in an ebook had to be much shorter than those in a print edition. So I began to use my right-hand pinkie much more industriously, ensuring that the paragraphs occurred at least once on every digital "page". Pippo won't have to use regex on me!

Brett Merkey · 02-03-2020, 07:23 PM

@RbnJrg

Very nice regex. A keeper. I tested it in Calibre with 300 chars instead of 999 and the result looked much nicer than I had expected. I expected all the paragraphs to appear too similar in length, which is not cool. However, (esp. with "Dot All") the number of screen lines per paragraph varied in a nice way—I am not sure why.

I encounter lots of books (history and the sciences) by learned people who do not believe in paragraphs, among their many e-book formatting crimes. Your regex will be added to my clean-up stack where breaking up a mass of text has benefits even tho some para breaks may not be precisely correct in terms of conventions.

RbnJrg · 02-04-2020, 06:17 AM

Quote:

Originally Posted by Brett Merkey

@RbnJrg

Very nice regex. A keeper.

Thank you, glad that the code was able to help you too.

Quote:

I tested it in Calibre with 300 chars instead of 999 and the result looked much nicer than I had expected. I expected all the paragraphs to appear too similar in length, which is not cool. However, (esp. with "Dot All") the number of screen lines per paragraph varied in a nice way—I am not sure why.

Take into account that the regex will split the paragraphs not at exactly 300 chars (your case) but a "few" chars later (once it finds the end of a sentence). So, in some cases, those "few" extra chars could be 30-50 more letters.

The place where the previous split was made also affects the output. Suppose that, after a split, only 150 letters remain before the next paragraph break. Then the next paragraph would end up with about 150 chars (more or less) instead of 300. For that reason the paragraphs don't all have the same length.
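
The effect of the sentence-end search on block length can be checked with a toy example. This is a sketch under assumptions: `paragraph_lengths` is a hypothetical helper, the sample text is made up, the limit is shrunk to 10 so the result is easy to inspect, and `re.DOTALL` is assumed.

```python
import re

def paragraph_lengths(text: str, limit: int) -> list:
    """Split with the sentence-end pattern and return each block's length."""
    pattern = re.compile(r"(.{%d})([^ ]+)(\.\s)(.)" % (limit - 1), re.DOTALL)
    split_text = pattern.sub(r"\1\2.\n\n\4", text)
    # Blocks are separated by the blank line the replacement inserted.
    return [len(block) for block in split_text.split("\n\n")]

sample = "One two three four. Five six. Seven eight nine ten eleven. Twelve."
print(paragraph_lengths(sample, limit=10))
```

The blocks come out at different lengths even though the limit is fixed. Each split lands on the first sentence end the pattern can reach, and the next count starts from there. A sentence end that falls inside the fixed run right after a split (here, the one after "six") cannot be used at all.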



