Register Guidelines E-Books Today's Posts Search

Go Back   MobileRead Forums > E-Book Software > Calibre > Editor

Notices

Reply
 
Thread Tools Search this Thread
Old 06-20-2015, 08:16 PM   #1
phossler
Wizard
phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.phossler ought to be getting tired of karma fortunes by now.
 
Posts: 1,087
Karma: 447222
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
epub with no <p>'s

Recently downloaded a epub report (I think) that might have started life as a .txt file (maybe). Example in the spoiler

Problem 1. There are no <p>...</p> tags, but just text 'paragraphs' separated by <br/>.

Problem 2. Sometimes there is <blockquote> around a block of text, but no <p> tags

Problem 3. Sometimes there are large 'blank' sections that are various combinations of PARAGRAPH SEPERATOR + SPACE or SPACE + PARAGRAPH SEPERATOR or PARAGRAPH SEPERATOR + PARAGRAPH SEPERATOR + PARAGRAPH SEPERATOR + etc. [Beautify] and [Fix HTML] don't clean them up, but a Find: \n{2,} Replace: \n does catch a lot, but leaves the spaces

Are there (probably) 3 RegEx's that will

1. Put <p> tags around the blocks of text that don't have any? I can delete the <br/>'s
2. Put <p> tags around the blocks of text inside <blockquotes>'s that don't have any? I can delete the <blockquote>'s manually if I need to
3. Remove the spaces that are outside <p>'s?


Spoiler:

Code:
<body>

<br/>
  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus hendrerit. Pellentesque aliquet nibh nec urna. In nisi neque, aliquet vel, dapibus id, mattis vel, nisi. Sed pretium, ligula sollicitudin laoreet viverra, tortor libero sodales leo, eget blandit nunc tortor eu nibh. Nullam mollis. Ut justo. Suspendisse potenti.

<br/>

 
 

 







Sed egestas, ante et vulputate volutpat, eros pede semper est, vitae luctus metus libero eu augue. Morbi purus libero, faucibus adipiscing, commodo quis, gravida id, est. Sed lectus. Praesent elementum hendrerit tortor. Sed semper lorem at felis. Vestibulum volutpat, lacus a ultrices sagittis, mi neque euismod dui, eu pulvinar nunc sapien ornare nisl. Phasellus pede arcu, dapibus eu, fermentum et, dapibus sed, urna.

<br/>

Morbi interdum mollis sapien. Sed ac risus. Phasellus lacinia, magna a ullamcorper laoreet, lectus arcu pulvinar risus, vitae facilisis libero dolor a purus. Sed vel lacus. Mauris nibh felis, adipiscing varius, adipiscing in, lacinia vel, tellus. Suspendisse ac urna. Etiam pellentesque mauris ut lectus. Nunc tellus ante, mattis eget, gravida vitae, ultricies ac, leo. Integer leo pede, ornare a, lacinia eu, vulputate vel, nisl.

<br/>

Suspendisse mauris. Fusce accumsan mollis eros. Pellentesque a diam sit amet mi ullamcorper vehicula. Integer adipiscing risus a sem. Nullam quis massa sit amet nibh viverra malesuada. Nunc sem lacus, accumsan quis, faucibus non, congue vel, arcu. Ut scelerisque hendrerit tellus. Integer sagittis. Vivamus a mauris eget arcu gravida tristique. Nunc iaculis mi in ante. Vivamus imperdiet nibh feugiat est.

<br/>

<blockquote>Ut convallis, sem sit amet interdum consectetuer, odio augue aliquam leo, nec dapibus tortor nibh sed augue. Integer eu magna sit amet metus fermentum posuere. Morbi sit amet nulla sed dolor elementum imperdiet. Quisque fermentum. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Pellentesque adipiscing eros ut libero. Ut condimentum mi vel tellus. Suspendisse laoreet. Fusce ut est sed dolor gravida convallis. Morbi vitae ante. Vivamus ultrices luctus nunc. Suspendisse et dolor. Etiam dignissim. Proin malesuada adipiscing lacus. Donec metus. Curabitur gravida</blockquote>.

<br/>

</body>


Thanks
phossler is offline   Reply With Quote
Old 06-21-2015, 11:26 AM   #2
gbm
Wizard
gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.gbm ought to be getting tired of karma fortunes by now.
 
Posts: 2,181
Karma: 8888888
Join Date: Jun 2010
Device: Kobo Clara HD,Hisence Sero 7 Pro RIP, Nook STR, jetbook lite
Quote:
Originally Posted by phossler View Post
Recently downloaded a epub report (I think) that might have started life as a .txt file (maybe). Example in the spoiler

Problem 1. There are no <p>...</p> tags, but just text 'paragraphs' separated by <br/>.

Problem 2. Sometimes there is <blockquote> around a block of text, but no <p> tags

Problem 3. Sometimes there are large 'blank' sections that are various combinations of PARAGRAPH SEPERATOR + SPACE or SPACE + PARAGRAPH SEPERATOR or PARAGRAPH SEPERATOR + PARAGRAPH SEPERATOR + PARAGRAPH SEPERATOR + etc. [Beautify] and [Fix HTML] don't clean them up, but a Find: \n{2,} Replace: \n does catch a lot, but leaves the spaces

Are there (probably) 3 RegEx's that will

1. Put <p> tags around the blocks of text that don't have any? I can delete the <br/>'s
2. Put <p> tags around the blocks of text inside <blockquotes>'s that don't have any? I can delete the <blockquote>'s manually if I need to
3. Remove the spaces that are outside <p>'s?


Spoiler:

Code:
<body>

<br/>
  Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus hendrerit. Pellentesque aliquet nibh nec urna. In nisi neque, aliquet vel, dapibus id, mattis vel, nisi. Sed pretium, ligula sollicitudin laoreet viverra, tortor libero sodales leo, eget blandit nunc tortor eu nibh. Nullam mollis. Ut justo. Suspendisse potenti.

<br/>

 
 

 







Sed egestas, ante et vulputate volutpat, eros pede semper est, vitae luctus metus libero eu augue. Morbi purus libero, faucibus adipiscing, commodo quis, gravida id, est. Sed lectus. Praesent elementum hendrerit tortor. Sed semper lorem at felis. Vestibulum volutpat, lacus a ultrices sagittis, mi neque euismod dui, eu pulvinar nunc sapien ornare nisl. Phasellus pede arcu, dapibus eu, fermentum et, dapibus sed, urna.

<br/>

Morbi interdum mollis sapien. Sed ac risus. Phasellus lacinia, magna a ullamcorper laoreet, lectus arcu pulvinar risus, vitae facilisis libero dolor a purus. Sed vel lacus. Mauris nibh felis, adipiscing varius, adipiscing in, lacinia vel, tellus. Suspendisse ac urna. Etiam pellentesque mauris ut lectus. Nunc tellus ante, mattis eget, gravida vitae, ultricies ac, leo. Integer leo pede, ornare a, lacinia eu, vulputate vel, nisl.

<br/>

Suspendisse mauris. Fusce accumsan mollis eros. Pellentesque a diam sit amet mi ullamcorper vehicula. Integer adipiscing risus a sem. Nullam quis massa sit amet nibh viverra malesuada. Nunc sem lacus, accumsan quis, faucibus non, congue vel, arcu. Ut scelerisque hendrerit tellus. Integer sagittis. Vivamus a mauris eget arcu gravida tristique. Nunc iaculis mi in ante. Vivamus imperdiet nibh feugiat est.

<br/>

<blockquote>Ut convallis, sem sit amet interdum consectetuer, odio augue aliquam leo, nec dapibus tortor nibh sed augue. Integer eu magna sit amet metus fermentum posuere. Morbi sit amet nulla sed dolor elementum imperdiet. Quisque fermentum. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Pellentesque adipiscing eros ut libero. Ut condimentum mi vel tellus. Suspendisse laoreet. Fusce ut est sed dolor gravida convallis. Morbi vitae ante. Vivamus ultrices luctus nunc. Suspendisse et dolor. Etiam dignissim. Proin malesuada adipiscing lacus. Donec metus. Curabitur gravida</blockquote>.

<br/>

</body>


Thanks
I used this:
Code:
find: <br/>\s*(.*?)\s*<
replace: <p>\1</p>\n<
Note: May take more than one pass to find all <br/>'s.
To get this:
Spoiler:
Code:
<body>

<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus hendrerit. Pellentesque aliquet nibh nec urna. In nisi neque, aliquet vel, dapibus id, mattis vel, nisi. Sed pretium, ligula sollicitudin laoreet viverra, tortor libero sodales leo, eget blandit nunc tortor eu nibh. Nullam mollis. Ut justo. Suspendisse potenti.</p>
<p>Sed egestas, ante et vulputate volutpat, eros pede semper est, vitae luctus metus libero eu augue. Morbi purus libero, faucibus adipiscing, commodo quis, gravida id, est. Sed lectus. Praesent elementum hendrerit tortor. Sed semper lorem at felis. Vestibulum volutpat, lacus a ultrices sagittis, mi neque euismod dui, eu pulvinar nunc sapien ornare nisl. Phasellus pede arcu, dapibus eu, fermentum et, dapibus sed, urna.</p>
<p>Morbi interdum mollis sapien. Sed ac risus. Phasellus lacinia, magna a ullamcorper laoreet, lectus arcu pulvinar risus, vitae facilisis libero dolor a purus. Sed vel lacus. Mauris nibh felis, adipiscing varius, adipiscing in, lacinia vel, tellus. Suspendisse ac urna. Etiam pellentesque mauris ut lectus. Nunc tellus ante, mattis eget, gravida vitae, ultricies ac, leo. Integer leo pede, ornare a, lacinia eu, vulputate vel, nisl.</p>
<p>Suspendisse mauris. Fusce accumsan mollis eros. Pellentesque a diam sit amet mi ullamcorper vehicula. Integer adipiscing risus a sem. Nullam quis massa sit amet nibh viverra malesuada. Nunc sem lacus, accumsan quis, faucibus non, congue vel, arcu. Ut scelerisque hendrerit tellus. Integer sagittis. Vivamus a mauris eget arcu gravida tristique. Nunc iaculis mi in ante. Vivamus imperdiet nibh feugiat est.</p>
<p></p>
<blockquote>Ut convallis, sem sit amet interdum consectetuer, odio augue aliquam leo, nec dapibus tortor nibh sed augue. Integer eu magna sit amet metus fermentum posuere. Morbi sit amet nulla sed dolor elementum imperdiet. Quisque fermentum. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Pellentesque adipiscing eros ut libero. Ut condimentum mi vel tellus. Suspendisse laoreet. Fusce ut est sed dolor gravida convallis. Morbi vitae ante. Vivamus ultrices luctus nunc. Suspendisse et dolor. Etiam dignissim. Proin malesuada adipiscing lacus. Donec metus. Curabitur gravida</blockquote>.

<p></p>
</body


bernie
gbm is offline   Reply With Quote
Advert
Old 08-23-2015, 11:27 AM   #3
Phssthpok
Age improves with wine.
Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.Phssthpok knows how to set a laser printer to stun.
 
Posts: 576
Karma: 95229
Join Date: Nov 2014
Device: Kindle Oasis, Kobo Libra II
You need four "replace all"s in total:
Quote:
Originally Posted by phossler View Post
1. Put <p> tags around the blocks of text that don't have any? I can delete the <br/>'s
Two steps:
(1) Wrap the body in <p>...</p>
Find: <body>(.*?)</body>
Replace: <body>\n<p>\1</p>\n</body>

(2) Replace all the <br/>s
Find: <br/>
Replace: </p>\n<p>

Quote:
Originally Posted by phossler View Post
2. Put <p> tags around the blocks of text inside <blockquotes>'s that don't have any? I can delete the <blockquote>'s manually if I need to
Find: <blockquote>(.*?)</blockquote>
Replace: <blockquote>\n<p>\1</p>\n</blockquote>

Or, if you just want plain paras instead of blockquotes,

Replace: <p>\1</p>

Quote:
Originally Posted by phossler View Post
3. Remove the spaces that are outside <p>'s?
Find: </p>\s*<p>
Replace: </p>\n<p>

Last edited by Phssthpok; 08-23-2015 at 11:29 AM.
Phssthpok is offline   Reply With Quote
Reply


Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
A New Epub Creator: txt to epub, word to epub oxen ePub 120 07-22-2019 02:28 PM
redo epub to epub - don't use original-epub cybmole Conversion 8 02-20-2014 05:21 AM
epub to epub conversion problem with regex spanning multiple input files ctop Conversion 2 02-12-2012 01:56 AM
[Old Thread] Reading epub on viewer inexplicably changes the time stamp of epub greenapple Library Management 20 03-19-2011 10:18 PM
epub, ePub, EPUB, warum blos ePub? flowoeB Lounge 5 11-27-2009 09:37 AM


All times are GMT -4. The time now is 06:23 AM.


MobileRead.com is a privately owned, operated and funded community.