Old 02-24-2013, 10:52 AM   #1
Dybbuk
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
Removing Everything But Formatted Text

I've been looking for a way on Sigil to delete everything in an epub but the stuff between <p> tags. In other words, to remove everything in a file but <p.*/p>.

It's easy to remove all the non p-tags with a regex - and wind up with plain text - but I'm stumped about how to remove all the non p-tags except the ones within paragraphs (such as <span>, <em>, etc.). I've Googled around and the consensus seems to be that regex is useless for parsing nested HTML tags. Is that really true?
Old 02-24-2013, 11:46 AM   #2
theducks
Grand Sorcerer
Posts: 15,250
Karma: 6020307
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
Quote:
Originally Posted by Dybbuk View Post
I've been looking for a way on Sigil to delete everything in an epub but the stuff between <p> tags. In other words, to remove everything in a file but <p.*/p>.

It's easy to remove all the non p-tags with a regex - and wind up with plain text - but I'm stumped about how to remove all the non p-tags except the ones within paragraphs (such as <span>, <em>, etc.). I've Googled around and the consensus seems to be that regex is useless for parsing nested HTML tags. Is that really true?
Why not use Calibre to convert to TXT?
Then convert the (cleaned of all style) TXT to HTML/EPUB.

IIRC, Notepad++ can strip HTML.
Old 02-24-2013, 12:16 PM   #3
DiapDealer
Grand Sorcerer
Posts: 9,536
Karma: 44002482
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
I've Googled around and the consensus seems to be that regex is useless for parsing nested HTML tags. Is that really true?
Maybe not utterly useless, but it's certainly not easy or foolproof using regex. To do it right, you need something capable of actually parsing html (which regex is certainly incapable of doing).
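To make the nesting problem concrete, here is a minimal sketch using Python's `re` module. The sample strings are invented for illustration and do not come from any real epub. Neither the lazy nor the greedy form of the pattern can tell which closing tag belongs to which opening tag:

```python
import re

# Invented sample markup: one <div> nested inside another.
nested = "<div>outer <div>inner</div> tail</div>"
# Invented sample markup: two sibling <div> elements.
siblings = "<div>one</div> gap <div>two</div>"

# Lazy matching stops at the FIRST </div>, which actually closes the inner div.
lazy = re.search(r"<div>.*?</div>", nested).group(0)

# Greedy matching runs to the LAST </div>, swallowing the gap between siblings.
greedy = re.search(r"<div>.*</div>", siblings).group(0)

print(lazy)    # <div>outer <div>inner</div>
print(greedy)  # <div>one</div> gap <div>two</div>
```

The pattern only sees characters; it has no notion of which tags are currently open, and that is the part a real parser keeps track of.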
Old 02-24-2013, 12:18 PM   #4
Dybbuk
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
Quote:
Originally Posted by theducks View Post
Why not use Calibre to convert to TXT?
Then convert the (cleaned of all style) TXT to HTML/EPUB.

IIRC Notepad++ can strip HTML

I don't want to strip the style. I want to keep the style and formatting, and remove everything outside of the paragraphs, such as <div>, <script>, etc.

I'd prefer to use Sigil, but how would I do this in Notepad++?
Old 02-24-2013, 12:20 PM   #5
Dybbuk
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
Quote:
Originally Posted by DiapDealer View Post
Maybe not utterly useless, but it's certainly not easy or foolproof using regex. To do it right, you need something capable of actually parsing html (which regex is certainly incapable of doing).
In that case, why does Sigil - whose whole purpose is editing HTML - use regex? (Not complaining, Sigil is awesome. Just curious.)
Old 02-24-2013, 12:51 PM   #6
DiapDealer
Grand Sorcerer
Posts: 9,536
Karma: 44002482
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by Dybbuk View Post
In that case, why does Sigil - whose whole purpose is editing HTML - use regex? (Not complaining, Sigil is awesome. Just curious.)
Because there's a difference between editing and parsing. And the Find & Replace feature in most editing software really has nothing to do with parsing code. F&R doesn't even know what "code" is. It's simply searching text for patterns you specify. Regex just happens to be one of the most flexible/powerful and common ways to achieve this.

What you want to do goes beyond the normal definition of editing or even Searching & Replacing. You're looking for something that automatically transforms code into new code—new code whose conventions you want to be able to specify (and preferably with no data-loss). That's a whole different ball o' wax.. and not something that's easily incorporated into a program (not without hard-coding the transformation rules, anyway; which would seriously limit the feature's usefulness to an end-user).

Quote:
I don't want to strip the style. I want to keep the style and formatting, and remove everything outside of the paragraphs, such as <div>, <script>, etc
What about headers? Blockquotes? There are all kinds of situations that can arise in ePubs where text you definitely don't want to lose occurs outside of the <p> tags.

Last edited by DiapDealer; 02-24-2013 at 01:03 PM.
Old 02-24-2013, 05:37 PM   #7
Dybbuk
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
Quote:
Originally Posted by DiapDealer View Post
Because there's a difference between editing and parsing. And the Find & Replace feature in most editing software really has nothing to do with parsing code.
Okay, but we're talking about Sigil here. It has everything to do with parsing code. If regex can't parse HTML, why does Sigil use regex?

Quote:
Originally Posted by DiapDealer View Post
F&R doesn't even know what "code" is. It's simply searching text for patterns you specify. Regex just happens to be one of the most flexible/powerful and common ways to achieve this.
I'll believe that regex is flexible and powerful when it allows me to differentiate between stuff within a paragraph and stuff outside a paragraph. This is incredibly basic parsing, and I'm shocked that regex can't handle it. Why can regex easily find "<p.*/p>" yet be totally incapable of identifying stuff that is NOT "<p.*/p>"?

Quote:
Originally Posted by DiapDealer View Post
What you want to do goes beyond the normal definition of editing or even Searching & Replacing. You're looking for something that automatically transforms code into new code—new code whose conventions you want to be able to specify (and preferably with no data-loss). That's a whole different ball o' wax.. and not something that's easily incorporated into a program (not without hard-coding the transformation rules, anyway; which would seriously limit the feature's usefulness to an end-user).
Huh? All I want to do is delete stuff that's not between <p> and </p>! Sigil can easily find stuff that IS between <p> and </p>. It should be just as easy to select stuff that is NOT between them. It's text selection, nothing more. Trust me, I'm not attempting to sabotage or revolutionize code conventions. I'm just a humble epub converter trying to make sense of modern software.

Quote:
What about headers? Blockquotes? There are all kinds of situations that can arise in ePubs where text you definitely don't want to lose occurs outside of the <p> tags.
I definitely want to lose all of the stuff outside of the p tags. P-tags aren't really the issue. I just want to know why I can select things between a certain parameter but not things outside the same parameter.
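One way to get "everything that is NOT X" out of a regex is to turn the question around: keep the matches and throw away the rest. Below is a rough sketch in Python rather than in Sigil's Find & Replace. The pattern, the function names and the sample markup are assumptions made up for illustration, and it relies on <p> elements never being nested inside each other:

```python
import re

# Assumed pattern: an opening <p> tag (attributes allowed) up to the nearest </p>.
# \b keeps "<p" from also matching <pre> or <param>; (?s) lets "." cross line breaks.
PARAGRAPH = re.compile(r"(?s)<p\b[^>]*>.*?</p>")

# Same idea as a single search-and-replace: a paragraph is captured as group 1,
# and any other single character falls through to the bare "." alternative.
PARAGRAPH_OR_CHAR = re.compile(r"(?s)(<p\b[^>]*>.*?</p>)|.")

def keep_only_paragraphs(markup: str) -> str:
    """Collect every paragraph match and discard whatever lies between them."""
    return "\n".join(PARAGRAPH.findall(markup))

def strip_outside_paragraphs(markup: str) -> str:
    """Replace each match with group 1 when it was a paragraph, else with nothing."""
    return PARAGRAPH_OR_CHAR.sub(lambda m: m.group(1) or "", markup)

# Hypothetical sample, not taken from a real book.
sample = (
    '<div class="ad"><script>track()</script></div>\n'
    '<p>First <em>styled</em> paragraph.</p>\n'
    '<h2>Chapter 2</h2>\n'
    '<p class="indent">Second <span class="sc">one</span>.</p>\n'
)

kept = keep_only_paragraphs(sample)
stripped = strip_outside_paragraphs(sample)
print(kept)
```

The inline <em> and <span> survive because they sit inside a match. The <h2> heading is gone along with the <div> and <script>, since it sits outside the paragraphs.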
Old 02-24-2013, 06:35 PM   #8
DiapDealer
Grand Sorcerer
Posts: 9,536
Karma: 44002482
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Sorry, I tried as best I could to explain, but apparently failed. Unfortunately, you may have to accept the fact that Sigil can't do what you want without understanding why.
Old 02-24-2013, 06:43 PM   #9
theducks
Grand Sorcerer
Posts: 15,250
Karma: 6020307
Join Date: Aug 2009
Location: (The original) Silicon Valley, USA
Device: Galaxy Tab 2, Astak Pocket Pro, K4NT
(?sm)</p>\s+(.+?)\s+<p>
Should work to remove things outside those tags.
Don't try this on any copy you want to be usable after you are done.

BUT YOU WERE WARNED that there are other valid things between the closing </p> and the next <p> that should not be removed: The list is big, so I am not wasting my time typing it.
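For anyone who wants to see what that pattern grabs before letting it loose on a book, here is a small sketch in Python's `re` module. The sample strings and the replacement text are assumptions for illustration; Sigil's own regex engine may differ in details. Note that the match includes the </p> and <p> themselves, so the replacement has to put them back, and that (?m) changes nothing here because the pattern contains no ^ or $.

```python
import re

DUCKS = re.compile(r"(?sm)</p>\s+(.+?)\s+<p>")

# Case 1: junk sits between two paragraphs. Group 1 is just the junk.
between = "<p>One.</p>\n<div>junk</div>\n<p>Two.</p>"
captured_between = DUCKS.search(between).group(1)
print(captured_between)   # <div>junk</div>

# Case 2: two paragraphs are adjacent. (.+?) must match at least one character,
# so it runs on until the next whitespace + "<p>" and captures paragraph Two.
adjacent = "<p>One.</p>\n<p>Two.</p>\n<div>junk</div>\n<p>Three.</p>"
captured_adjacent = DUCKS.search(adjacent).group(1)
print(captured_adjacent)

# Assumed replacement: restore the tags that the match consumed.
cleaned = DUCKS.sub("</p>\n<p>", between)
print(cleaned)
```

The second case shows one way the pattern can end up selecting paragraph content. The literal <p> at the end also will not match an opening tag that carries attributes, such as <p class="x">.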
Old 02-24-2013, 06:57 PM   #10
exaltedwombat
Evangelist
Posts: 462
Karma: 1703930
Join Date: Nov 2011
Device: none
Quote:
Originally Posted by Dybbuk View Post
I don't want to strip the style. I want to keep the style and formatting, and remove everything outside of the paragraphs, such as <div>, <script>, etc.

I'd prefer to use Sigil, but how would I do this in Notepad++?
What makes you think all style and formatting is defined inside the <p> tags?
Old 02-24-2013, 07:05 PM   #11
Dybbuk
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
Quote:
Originally Posted by DiapDealer View Post
Sorry, I tried as best I could to explain, but apparently failed. Unfortunately, you may have to accept the fact that Sigil can't do what you want without understanding why.
Very true! Sorry if I came across as pissy, I just don't understand why Sigil can choose expression x but not everything besides expression x. It can choose the letter "Q" and every letter besides "Q".
Old 02-24-2013, 07:06 PM   #12
Dybbuk
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
Quote:
Originally Posted by exaltedwombat View Post
What makes you think all style and formatting is defined inside the <p> tags?
I don't. I was just trying to make clear that I don't want to erase the style and formatting tags inside paragraphs.
Old 02-24-2013, 07:21 PM   #13
Dybbuk
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
Quote:
Originally Posted by theducks View Post
(?sm)</p>\s+(.+?)\s+<p>
Should work to remove things outside those tags.
Don't try this on any copy you want to be usable after you are done.

BUT YOU WERE WARNED that there are other valid things between the closing </p> and the next <p> that should not be removed: The list is big, so I am not wasting my time typing it.
Cool! I've tried it on several epubs but it often selects p-tag stuff. Maybe I'm doing it wrong? I'm using Sigil 7, under various search settings. Earlier I tried exploiting the fact that angle brackets within paragraphs usually have a space before them with this regex:

\s<[^p/]

Which works, but only occasionally. It operates under the assumption that the epub is well-formed, with lines between non-p tags and spaces between other tags.

You're my hero! I'll definitely tinker with this regex some more!
Old 02-25-2013, 08:14 AM   #14
mrmikel
Color me gone
Posts: 2,086
Karma: 1444487
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Beware!

With your fiddling with regex, you can make whole sections of text disappear in such a way that you won't see them until you have invested some amount of time afterwards in further editing.

With this kind of experimentation, you want to save with a new name every five minutes at least.

If there is someone else who is familiar with your material, it would be beneficial for them to double check you because after fiddling for an hour or so, you will not see such changes or maybe even care!
Old 02-25-2013, 08:49 AM   #15
exaltedwombat
Evangelist
Posts: 462
Karma: 1703930
Join Date: Nov 2011
Device: none
Are we really asking to strip everything except paragraphs, bold and italic? I suppose it's a fair assumption these might always be inline, inside the <p> tags. Might be a good idea to retain header tags as well?
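If the keep-list is going to grow (paragraphs, maybe headers too), a real parser with a list of block tags to retain is probably the safer route. What follows is only a sketch using Python's standard `html.parser` module, working outside Sigil on a single file's markup. The class name `BlockKeeper`, the helper `keep_blocks` and the sample are hypothetical, and it assumes the kept tags are never nested inside a tag of the same name (true for <p> and <h1> to <h6>, not for <blockquote>):

```python
from html.parser import HTMLParser

class BlockKeeper(HTMLParser):
    """Re-emit only the chosen block elements plus everything nested in them."""

    def __init__(self, keep=("p",)):
        # convert_charrefs=False so entities like &amp; are passed through as-is.
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)
        self.inside = None   # name of the kept block currently open, if any
        self.out = []

    def handle_starttag(self, tag, attrs):
        if self.inside is None and tag in self.keep:
            self.inside = tag
        if self.inside:
            # Original source text of the tag, attributes included.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied only inside a kept block.
        if self.inside:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.inside:
            self.out.append(f"</{tag}>")
            if tag == self.inside:
                self.inside = None
                self.out.append("\n")

    def handle_data(self, data):
        if self.inside:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.inside:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.inside:
            self.out.append(f"&#{name};")

def keep_blocks(markup, keep=("p",)):
    parser = BlockKeeper(keep)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)

# Hypothetical sample; the <script> deliberately contains text that looks like a paragraph.
sample = (
    '<body><h1>Title</h1>\n'
    '<div><script>var p = "<p>not a paragraph</p>";</script></div>\n'
    '<p class="first">Some <em>styled</em> &amp; <b>bold</b> text.<br/></p>\n'
    '</body>'
)

only_p = keep_blocks(sample)
p_and_h1 = keep_blocks(sample, keep=("p", "h1"))
print(p_and_h1)
```

Passing a different `keep` tuple is how headers or other block tags would be retained. The fake paragraph inside the <script> is ignored because the parser treats script content as plain data, which is exactly the kind of distinction a bare pattern cannot make.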