02-24-2013, 09:52 AM | #1 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Removing Everything But Formatted Text
I've been looking for a way in Sigil to delete everything in an epub except the stuff between <p> tags. In other words, to remove everything in a file but <p.*/p>.
It's easy to remove all the non p-tags with a regex - and wind up with plain text - but I'm stumped about how to remove all the non p-tags except the ones within paragraphs (such as <span>, <em>, etc.). I've Googled around and the consensus seems to be that regex is useless for parsing nested HTML tags. Is that really true? |
02-24-2013, 10:46 AM | #2 | |
Well trained by Cats
Posts: 30,373
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Convert the TXT (cleaned of all style) to HTML/EPUB first. IIRC, Notepad++ can strip HTML. |
|
02-24-2013, 11:16 AM | #3 | |
Grand Sorcerer
Posts: 27,903
Karma: 198500000
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
|
|
02-24-2013, 11:18 AM | #4 | |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Quote:
I don't want to strip the style. I want to keep the style and formatting, and remove everything outside of the paragraphs, such as <div>, <script>, etc. I'd prefer to use Sigil, but how would I do this in Notepad++? |
|
02-24-2013, 11:20 AM | #5 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
In that case, why does Sigil - whose whole purpose is editing HTML - use regex? (Not complaining, Sigil is awesome. Just curious.)
|
02-24-2013, 11:51 AM | #6 | ||
Grand Sorcerer
Posts: 27,903
Karma: 198500000
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
What you want to do goes beyond the normal definition of editing or even Searching & Replacing. You're looking for something that automatically transforms code into new code, with conventions you want to be able to specify (and preferably with no data loss). That's a whole different ball o' wax, and not something that's easily incorporated into a program (not without hard-coding the transformation rules, anyway, which would seriously limit the feature's usefulness to an end user). Quote:
Last edited by DiapDealer; 02-24-2013 at 12:03 PM. |
||
02-24-2013, 04:37 PM | #7 | ||||
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Quote:
Quote:
Quote:
Quote:
|
||||
02-24-2013, 05:35 PM | #8 |
Grand Sorcerer
Posts: 27,903
Karma: 198500000
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Sorry, I tried as best I could to explain, but apparently failed. Unfortunately, you may have to accept the fact that Sigil can't do what you want without understanding why.
|
02-24-2013, 05:43 PM | #9 |
Well trained by Cats
Posts: 30,373
Karma: 58053698
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
(?sm)</p>\s+(.+?)\s+<p>
Should work to remove things outside those tags. Don't try this on any copy you want to be usable after you are done. YOU WERE WARNED: there are other valid things between the closing </p> and the next <p> that should not be removed. The list is big, so I am not wasting my time typing it. |
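For anyone who wants to see what that pattern does before running it in Sigil, here is a minimal sketch in Python. The function name and the sample strings are made up for illustration; only the regex comes from the post. Because the match includes the </p> and <p> tags themselves, the sketch assumes you put them back in the replacement.

```python
import re

# The pattern from the post: whatever sits between a closing </p> and the next <p>.
BETWEEN_PARAGRAPHS = re.compile(r"(?sm)</p>\s+(.+?)\s+<p>")


def strip_between_paragraphs(html: str) -> str:
    """Drop the content between </p> and the next <p> (hypothetical helper)."""
    # The match swallows the </p> and <p> tags too, so restore them.
    return BETWEEN_PARAGRAPHS.sub("</p>\n<p>", html)


# Case 1: junk between every pair of paragraphs. This behaves as intended.
sample = (
    "<p>First <em>one</em>.</p>\n"
    '<div class="ad">junk</div>\n'
    "<p>Second.</p>\n"
    "<h2>Chapter 2</h2>\n"
    "<p>Third.</p>\n"
)
print(strip_between_paragraphs(sample))

# Case 2: paragraphs that already sit next to each other.
# (.+?) must match at least one character, so it eats the middle paragraph.
adjacent = "<p>A</p>\n<p>B</p>\n<p>C</p>\n"
print(strip_between_paragraphs(adjacent))
```

The second case is one concrete reason for the warning above: when nothing separates two paragraphs, the pattern runs on to the following <p> and silently deletes a whole paragraph. Note also that the <h2> in the first sample is removed along with the <div>.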
02-24-2013, 05:57 PM | #10 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
What makes you think all style and formatting is defined inside the <p> tags?
|
02-24-2013, 06:05 PM | #11 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Very true! Sorry if I came across as pissy, I just don't understand why Sigil can choose expression x but not everything besides expression x. It can choose the letter "Q" and every letter besides "Q".
|
02-24-2013, 06:06 PM | #12 |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
|
02-24-2013, 06:21 PM | #13 | |
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Quote:
\s<[^p/] Which works, but only occasionally. It operates under the assumption that the epub is well-formed, with lines between non-p tags and spaces between other tags. You're my hero! I'll definitely tinker with this regex some more! |
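A rough Python sketch of why \s<[^p/] only works occasionally. The helper name and the sample markup are hypothetical; the pattern is the one quoted above. It lists the tag names the pattern lands on, which makes the hits and misses easy to see.

```python
import re

# The pattern from the post: whitespace, then "<", then any character except "p" or "/".
NON_P_TAG_START = re.compile(r"\s<[^p/]")


def flagged_tags(html: str) -> list[str]:
    """Return the tag names that the pattern lands on (hypothetical helper)."""
    names = []
    for match in NON_P_TAG_START.finditer(html):
        start = match.end() - 1  # index of the first character after "<"
        # Read up to the next whitespace or ">" to get the tag name.
        names.append(re.match(r"[^\s>]*", html[start:]).group(0))
    return names


sample = (
    "<div>\n"
    " <script>x()</script>\n"
    '<p>Text <span class="a">inline</span> more.</p><div>tail</div>\n'
    " <pre>code</pre>\n"
    "</div>"
)
print(flagged_tags(sample))  # ['script', 'span']
```

In this sample the pattern finds the <script> as hoped, but it also hits the <span> inside the paragraph (it follows a space), misses both <div> tags (no whitespace in front of them), and skips <pre> because its name starts with "p".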
|
02-25-2013, 07:14 AM | #14 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Beware!
With your fiddling with regex, you can make whole sections of text disappear in such a way that you won't see them until you have invested some amount of time afterwards in further editing.
With this kind of experimentation, you want to save with a new name every five minutes at least. If there is someone else who is familiar with your material, it would be beneficial for them to double check you because after fiddling for an hour or so, you will not see such changes or maybe even care! |
02-25-2013, 07:49 AM | #15 |
Guru
Posts: 878
Karma: 2457540
Join Date: Nov 2011
Device: none
|
Are we really asking to strip everything except paragraphs, bold and italic? I suppose it's a fair assumption these might always be inline, inside the <p> tags. Might be a good idea to retain header tags as well?
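If the goal really is "keep paragraphs and headings, with whatever inline tags are inside them", a parser is a safer tool than a regex. Below is a minimal sketch using Python's standard html.parser module, run outside Sigil on a copy of the file. The class name, function name, kept-tag list, and sample markup are all assumptions for illustration, and it assumes every kept block is properly closed.

```python
from html.parser import HTMLParser

# Block-level tags to keep; adjust to taste (assumption: paragraphs and headings only).
KEEP_BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockExtractor(HTMLParser):
    """Collect kept blocks, including any inline markup nested inside them."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.current = None  # name of the kept block we are inside, if any
        self.parts = []      # pieces of the block being collected
        self.blocks = []     # finished blocks, in document order

    def handle_starttag(self, tag, attrs):
        if self.current is None and tag in KEEP_BLOCKS:
            self.current = tag
        if self.current is not None:
            self.parts.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        if self.current is not None:  # e.g. <br/> inside a paragraph
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.current is None:
            return
        self.parts.append(f"</{tag}>")
        if tag == self.current:  # the kept block is complete
            self.blocks.append("".join(self.parts))
            self.parts = []
            self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.current is not None:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.current is not None:
            self.parts.append(f"&#{name};")


def keep_paragraphs(html: str) -> str:
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)


sample = (
    "<div><script>track()</script>"
    "<h2>Chapter 1</h2>"
    '<p class="first">It was <em>very</em> dark &amp; cold.</p>'
    '<div class="ad">Buy now</div>'
    "<p>The end.</p></div>"
)
print(keep_paragraphs(sample))
```

This keeps class attributes on the surviving tags, but the stylesheet those classes refer to lives outside the paragraphs, so the output is a list of blocks to paste back into a proper XHTML skeleton, not a finished file.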
|