|
|
#1 |
|
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Removing Everything But Formatted Text
It's easy to remove all the non p-tags with a regex - and wind up with plain text - but I'm stumped about how to remove all the non p-tags except the ones within paragraphs (such as <span>, <em>, etc.). I've Googled around and the consensus seems to be that regex is useless for parsing nested HTML tags. Is that really true? |
|
|
|
|
|
#2 | |
|
Staff to 4 Cats
Posts: 10,707
Karma: 2485850
Join Date: Aug 2009
Location: The (original) Silicon Valley, USA
Device: Galaxy Tab 2,Black Astak PEz, K4NT(now Wifes)
|
Quote:
Convert the TXT (cleaned of all style) to HTML/EPUB first. IIRC, Notepad++ can strip HTML.
__________________
Using: Ubuntu(32 bit):Oneric,Precise and XPpro SP3, W7HP(64)- - Libre Office w/Writer2EPUB
|
|
|
|
|
|
|
|
|
|
#3 | |
|
Grand Sorcerer
Posts: 6,824
Karma: 23146860
Join Date: Jan 2010
Device: Kindle Fire HD, Kindle 2
|
Quote:
__________________
“Politics: A strife of interests masquerading as a contest of principles. The conduct of public affairs for private advantage.” |
|
|
|
|
|
|
#4 | |
|
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Quote:
I don't want to strip the style. I want to keep the style and formatting, and remove everything outside of the paragraphs, such as <div>, <script>, etc. I'd prefer to use Sigil, but how would I do this in Notepad++? |
|
|
|
|
|
|
#5 |
|
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
In that case, why does Sigil - whose whole purpose is editing HTML - use regex? (Not complaining, Sigil is awesome. Just curious.)
|
|
|
|
|
|
#6 | ||
|
Grand Sorcerer
Posts: 6,824
Karma: 23146860
Join Date: Jan 2010
Device: Kindle Fire HD, Kindle 2
|
Quote:
What you want to do goes beyond the normal definition of editing or even Searching & Replacing. You're looking for something that automatically transforms code into new code, following conventions that you want to be able to specify (and preferably with no data loss). That's a whole different ball o' wax, and not something that's easily incorporated into a program (not without hard-coding the transformation rules, anyway, which would seriously limit the feature's usefulness to an end user).
Quote:
Last edited by DiapDealer; 02-24-2013 at 12:03 PM. |
||
|
|
|
|
|
#7 | ||||
|
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Quote:
Quote:
Quote:
Quote:
|
||||
|
|
|
|
|
#8 |
|
Grand Sorcerer
Posts: 6,824
Karma: 23146860
Join Date: Jan 2010
Device: Kindle Fire HD, Kindle 2
|
Sorry, I tried as best I could to explain, but apparently failed. Unfortunately, you may have to accept the fact that Sigil can't do what you want without understanding why.
|
|
|
|
|
#9 |
|
Staff to 4 Cats
Posts: 10,707
Karma: 2485850
Join Date: Aug 2009
Location: The (original) Silicon Valley, USA
Device: Galaxy Tab 2,Black Astak PEz, K4NT(now Wifes)
|
(?sm)</p>\s+(.+?)\s+<p>
Should work to remove things outside those tags. Don't try this on any copy you want to be usable after you are done. YOU WERE WARNED: there are other valid things between a closing </p> and the next <p> that should not be removed. The list is big, so I am not wasting my time typing it.
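For anyone who wants to see what that pattern actually eats before running it on a book, here is a minimal Python sketch. The replacement string `</p>\n<p>` and the function name `strip_between_paragraphs` are assumptions for illustration; the post does not say what to put in the Replace box.

```python
import re

# The pattern from the post, unchanged. (?sm) turns on DOTALL and MULTILINE.
BETWEEN_PARAGRAPHS = re.compile(r"(?sm)</p>\s+(.+?)\s+<p>")


def strip_between_paragraphs(html: str) -> str:
    """Replace whatever sits between a </p> and the next <p> with a newline.

    The replacement text is an assumption; it is not given in the thread.
    """
    return BETWEEN_PARAGRAPHS.sub("</p>\n<p>", html)


sample = (
    "<p>One <em>two</em></p>\n"
    '<div class="ad">junk</div>\n'
    "<p>Three</p>\n"
    "<h2>Chapter 2</h2>\n"
    "<p>Four</p>"
)
# The <div> goes, but so does the <h2> heading.
print(strip_between_paragraphs(sample))

# Worse: with nothing between paragraphs, (.+?) must still match something,
# so it swallows the middle paragraph whole.
print(strip_between_paragraphs("<p>A</p>\n<p>B</p>\n<p>C</p>"))
```

The second call shows why the warning matters: the pattern needs at least one character between `</p>` and `<p>`, so on back-to-back paragraphs it consumes a whole paragraph as the "junk". It also only recognises a bare `<p>`, not `<p class="...">`.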
|
|
|
|
|
|
#10 |
|
Addict
Posts: 305
Karma: 1489094
Join Date: Nov 2011
Device: none
|
What makes you think all style and formatting is defined inside the <p> tags?
|
|
|
|
|
|
#11 |
|
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Very true! Sorry if I came across as pissy. I just don't understand why Sigil can select expression x but not everything besides expression x. It can select the letter "Q" and every letter besides "Q".
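One way to get "everything besides x" removed, outside of a Find & Replace box, is to turn the problem around: collect the matches of x and throw the rest away. A rough Python sketch follows, assuming paragraphs are always explicitly closed and never nested (true for valid XHTML, not guaranteed for arbitrary files). The function name `keep_only_paragraphs` is made up for this example.

```python
import re

# "<p" followed by a word boundary, so <p> and <p class="..."> match
# but <pre> does not. DOTALL lets a paragraph span several lines.
PARAGRAPH = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL | re.IGNORECASE)


def keep_only_paragraphs(html: str) -> str:
    """Keep each <p>...</p> block with its inner tags; discard everything else."""
    return "\n".join(PARAGRAPH.findall(html))


sample = (
    "<div><script>var x = 1;</script>\n"
    '<p class="a">One <span>two</span></p>\n'
    "<hr/>\n"
    "<p>Three <em>four</em></p></div>"
)
print(keep_only_paragraphs(sample))
```

The inline `<span>` and `<em>` survive because they sit inside the matched text. Headings, images, and anything else outside a paragraph are dropped, and a stray `</p>` inside a comment or script would still fool the pattern.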
|
|
|
|
|
|
#12 |
|
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
|
|
|
|
|
|
#13 | |
|
Junior Member
Posts: 9
Karma: 10
Join Date: Feb 2013
Device: Iphone 4
|
Quote:
\s<[^p/] works, but only occasionally. It assumes that the epub is well-formed, with line breaks between non-p tags and spaces between other tags. You're my hero! I'll definitely tinker with this regex some more! |
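To make the "only occasionally" concrete, here is a small Python sketch that lists what \s<[^p/] flags in a made-up sample. The sample markup and the name `find_flagged` are assumptions for illustration only.

```python
import re

# Whitespace, then "<", then any single character except "p" or "/".
NON_P_OPENER = re.compile(r"\s<[^p/]")


def find_flagged(html: str) -> list[str]:
    """Return the raw text of every spot the pattern flags."""
    return NON_P_OPENER.findall(html)


sample = (
    "<p>Text</p>\n"
    "<div>x</div> <span>y</span><script>z</script>\n"
    "<pre>code</pre>"
)
print(find_flagged(sample))  # ['\n<d', ' <s']
```

The `<div>` and `<span>` are found because whitespace precedes them. The `<script>` is missed because it directly follows `</span>`, and `<pre>` is missed because its name starts with "p". The pattern also only marks where a tag opens; it says nothing about where the tag or its content ends.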
|
|
|
|
|
|
#14 |
|
Book Twiddler
Posts: 960
Karma: 1087515
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Beware!
When you fiddle with regex, you can make whole sections of text disappear in such a way that you won't notice they are gone until you have invested a fair amount of time in further editing.
With this kind of experimentation, you want to save under a new name every five minutes at least. If there is someone else who is familiar with your material, it would be beneficial for them to double-check your work, because after fiddling for an hour or so you will not see such changes, or maybe even care! |
|
|
|
|
|
#15 |
|
Addict
Posts: 305
Karma: 1489094
Join Date: Nov 2011
Device: none
|
Are we really asking to strip everything except paragraphs, bold and italic? I suppose it's a fair assumption these might always be inline, inside the <p> tags. Might be a good idea to retain header tags as well?
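If headings are to be kept along with paragraphs, a parser is a safer tool than a regex. Below is a minimal sketch using Python's standard `html.parser`, which is outside Sigil and not something proposed in this thread. The whitelist `KEEP_BLOCKS` and the names `BlockKeeper` and `keep_blocks` are assumptions; adjust the whitelist to taste.

```python
from html.parser import HTMLParser

# Assumed whitelist: paragraphs and headings. Everything nested inside them
# (em, span, b, i, br, ...) is kept as-is; everything outside is dropped.
KEEP_BLOCKS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockKeeper(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through untouched.
        super().__init__(convert_charrefs=False)
        self.kept = []
        self.current = None  # name of the kept block we are inside, if any

    def handle_starttag(self, tag, attrs):
        if self.current is None and tag in KEEP_BLOCKS:
            self.current = tag
        if self.current is not None:
            self.kept.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied only inside a kept block.
        if self.current is not None:
            self.kept.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.current is not None:
            self.kept.append(f"</{tag}>")
            if tag == self.current:
                self.current = None
                self.kept.append("\n")

    def handle_data(self, data):
        if self.current is not None:
            self.kept.append(data)

    def handle_entityref(self, name):
        if self.current is not None:
            self.kept.append(f"&{name};")

    def handle_charref(self, name):
        if self.current is not None:
            self.kept.append(f"&#{name};")


def keep_blocks(html: str) -> str:
    parser = BlockKeeper()
    parser.feed(html)
    parser.close()
    return "".join(parser.kept)


sample = (
    '<div id="wrap"><script>var x = 1;</script><h2>Title</h2>'
    '<p class="a">One <em>two</em> &amp; three<br/></p>'
    "<table><tr><td>junk</td></tr></table></div>"
)
print(keep_blocks(sample))
```

This keeps the `class` attribute on the paragraph but not the stylesheet it refers to, so styling defined outside the kept tags still has to be carried over separately.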
|
|
|
|