Old 05-05-2021, 03:09 PM   #21
salamanderjuice
For completeness, here's an example using an HTML/XML parser in R, my programming language of choice. I put one of the HTML snippets from this thread in a file called "test.html".

Code:
## Install the xml2 package (only needed once) and load it
install.packages("xml2")
library(xml2)

## Read in the HTML file; RECOVER lets the parser carry on past malformed markup
arf <- read_html("~/test.html", options = "RECOVER")
## Find all span nodes using an XPath selector
spans <- xml_find_all(arf, "//span")
## Replace each span with just its contents
xml_replace(spans, xml_contents(spans))

## Write out the modified document
write_html(arf, "~/testOut.html")
This nets us an HTML file that looks like:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body id="xx" lang="en-US" style="width:396px;height:612px" xml:lang="en-US">
		<div class="Basic-Text-Frame" id="_idContainer250">
			<div style="width:5760px;height:9540px;position:absolute;top:0px;left:0px;-webkit-transform-origin: 0% 0%; -webkit-transform: translate(0px,5.83px) rotate(0deg) scale(0.05);transform-origin: 0% 0%; transform: translate(0px,5.83px) rotate(0deg) scale(0.05);">
				<p class="Chapter-Title ParaOverride-1">Time to Forgive</p>
<p class="Drop-Cap ParaOverride-1">“I want you to imagine your reflection in a beautiful mirror—the person who caused 
</p>
</div>
</div>
</body></html>
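One caveat about the xml_replace() call above: as far as I can tell it pairs each span with one replacement node by position, so it only does what you expect when every span holds exactly one child node (typically a single run of text). Here is a rough sketch of a guard for that, using a made-up inline snippet instead of test.html; the names n_children and simple are just for illustration.

```r
library(xml2)

# Hypothetical snippet: the first span has one child (text),
# the second has two (text plus a <b> element)
doc <- read_html("<p><span>one</span> <span>two <b>three</b></span></p>")
spans <- xml_find_all(doc, "//span")

# Count child nodes per span, including text nodes
n_children <- xml_length(spans, only_elements = FALSE)

# Unwrap only the spans that have exactly one child, so spans and
# replacement nodes line up one to one
simple <- spans[n_children == 1]
invisible(xml_replace(simple, xml_contents(simple)))

# Spans with several children are left alone and can be inspected
remaining <- xml_find_all(doc, "//span")
```

Anything left in remaining needs a different treatment (for example handling its children one at a time), which I have not sketched here.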
It's way easier to edit HTML/XML with a parser than with regex.

EDIT: I think R mangled the em dashes and quotes with its crappy text support though.
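If the mangling comes from the file's encoding being guessed wrong (that is an assumption on my part, I have not confirmed the cause), one thing to try is telling xml2 the encoding explicitly on both read and write. A minimal sketch, assuming the source file is UTF-8 and using temporary files and a made-up snippet in place of test.html:

```r
library(xml2)

# Hypothetical snippet with curly quotes and an em dash inside a span
html <- "<html><body><p><span class=\"x\">\u201cmirror\u2014the person\u201d</span></p></body></html>"
in_path <- tempfile(fileext = ".html")
out_path <- tempfile(fileext = ".html")
writeLines(html, in_path, useBytes = TRUE)  # write the UTF-8 bytes as they are

# State the input encoding instead of leaving it to detection
doc <- read_html(in_path, encoding = "UTF-8")

# Same span unwrapping as in the post (each span here has one text child)
spans <- xml_find_all(doc, "//span")
invisible(xml_replace(spans, xml_contents(spans)))

# State the output encoding as well (UTF-8 is also the default here)
write_html(doc, out_path, encoding = "UTF-8")

# Read the result back to check that the text survived the round trip
check <- read_html(out_path, encoding = "UTF-8")
round_trip_text <- trimws(xml_text(xml_find_first(check, "//p")))
```

If round_trip_text still shows the quotes and dash correctly, the remaining suspect is whatever is used to view the file rather than the parser.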

Last edited by salamanderjuice; 05-05-2021 at 03:12 PM.