Thanks Divingduck & Kovid,
I'm getting better at this one, and slowly things are turning out as neat as I wanted. However, this one thing is bugging me:
In the source I have, e.g. (look out for the bold tags):
Code:
<h2 class="c-headline c-headline--article u-margin-m"><span
class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">Nikolai Gluschkow</span> Russischer Geschäftsmann tot in London entdeckt
</h2>
<div class="c-metadata u-margin-xl ">
<div>
</div>
<time
datetime="2018-03-13T19:38:19+01:00">13. März 2018</time>
<span>, aktualisiert
<time datetime="2018-03-13T19:40:57+01:00">13. März 2018, 19:40 Uhr</time>
</span>
<span class="c-metadata__source"> | Quelle: <a href="http://www.handelsblatt.com"
target="_blank">Handelsblatt Online</a></span>
</div>
[...]
<div class="o-article__content-element o-article__content-element--richtext">
<div class="u-richtext ajaxify"
data-command='{"richtext": {}}'>
<p><span class="hcf-location-mark">London</span>Ein mit dem 2013 verstorbenen Oligarchen Boris Beresowski befreundeter russischer Geschäftsmann ist in London tot aufgefunden worden. Nikolai Gluschkow sei nicht mehr am Leben, sagte Anwalt Andrej Borowkow am Dienstag russischen Medien. Er wisse aber nichts über die Umstände und den Zeitpunkt des Todes des 68-Jährigen.</p> </div>
</div>
But I do not understand why I can't add a ": " after the name "Nikolai Gluschkow". As stated above, my code is derived from the "hcf-location-mark" bit, and I just don't understand why it's not working that way:
Code:
preprocess_regexps = [
    # Location mark: append ". " inside the span
    (re.compile(r'(<span class="hcf-location-mark">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE),
     lambda match: match.group(1) + '. ' + match.group(2)),
    # Overline with the name: append ": " inside the span (the rule that has no effect)
    (re.compile(r'(<span class="c-overline c-overline--alternate u-uppercase u-letter-spacing u-margin-m c-overline--article">[^<]*)(</span>)', re.DOTALL|re.IGNORECASE),
     lambda match: match.group(1) + ': ' + match.group(2)),
]
In case the colon (":") was the problem, I also tried the HTML entity "& #058; " instead (without the space; otherwise it won't show up here), but still to no avail.
Any hints as to what I'm doing wrong here?
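To narrow this down outside calibre, here is a minimal sketch that runs the overline pattern against the headline excerpt from above. It assumes the real page has the same line break between "<span" and "class=" as the excerpt. The names headline, OVERLINE_CLASS, strict, tolerant and add_colon are made up for this test and are not part of the recipe. The "tolerant" pattern with \s+ is only a guess at a variant, not something I have confirmed inside calibre.

```python
import re

# Class string of the overline span, copied from the source excerpt.
OVERLINE_CLASS = (
    'c-overline c-overline--alternate u-uppercase u-letter-spacing '
    'u-margin-m c-overline--article'
)

# Headline excerpt; note the newline between "<span" and "class=".
headline = (
    '<h2 class="c-headline c-headline--article u-margin-m"><span\n'
    'class="' + OVERLINE_CLASS + '">Nikolai Gluschkow</span> '
    'Russischer Geschäftsmann tot in London entdeckt\n'
    '</h2>'
)

# Pattern as in the recipe: a literal single space after "<span".
strict = re.compile(
    r'(<span class="' + OVERLINE_CLASS + r'">[^<]*)(</span>)',
    re.DOTALL | re.IGNORECASE,
)

# Hypothetical variant: \s+ accepts any whitespace, including a newline.
tolerant = re.compile(
    r'(<span\s+class="' + OVERLINE_CLASS + r'">[^<]*)(</span>)',
    re.DOTALL | re.IGNORECASE,
)


def add_colon(match):
    # Same replacement as the lambda in the recipe.
    return match.group(1) + ': ' + match.group(2)


strict_result = strict.sub(add_colon, headline)
tolerant_result = tolerant.sub(add_colon, headline)

print(strict_result == headline)   # True means the recipe pattern matched nothing
print(tolerant_result)
```

In this test the recipe pattern leaves the headline untouched, while the \s+ variant inserts the ": " after the name. That hints at the line break, but I can't say for sure that the downloaded page looks exactly like the excerpt.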
Thanks a lot
Hegi