Old 03-20-2018, 07:39 PM   #32
Divingduck
You're welcome.

I had a bit of time to take a closer look at the problem.

I noticed two things.
The first is to keep in mind when a regex is applied. You are using preprocess_regexps, which means the regex works on the downloaded HTML as its input. So you can check debug\input\ to see what the downloaded HTML file looks like to calibre at the moment you are manipulating it.
The second problem is that the class you are looking for includes spaces in its name, and that does not work (I think it never has).

Taking that into account, I would do it slightly differently. I don't care about the complete class string; I look only at the end of the class name for a unique identification:

... c-overline--article"> ... </span> ...
Code:
(re.compile(r'(c-overline--article">[^>]*)(</span>)', re.DOTALL|re.IGNORECASE), lambda match: match.group(1) + ': ' + match.group(2))
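To try the pattern outside calibre, here is a minimal sketch using only Python's re module. The sample markup is made up for illustration (the leading class names and the text are assumptions, not taken from the real page), and calling sub() directly is only my approximation of what calibre does with each preprocess_regexps pair.

```python
import re

# Pattern and replacement function exactly as in the entry above.
pattern = re.compile(r'(c-overline--article">[^>]*)(</span>)', re.DOTALL | re.IGNORECASE)
replacement = lambda match: match.group(1) + ': ' + match.group(2)

# Hypothetical markup: only the trailing "c-overline--article" class matters,
# the other class names and the text are invented for this test.
sample_html = ('<span class="c-overline u-uppercase c-overline--article">Konjunktur</span>'
               '<span>Headline</span>')

# Insert ': ' just before the closing </span> of the overline element.
result = pattern.sub(replacement, sample_html)
print(result)
```

The overline text should come out followed by ': ' inside its span, while the second span stays untouched because it does not carry the class.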
I have attached an updated version of the recipe.
Attached Files
File Type: zip WirtschaftsWoche_AGe_V4.3.zip (1.8 KB, 298 views)