05-21-2018, 09:07 AM | #1 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
NYTimes - unclosed comment tag
Hi Kovid -
The NYTimes recipe leaves a stray --> (sometimes two) in many articles, which looks like an unmatched closing comment tag. I've tried replacing it in preprocess and postprocess_html but can't figure out how to do so. Any ideas on the best way to remove this from articles?
Code:
def postprocess_html(self, soup, first_fetch):
    findcomment = soup.findAll(text = re.compile('--gt&;'))
    for comment in findcomment:
        fixed_text = unicode(comment).replace('--gt&;', '')
        comment.replace_with(fixed_text)
    return soup
Last edited by bobbysteel; 05-21-2018 at 09:15 AM. |
05-21-2018, 09:38 AM | #2 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The easiest way to get rid of comments is to use preprocess_regexps and simply replace them with an empty string. Something like:
Code:
preprocess_regexps = [(re.compile(r'(?s)<!--.*?-->'), lambda m: '')] |
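To see what that pattern does outside of calibre, here is a minimal standalone sketch using only Python's `re` module. The sample HTML string is made up for illustration and is not taken from an NYTimes page.

```python
import re

# Same pattern as in the post above: (?s) makes "." match newlines too,
# and ".*?" is non-greedy, so each match stops at the first "-->".
COMMENT_RE = re.compile(r'(?s)<!--.*?-->')

# Hypothetical sample markup with one single-line and one multi-line comment.
sample = '<p>Before<!-- single-line --></p>\n<!-- multi\nline -->\n<p>After</p>'

cleaned = COMMENT_RE.sub('', sample)
print(cleaned)
```

Both comments are removed because each has a matching opener and closer. A lone `-->` with no `<!--` in front of it would not be matched by this pattern.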
05-21-2018, 10:03 AM | #3 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Great thanks. Shall I check this in once I've tested?
|
05-21-2018, 10:09 AM | #4 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Sure.
|
05-21-2018, 10:20 AM | #5 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
I'm also getting a very occasional error on the existing recipe:
Code:
has_supplemental = article.find(**classes('story-body-supplemental')) is not None
Code:
Could not fetch link https://www.nytimes.com/2018/05/18/watching/cheers-best-episodes.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 520, in process_links
  File "site-packages\calibre\web\fetch\simple.py", line 227, in get_soup
  File "<string>", line 107, in preprocess_html
AttributeError: 'NoneType' object has no attribute 'find' |
05-21-2018, 10:29 AM | #6 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Hm, I also can't seem to get the result from the regexp you suggested.
Code:
preprocess_regexps = [(re.compile(r'(?s)-->'), lambda m: '')]
preprocess_regexps = [(re.compile(r'(?s)-->'), lambda m: '')]
Code:
<p class="calibre_2">-->--> </p> |
05-21-2018, 12:01 PM | #7 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's not the regexp I suggested; you need both the opening and closing parts of the comment. If you want to debug it or do the replacing manually, you can implement preprocess_raw_html in the recipe.
And I committed a fix for the AttributeError |
05-21-2018, 12:11 PM | #8 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
I tried the version you suggested, but shortened it, since only the closing part of the comment seems to be left over, no?
|
05-21-2018, 12:27 PM | #9 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Doh, OK, I tried again and it worked. Silly me.
|
05-22-2018, 06:17 AM | #10 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Weird. Kovid, the comment tag is coming back now. Why is it missing just the closing tag? Is there a parsing error somewhere missing the multi-line comments somehow? Your regexp misses it too even with DOTALL enabled.
|
05-22-2018, 06:24 AM | #11 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Unless the NYT's markup is invalid (i.e. contains nested comments), that regexp will remove all comments. Implement preprocess_raw_html() in the recipe, save the raw HTML and look at it in an editor. Track down where the closing comment is coming from by comparing it to the final HTML generated by the recipe.
|
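The point about nested comments can be reproduced in isolation. The sketch below applies the regexp suggested earlier in the thread to a made-up string containing a nested comment. Whether the NYT pages actually contain such markup is not established in this thread, so treat this only as an illustration of the failure mode.

```python
import re

COMMENT_RE = re.compile(r'(?s)<!--.*?-->')

# Hypothetical invalid markup: a comment opened inside another comment.
nested = '<p>text</p><!-- outer <!-- inner --> tail --><p>more</p>'

# The non-greedy match runs from the first "<!--" to the first "-->",
# so the outer comment's own closer is left behind.
leftover = COMMENT_RE.sub('', nested)
print(leftover)
```

The result still contains ` tail -->`, which is the kind of stray closing tag described in the first post.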
05-22-2018, 06:32 AM | #12 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
How do I save the raw HTML easily from the command line? Is there syntax to drop it into a temp folder?
Thx |
05-22-2018, 06:35 AM | #13 |
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
open('/path/to/tempfile.html', 'wb').write(raw_html)
|
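A slightly fuller sketch of the same idea, for anyone who wants one file per article rather than a single overwritten temp file. The helper name `dump_raw_html` and the hashed file naming are assumptions made for this example and are not part of calibre. The only calibre detail relied on is that a recipe can define `preprocess_raw_html(self, raw_html, url)` and return the HTML.

```python
import hashlib
import os
import tempfile


def dump_raw_html(raw_html, url, out_dir=None):
    """Write raw_html to a file named after a hash of url and return its path.

    Hypothetical helper: out_dir defaults to the system temp folder.
    """
    out_dir = out_dir or tempfile.gettempdir()
    # Accept either bytes or text, since the file is opened in binary mode.
    data = raw_html if isinstance(raw_html, bytes) else raw_html.encode('utf-8')
    name = hashlib.sha1(url.encode('utf-8')).hexdigest()[:12] + '.html'
    path = os.path.join(out_dir, name)
    with open(path, 'wb') as f:
        f.write(data)
    return path


# Possible wiring inside a recipe class (not run here):
#
#     def preprocess_raw_html(self, raw_html, url):
#         dump_raw_html(raw_html, url)
#         return raw_html
```

Naming files by a hash of the URL keeps downloads of different articles from overwriting each other, at the cost of less readable file names.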
05-22-2018, 06:36 AM | #14 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Just looking at a random article, it doesn't seem to be unmatched anywhere I can see. There are just lots of comments with one space in between. I can't understand from the source why this regex would fail with only 7 comment tags on the page.
|
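Checking by eye is easy to get wrong, so one option is to scan the saved raw HTML for closers that appear while no comment is open. The function below is a rough sketch, not a real HTML parser: `find_stray_closers` is a name invented here, and it ignores cases such as `-->` appearing inside script or style content.

```python
import re

TOKEN_RE = re.compile(r'<!--|-->')


def find_stray_closers(html):
    """Return offsets of '-->' tokens found while no comment is open."""
    stray = []
    in_comment = False
    for m in TOKEN_RE.finditer(html):
        if m.group() == '<!--':
            # An opener inside an already open comment does not nest.
            in_comment = True
        elif in_comment:
            # This '-->' closes the currently open comment.
            in_comment = False
        else:
            stray.append(m.start())
    return stray


# Hypothetical sample: two well-formed comments followed by an extra closer.
sample = '<p>a</p><!-- one --> <!-- two -->--> <p>b</p>'
for offset in find_stray_closers(sample):
    # Print each stray closer with a little leading context.
    print(offset, repr(sample[max(0, offset - 20):offset + 3]))
```

If this reports nothing for the raw HTML but the stray `-->` still shows up in the output, that would suggest it is introduced later in processing rather than present in the source.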
05-22-2018, 06:42 AM | #15 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
|
|