NYTimes - unclosed comment tag

bobbysteel · 05-21-2018, 10:07 AM

Hi Kovid -
The NYTimes recipe leaves a stray --> (sometimes two) in many articles, which looks like an unmatched comment tag. I've tried replacing it in preprocess_html and postprocess_html but can't figure out how to do so. Any ideas on the best way to remove this from articles?

Code:

    def postprocess_html(self, soup, first_fetch):
        findcomment = soup.findAll(text = re.compile('--gt&;'))
        for comment in findcomment:
            fixed_text = unicode(comment).replace('--gt&;', '')
            comment.replace_with(fixed_text)
        return soup

Doesn't seem to work and returns articles with nothing but the -->

kovidgoyal · 05-21-2018, 10:38 AM

The easiest way to get rid of comments is to use preprocess_regexps and simply replace them with an empty string, something like:

Code:

preprocess_regexps = [(re.compile(r'(?s)<!--.*?-->'), lambda m: '')]
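
As a quick illustration of what that pattern does outside of calibre, here is a minimal standalone sketch. The sample strings are made up for the example and are not taken from the NYT markup. It shows that complete comments, including multi-line ones, are removed, while a closing `-->` with no opener is left alone:

```python
import re

# Same pattern as in the recipe line above: (?s) lets '.' match newlines,
# and '.*?' is non-greedy so each comment is matched separately.
COMMENT_RE = re.compile(r'(?s)<!--.*?-->')

# Made-up sample with one multi-line comment and one single-line comment
sample = '<p>before<!-- one\nline two -->middle<!-- two -->after</p>'
cleaned = COMMENT_RE.sub('', sample)
print(cleaned)

# A closer without a matching opener does not match the pattern at all
stray = '<p>text --> more</p>'
print(COMMENT_RE.sub('', stray))
```

So if a bare `-->` survives into the output, it was probably never part of a well-formed comment in the text the regexp saw.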

bobbysteel · 05-21-2018, 11:03 AM

Great thanks. Shall I check this in once I've tested?

kovidgoyal · 05-21-2018, 11:09 AM

Sure.

bobbysteel · 05-21-2018, 11:20 AM

I'm also getting a very occasional error in the existing recipe. This line:

Code:

has_supplemental = article.find(**classes('story-body-supplemental')) is not None

is generating

Code:

Could not fetch link https://www.nytimes.com/2018/05/18/watching/cheers-best-episodes.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 520, in process_links
  File "site-packages\calibre\web\fetch\simple.py", line 227, in get_soup
  File "<string>", line 107, in preprocess_html
AttributeError: 'NoneType' object has no attribute 'find'

Is there a way to make this not fail?

bobbysteel · 05-21-2018, 11:29 AM

Hm, I also can't seem to get the result from the regexp you suggested.

Code:

    preprocess_regexps = [(re.compile(r'(?s)-->'), lambda m: '')]
    preprocess_regexps = [(re.compile(r'(?s)--&gt;'), lambda m: '')]

Still returns this in the final HTML:

Code:

<p class="calibre_2">--&gt;--&gt;  </p>

kovidgoyal · 05-21-2018, 01:01 PM

That's not the regexp I suggested; you need both the opening and closing parts of the comment. If you want to debug it or do the replacing manually, you can implement preprocess_raw_html in the recipe.

And I committed a fix for the AttributeError
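
The committed fix itself is not shown in this thread. Purely as an illustration of the kind of guard that avoids this AttributeError, here is a sketch that assumes `article` comes from an earlier lookup that can return None when the page has no matching container. `FakeTag` and the function `has_supplemental` are hypothetical stand-ins so the example runs without calibre:

```python
class FakeTag:
    """Hypothetical stand-in for a parsed tag; find() returns None on no match."""

    def __init__(self, children=None):
        self.children = children or {}

    def find(self, name):
        return self.children.get(name)


def has_supplemental(article):
    # article is None when the earlier lookup found no article container,
    # so check it before calling .find() on it.
    if article is None:
        return False
    return article.find('story-body-supplemental') is not None


print(has_supplemental(None))
print(has_supplemental(FakeTag({'story-body-supplemental': FakeTag()})))
```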

bobbysteel · 05-21-2018, 01:11 PM

I tried the version you suggested, but shortened it, as the markup seems to be missing only the closing part of the comment, no?

bobbysteel · 05-21-2018, 01:27 PM

Doh ok i tried again and it worked. Silly me.

bobbysteel · 05-22-2018, 07:17 AM

Weird. Kovid, the comment tag is coming back now. Why is it missing just the closing tag? Is there a parsing error somewhere missing the multi-line comments somehow? Your regexp misses it too even with DOTALL enabled.

kovidgoyal · 05-22-2018, 07:24 AM

Unless the NYT's markup is invalid (i.e. contains nested comments), that regexp will remove all comments. Implement preprocess_raw_html() in the recipe, save the raw HTML, and look at it in an editor. Track down where the closing comment is coming from by comparing it to the final HTML generated by the recipe.
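
One way to narrow this down on a saved copy of the raw HTML is to strip the well-formed comments and then list whatever closers remain. This is only a sketch: the helper name `find_stray_closers` and its `context` parameter are made up for the example, and it also looks for the escaped form `--&gt;` since that is what shows up in the final HTML quoted earlier.

```python
import re

COMMENT_RE = re.compile(r'(?s)<!--.*?-->')
CLOSER_RE = re.compile(r'-->|--&gt;')


def find_stray_closers(raw_html, context=40):
    """Return text snippets around any closer left after removing complete comments."""
    stripped = COMMENT_RE.sub('', raw_html)
    snippets = []
    for m in CLOSER_RE.finditer(stripped):
        start = max(0, m.start() - context)
        snippets.append(stripped[start:m.end() + context])
    return snippets


# Nested comments: the non-greedy match ends at the first '-->',
# so the outer closer is left behind.
print(find_stray_closers('<!-- a <!-- b --> c -->', context=5))
```

The snippets give enough surrounding text to search for the same spot in the saved file.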

bobbysteel · 05-22-2018, 07:32 AM

How do I save the raw HTML easily from command line? Is there syntax to drop it in to a temp folder?
Thx

kovidgoyal · 05-22-2018, 07:35 AM

open('/path/to/tempfile.html', 'wb').write(raw_html)
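
For context, that line would go inside a preprocess_raw_html(self, raw_html, url) method on the recipe class, which has to return the HTML. A rough sketch follows, with some assumptions: `DebugRecipeSketch` is a stand-in so the example runs without calibre (a real recipe would subclass BasicNewsRecipe), the file name `nyt_raw_debug.html` is made up, and the encode step is there in case raw_html arrives as text rather than bytes.

```python
import os
import tempfile


class DebugRecipeSketch:
    """Stand-in for a recipe class; a real recipe subclasses calibre's BasicNewsRecipe."""

    def preprocess_raw_html(self, raw_html, url):
        # Assumption: raw_html may be text or bytes, so normalise to bytes
        # before writing in binary mode.
        data = raw_html if isinstance(raw_html, bytes) else raw_html.encode('utf-8')
        # Hypothetical file name in the system temp folder
        path = os.path.join(tempfile.gettempdir(), 'nyt_raw_debug.html')
        with open(path, 'wb') as f:
            f.write(data)
        # Return the HTML unchanged so the rest of the processing continues
        return raw_html
```

Note that this overwrites the same file for every article, so it is easiest to use while testing against a single article.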

bobbysteel · 05-22-2018, 07:36 AM

Just looking at a random article, the comments don't seem to be unmatched anywhere I can see. There are just lots of comments with one space in between. I can't see from the source why this regex would fail with only 7 comment tags on the page.

bobbysteel · 05-22-2018, 07:42 AM

Quote:

Originally Posted by bobbysteel

How do I save the raw HTML easily from command line? Is there syntax to drop it in to a temp folder?
Thx

Where do I drop that line? In the preprocess subroutine it fails.
