| 
			
			 | 
		#1 | 
| 
			
			
			
			 Big Poppa 
			
			Posts: 110 
				Karma: 10 
				Join Date: Jul 2010 
				
				
				
				Device: Nook 
				
				
				 | 
	
	
	
		
		
			
			 
				
				NYTimes - unclosed comment tag
			 
			
			
			Hi Kovid - 
		
	
		
		
		
		
		
		
		
		
		
		
		
			The NYTimes recipe leaves a stray --> (sometimes two) in many articles, which looks like an unmatched comment tag. I've tried replacing it in preprocess_html and postprocess_html but can't figure out how. Any ideas on the best way to remove this from articles?
Code: 
    def postprocess_html(self, soup, first_fetch):
        findcomment = soup.findAll(text = re.compile('--gt&;'))
        for comment in findcomment:
            fixed_text = unicode(comment).replace('--gt&;', '')
            comment.replace_with(fixed_text)
        return soup
Last edited by bobbysteel; 05-21-2018 at 10:15 AM.  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 creator of calibre 
			
			Posts: 45,609 
				Karma: 28549044 
				Join Date: Oct 2006 
				Location: Mumbai, India 
				
				
				Device: Various 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			The easiest way to get rid of comments is to use preprocess_regexps and simply replace them with an empty string, something like:
		
	
		
		
		
		
		
		
		
		
		
		
	
	Code: 
	preprocess_regexps = [(re.compile(r'(?s)<!--.*?-->'), lambda m: '')]  | 
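As a rough illustration of what that pattern does, here is a standalone sketch using only Python's `re` module. The sample HTML string and the `strip_comments` helper name are made up for the example; in a recipe, the same pattern and replacement pair would simply go in `preprocess_regexps`.

```python
import re

# Same pattern as above: (?s) makes '.' match newlines, and '.*?' is
# non-greedy, so each comment is matched separately.
COMMENT_RE = re.compile(r'(?s)<!--.*?-->')


def strip_comments(raw_html):
    """Return raw_html with every <!-- ... --> span removed."""
    return COMMENT_RE.sub('', raw_html)


# Hypothetical sample input, not taken from the NYT site.
sample = '<p>Hello<!-- one --> <!-- two\nlines -->world</p>'
cleaned = strip_comments(sample)
print(cleaned)
```

Note that a closing `-->` with no opener in front of it is not matched by this pattern, so it is left in place.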
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Great thanks. Shall I check this in once I've tested?
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Sure.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			I'm also getting a very occasional error from the existing recipe:
		
	
		
		
		
		
		
		
		
		
		
		
	
	Code: 
	has_supplemental = article.find(**classes('story-body-supplemental')) is not None
Code: 
	Could not fetch link https://www.nytimes.com/2018/05/18/watching/cheers-best-episodes.html
	Traceback (most recent call last):
	  File "site-packages\calibre\web\fetch\simple.py", line 520, in process_links
	  File "site-packages\calibre\web\fetch\simple.py", line 227, in get_soup
	  File "<string>", line 107, in preprocess_html
	AttributeError: 'NoneType' object has no attribute 'find'  | 
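The traceback suggests that `article` was `None` for that page. One common way to guard against this is sketched below, with assumptions: `FakeTag` is a hypothetical stand-in for a parsed tag, `has_supplemental` is wrapped in a function only for the example, and the real recipe would keep calling `article.find(**classes(...))`. This is not necessarily how the recipe itself handles it.

```python
class FakeTag(object):
    """Tiny stand-in for a parsed tag; find() returns a child or None."""

    def __init__(self, children=None):
        self.children = children or {}

    def find(self, name):
        return self.children.get(name)


def has_supplemental(article):
    """True only if article exists and contains a supplemental body."""
    if article is None:
        # The page has no article element, so there is nothing to inspect.
        return False
    return article.find('story-body-supplemental') is not None


print(has_supplemental(None))
print(has_supplemental(FakeTag({'story-body-supplemental': FakeTag()})))
print(has_supplemental(FakeTag()))
```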
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Hm, I also can't seem to get the regexp you suggested to work; the stray tags still show up in the output.
		
	
		
		
		
		
		
		
		
		
		
		
	
	Code: 
	    preprocess_regexps = [(re.compile(r'(?s)-->'), lambda m: '')]
Code: 
	<p class="calibre_2">-->--> </p>  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			That's not the regexp I suggested; you need both the opening and closing parts of the comment. If you want to debug it or do the replacing manually, you can implement preprocess_raw_html in the recipe.
		
	
		
		
		
		
		
		
		
		
		
		
	
	And I committed a fix for the AttributeError  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			I tried the version you suggested, but shortened it, since only the closing part of the comment seems to be left over, no?
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#9 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Doh ok i tried again and it worked. Silly me.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#10 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Weird. Kovid, the comment tag is coming back now. Why is only the closing tag left behind? Is there a parsing error somewhere that mishandles multi-line comments? Your regexp misses it too, even with DOTALL enabled.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#11 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Unless the NYT's markup is invalid (i.e. contains nested comments), that regexp will remove all comments. Implement preprocess_raw_html() in the recipe, save the raw HTML, and look at it in an editor. Track down where the closing comment is coming from by comparing it to the final HTML generated by the recipe.
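To make the nested-comment case concrete, the sketch below shows how a non-greedy comment regexp leaves a stray `-->` behind when one comment opener appears inside another comment. The sample string and the `stray_closers` helper are hypothetical, meant only as a quick check to run against saved raw HTML.

```python
import re

COMMENT_RE = re.compile(r'(?s)<!--.*?-->')


def stray_closers(raw_html):
    """Count the '-->' markers that survive after comments are stripped."""
    return COMMENT_RE.sub('', raw_html).count('-->')


# Hypothetical invalid markup: a comment nested inside another comment.
nested = '<p>text</p><!-- outer <!-- inner --> tail --><p>more</p>'

# The match runs from the first '<!--' to the first '-->', so the text
# after the inner closer, including the outer '-->', survives.
leftover = COMMENT_RE.sub('', nested)
print(leftover)
print(stray_closers(nested))
```

A non-zero result on the raw HTML would point at nested or otherwise unbalanced comments in the source; zero would suggest the stray closer is introduced at a later stage.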
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#12 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			How do I easily save the raw HTML from the command line? Is there syntax to drop it into a temp folder? 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Thx  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#13 | 
| 
			
			
			
			 creator of calibre 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			open('/path/to/tempfile.html', 'wb').write(raw_html)
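A slightly fuller sketch of the same idea, under some assumptions: `dump_raw_html` is a hypothetical helper name, the temp directory prefix is arbitrary, and text is encoded as UTF-8 before being written because the file is opened in binary mode.

```python
import os
import tempfile


def dump_raw_html(raw_html, out_dir=None, name='raw.html'):
    """Write raw_html to out_dir (a fresh temp dir by default); return the path."""
    if out_dir is None:
        out_dir = tempfile.mkdtemp(prefix='recipe-debug-')
    # The file is opened in binary mode, so text must be encoded first.
    if not isinstance(raw_html, bytes):
        raw_html = raw_html.encode('utf-8')
    path = os.path.join(out_dir, name)
    with open(path, 'wb') as f:
        f.write(raw_html)
    return path


# Inside a recipe this could be called from preprocess_raw_html, e.g.:
#     def preprocess_raw_html(self, raw_html, url):
#         self.log('saved raw HTML to ' + dump_raw_html(raw_html))
#         return raw_html
saved = dump_raw_html(u'<p>caf\xe9 <!-- note --></p>')
print(saved)
```

Each call without `out_dir` creates a new temp directory, so files from different articles do not overwrite each other.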
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#14 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Just looking at a random article, nothing seems to be unmatched anywhere I can see, just lots of comments with one space in between. I can't tell from the source why this regex would fail with only 7 comment tags on the page.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#15 | 
| 
			
			
			
			 Big Poppa 
			
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 