Old 05-21-2018, 09:07 AM   #1
bobbysteel
Big Poppa
bobbysteel began at the beginning.
 
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
NYTimes - unclosed comment tag

Hi Kovid -
The NYTimes recipe leaves a stray --> (sometimes two) in many articles, which looks like an unmatched comment closing tag. I've tried replacing it in both preprocess and postprocess_html but can't figure out how to do so. Any ideas on the best way to remove it from articles?
Code:
    def postprocess_html(self, soup, first_fetch):
        findcomment = soup.findAll(text = re.compile('--gt&;'))
        for comment in findcomment:
            fixed_text = unicode(comment).replace('--gt&;', '')
            comment.replace_with(fixed_text)
        return soup
This doesn't seem to work, and it returns articles containing nothing but the -->

Last edited by bobbysteel; 05-21-2018 at 09:15 AM.
Old 05-21-2018, 09:38 AM   #2
kovidgoyal
creator of calibre
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The easiest way to get rid of comments is to use preprocess_regexps and simply replace them with an empty string, something like:

Code:
preprocess_regexps = [(re.compile(r'(?s)<!--.*?-->'), lambda m: '')]
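
As a quick sanity check outside calibre, the same pattern can be run on a small string. This is only a sketch: the `sample` markup is made up, and the loop applies each `(pattern, replacement)` pair by hand with `pattern.sub`, which is assumed to be roughly what the recipe machinery does with each entry.

```python
import re

# The (pattern, replacement) pair from the recipe line above.
preprocess_regexps = [(re.compile(r'(?s)<!--.*?-->'), lambda m: '')]

# Made-up markup with a single-line and a multi-line comment.
sample = '<p>one</p><!-- a --><p>two</p><!-- b\nspans lines --><p>three</p>'

cleaned = sample
for pattern, replacement in preprocess_regexps:
    # (?s) lets '.' match newlines; '.*?' stops at the first '-->'.
    cleaned = pattern.sub(replacement, cleaned)

print(cleaned)  # <p>one</p><p>two</p><p>three</p>
```

The non-greedy `.*?` matters here: a greedy `.*` would swallow everything between the first opener and the last closer on the page.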
Old 05-21-2018, 10:03 AM   #3
bobbysteel
Great, thanks. Shall I check this in once I've tested it?
Old 05-21-2018, 10:09 AM   #4
kovidgoyal
Sure.
Old 05-21-2018, 10:20 AM   #5
bobbysteel
I'm also getting a very occasional error on the existing recipe
Code:
has_supplemental = article.find(**classes('story-body-supplemental')) is not None
is generating
Code:
Could not fetch link https://www.nytimes.com/2018/05/18/watching/cheers-best-episodes.html
Traceback (most recent call last):
  File "site-packages\calibre\web\fetch\simple.py", line 520, in process_links
  File "site-packages\calibre\web\fetch\simple.py", line 227, in get_soup
  File "<string>", line 107, in preprocess_html
AttributeError: 'NoneType' object has no attribute 'find'
Is there a way to make this not fail?
Old 05-21-2018, 10:29 AM   #6
bobbysteel
Hm, I also can't seem to get the result from the regexp you suggested.

Code:
    preprocess_regexps = [(re.compile(r'(?s)-->'), lambda m: '')]
    preprocess_regexps = [(re.compile(r'(?s)--&gt;'), lambda m: '')]
Still returns this in the final HTML:

Code:
<p class="calibre_2">--&gt;--&gt;  </p>
Old 05-21-2018, 12:01 PM   #7
kovidgoyal
That's not the regexp I suggested; you need both the opening and closing parts of the comment. If you want to debug it or do the replacing manually, you can implement preprocess_raw_html in the recipe.
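
One more thing worth checking in the snippet from the previous post: two assignments to `preprocess_regexps` in a row leave only the second in effect, and `--&gt;` is the escaped form seen in the generated output, so it may never occur in the raw page. A small sketch of that, with a made-up `raw` string and a hypothetical `apply_all` helper that applies the pairs by hand:

```python
import re

# Two assignments in a row: the second replaces the first.
preprocess_regexps = [(re.compile(r'(?s)-->'), lambda m: '')]
preprocess_regexps = [(re.compile(r'(?s)--&gt;'), lambda m: '')]

# Made-up raw markup containing a literal stray closer.
raw = '<p>text</p> --> <p>more</p>'


def apply_all(regexps, text):
    # Hypothetical helper: run every (pattern, replacement) pair in order.
    for pattern, replacement in regexps:
        text = pattern.sub(replacement, text)
    return text


# Only the '--&gt;' pattern survived, and it does not match a literal '-->'.
unchanged = apply_all(preprocess_regexps, raw)

# Listing both patterns in one list keeps both active.
both = [
    (re.compile(r'-->'), lambda m: ''),
    (re.compile(r'--&gt;'), lambda m: ''),
]
stripped = apply_all(both, raw)
```

Note that a bare `-->` pattern also eats the closer of well-formed comments, so if it is used at all it should come after the full-comment pattern in the list.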

And I committed a fix for the AttributeError
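
The committed change itself isn't quoted in this thread. As an assumption about its general shape, a guard like the one below avoids calling `.find` on `None`. `FakeTag` and this standalone `classes` are hypothetical stand-ins so the sketch runs without calibre; a real recipe would use its own `classes` helper and real soup objects.

```python
def classes(names):
    # Hypothetical stand-in for the recipe helper: builds find() keyword
    # arguments that match any tag carrying one of the given class names.
    wanted = frozenset(names.split())
    return {'attrs': {'class': lambda x: bool(x) and bool(wanted & frozenset(x.split()))}}


class FakeTag(object):
    # Hypothetical stand-in for a parsed tag; it only supports find().
    def __init__(self, child_classes=()):
        self.child_classes = list(child_classes)

    def find(self, attrs=None):
        matcher = attrs['class']
        for css in self.child_classes:
            if matcher(css):
                return css
        return None


def has_supplemental(article):
    # In the failing case article is None (the traceback shows .find
    # being called on None), so bail out before touching it.
    if article is None:
        return False
    return article.find(**classes('story-body-supplemental')) is not None
```

With this guard, a page without the expected article element is treated as having no supplemental body instead of aborting the fetch.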
Old 05-21-2018, 12:11 PM   #8
bobbysteel
I tried the version you suggested, but shortened it, since only the closing part of the comment seems to be left over, no?
Old 05-21-2018, 12:27 PM   #9
bobbysteel
Doh, OK, I tried again and it worked. Silly me.
Old 05-22-2018, 06:17 AM   #10
bobbysteel
Weird. Kovid, the comment tag is coming back now. Why is just the closing tag being missed? Is there a parsing error somewhere that somehow misses the multi-line comments? Your regexp misses it too, even with DOTALL enabled.
Old 05-22-2018, 06:24 AM   #11
kovidgoyal
Unless the NYT's markup is invalid (i.e. contains nested comments), that regexp will remove all comments. Implement preprocess_raw_html() in the recipe, save the raw HTML, and look at it in an editor. Track down where the closing comment is coming from by comparing it to the final HTML generated by the recipe.
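
To see how a nested comment could leave a stray closer behind, here is a small sketch on made-up strings (not taken from an NYT page). The non-greedy match stops at the first `-->`, so the outer comment's closer survives.

```python
import re

comment_re = re.compile(r'(?s)<!--.*?-->')

# Made-up examples: two ordinary comments, then one comment nested in another.
well_formed = '<p>a</p><!-- one --> <!-- two --><p>b</p>'
nested = '<p>a</p><!-- outer <!-- inner --> tail --><p>b</p>'

clean_result = comment_re.sub('', well_formed)
nested_result = comment_re.sub('', nested)

print(clean_result)   # <p>a</p> <p>b</p>
print(nested_result)  # <p>a</p> tail --><p>b</p>
```

If the saved raw HTML shows this kind of nesting, that would explain a leftover `-->` in the output even though every well-formed comment is removed.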
Old 05-22-2018, 06:32 AM   #12
bobbysteel
How do I save the raw HTML easily from the command line? Is there syntax to drop it into a temp folder?
Thx
Old 05-22-2018, 06:35 AM   #13
kovidgoyal
open('/path/to/tempfile.html', 'wb').write(raw_html)
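
`raw_html` is the argument of `preprocess_raw_html(self, raw_html, url)`, so that line belongs inside that method. A minimal sketch, assuming one dump file per article is wanted: `DumpRawHtml`, `dump_dir` and the hash-based file names are illustrative choices, not part of calibre, and in a real recipe the method would sit on the `BasicNewsRecipe` subclass instead of a bare class.

```python
import hashlib
import os
import tempfile


class DumpRawHtml(object):
    # Illustrative location; any writable folder will do.
    dump_dir = os.path.join(tempfile.gettempdir(), 'nyt_raw_html')

    def preprocess_raw_html(self, raw_html, url):
        if not os.path.isdir(self.dump_dir):
            os.makedirs(self.dump_dir)
        # One file per URL so later articles do not overwrite earlier ones.
        name = hashlib.md5(url.encode('utf-8')).hexdigest() + '.html'
        # The file is opened in binary mode, so text is encoded first.
        data = raw_html if isinstance(raw_html, bytes) else raw_html.encode('utf-8')
        with open(os.path.join(self.dump_dir, name), 'wb') as f:
            f.write(data)
        # Return the markup unchanged so the rest of the recipe still runs.
        return raw_html
```

The method has to return the HTML; otherwise the recipe has nothing left to process.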
Old 05-22-2018, 06:36 AM   #14
bobbysteel
Just looking at a random article, it doesn't seem to be unmatched anywhere I can see. Just lots of comments with one space in between. I can't understand from the source why this regex would fail with only 7 comment tags on the page.
Old 05-22-2018, 06:42 AM   #15
bobbysteel
Quote:
Originally Posted by bobbysteel View Post
How do I save the raw HTML easily from command line? Is there syntax to drop it in to a temp folder?
Thx
Where do I drop that line? In the preprocess subroutine it fails.