11-01-2018, 03:21 PM | #16 |
Enthusiast
Posts: 31
Karma: 32
Join Date: Jan 2012
Device: Kindle Paperwhite
|
Thanks from me as well!
|
11-02-2018, 09:42 AM | #17 |
Junior Member
Posts: 5
Karma: 10
Join Date: Mar 2012
Device: Kobo Aura H2O2, Kobo Aura
|
Thanks, Kovid! Back up and running, just in time for the Friday reviews.
|
11-04-2018, 06:01 AM | #18 |
Member
Posts: 16
Karma: 10
Join Date: Sep 2010
Device: Kindle
|
I copied the GitHub text, loaded it in Calibre, and customised the non-web edition. I seem to be getting only 3 articles for many of the sections, which is unusual for a Sunday edition. Is there any parameter I should be setting to ensure I get all articles per section? Thanks
|
11-04-2018, 08:27 AM | #19 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The new Today's Paper page of the NYT has only three articles in most sections in the HTML; the rest are loaded by JavaScript, so the recipe does not pick them up.
|
11-05-2018, 02:19 AM | #20 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
And I just committed some code to duplicate what the JavaScript is doing, so there should be more articles now. https://github.com/kovidgoyal/calibr...ef713c9070937c
|
11-06-2018, 03:55 AM | #21 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Thanks Kovid. I'm getting a lot of garbage now in the form of newsletter signups, related items, etc. Their new tag system seems to use class names in a format like css-<7 chars> <8 chars>
Is there a way to add to remove_tags a match where class matches re.compile(/css-.{7}\w.{8}/) or such? Also remove_tags_after = [dict(name=['articleBody'])] seems to be failing for me; if it worked, it would remove all the article signups. Is something wrong with that syntax? Last edited by bobbysteel; 11-06-2018 at 03:57 AM. |
11-06-2018, 04:15 AM | #22 |
Member
Posts: 16
Karma: 10
Join Date: Sep 2010
Device: Kindle
|
Great work - getting lots more articles now - thanks!
|
11-06-2018, 11:59 PM | #23 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
BeautifulSoup supports arbitrary Python functions for matching, or even regexps. Something like:
Code:
remove_tags=[dict(attrs={'class':re.compile(r'pattern')})] |
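A minimal sketch of the regex approach, applied to the class-name format described earlier in the thread. The pattern and the sample class strings are assumptions made up for illustration, not taken from the NYT markup. Only the standard-library `re` module is used, so the pattern can be checked on its own before it goes into a recipe's `remove_tags`.

```python
import re

# Assumed pattern: 'css-' followed by 7 word characters, a space, then an
# 8-character second class name. Adjust to whatever the page really uses.
GENERATED_CLASS = re.compile(r'^css-\w{7} \w{8}$')

# Hypothetical class attribute values, invented for this example.
samples = [
    'css-1a2b3c4 e1abcd23',  # generated pair: should match
    'css-1a2b3c4',           # only one class: should not match
    'story-body',            # ordinary class: should not match
]

# Keep only the attribute values the pattern recognises.
matches = [s for s in samples if GENERATED_CLASS.search(s)]

# In a recipe, the compiled pattern is passed as the class matcher,
# following the form shown in the post above.
remove_tags = [dict(attrs={'class': GENERATED_CLASS})]
```

Whether the matcher sees the whole class attribute string or each class separately may depend on the BeautifulSoup version in use, so it is worth testing the pattern against a real downloaded page.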
11-07-2018, 07:03 AM | #24 | |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Quote:
Code:
remove_tags_after = [dict(name=['articleBody'])] |
|
11-07-2018, 11:05 PM | #25 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
IIRC remove_tags_after needs to be a single dictionary, not a list of dictionaries.
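A short sketch of the single-dictionary form. The `attrs` variant is an assumption: in BeautifulSoup-style specs, `name` refers to the tag name, so if `articleBody` is the value of a `name` attribute rather than a tag called `<articleBody>`, it would probably need to go under `attrs`. The class below is a plain stand-in; a real recipe subclasses calibre's `BasicNewsRecipe`.

```python
class RecipeSketch:  # hypothetical stand-in for a BasicNewsRecipe subclass
    # A single dictionary rather than a list of dictionaries.
    # Assumption: 'articleBody' is a name="..." attribute value, not a tag name.
    remove_tags_after = dict(attrs={'name': 'articleBody'})
```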
|
11-08-2018, 03:43 AM | #26 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Is it just me, or do all the headers now randomly mismatch? Each run I get a different selection of articles under each header, seemingly at random.
|
11-08-2018, 03:56 AM | #27 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Yes, retesting with a fresh install on a clean VM, it's definitely:
1) totally random in the order of article placement
2) the headings don't match up with the articles whatsoever
Each subsequent run produces a totally different order of articles. From what I can tell, the articles are all being downloaded, but the logic to assign the heading to the id from the JSON is off somehow. I can't easily work out why by looking at the code, however, or else I'd submit a PR. |
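For illustration only, here is a sketch of what a deterministic heading-to-id assignment could look like. It is not the recipe's actual code, and the data layout is hypothetical, since the real JSON is not shown in this thread. The idea is to walk the sections as an ordered list and look each article up by id, so every run yields the same order and the same heading for each article.

```python
# Hypothetical data: sections listing article ids, plus articles keyed by id.
sections = [
    {'name': 'Front Page', 'article_ids': ['a1', 'a3']},
    {'name': 'Business', 'article_ids': ['a2']},
]
articles = {
    'a1': {'title': 'First story'},
    'a2': {'title': 'Second story'},
    'a3': {'title': 'Third story'},
}


def build_index(sections, articles):
    """Return [(section name, [article, ...]), ...] in page order."""
    index = []
    for section in sections:  # a list, so iteration order is stable
        # Look up each id; skip ids with no downloaded article.
        found = [articles[i] for i in section['article_ids'] if i in articles]
        if found:
            index.append((section['name'], found))
    return index


index = build_index(sections, articles)
```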
11-08-2018, 05:23 AM | #28 |
creator of calibre
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That should take care of it; I don't have the time to actually run it and test, however.
https://github.com/kovidgoyal/calibr...cbfcdfe23707e2 |
11-08-2018, 03:06 PM | #29 |
Big Poppa
Posts: 110
Karma: 10
Join Date: Jul 2010
Device: Nook
|
Passes the bobbysteel regressions with flying colours. Thanks for this, Kovid!
|