"The New York Times" recipe fails - Page 2

nelson1379 · 11-01-2018, 03:21 PM

Thanks from me as well!

EMSBoys · 11-02-2018, 09:42 AM

Thanks, Kovid! Back up and running, just in time for the Friday reviews.

BillD · 11-04-2018, 06:01 AM

I copied the recipe text from GitHub, loaded it in Calibre, and customised the non-web edition. I seem to be getting only 3 articles for many of the sections, which is unusual for a Sunday edition. Is there a parameter I should set to ensure I get all articles per section? Thanks

kovidgoyal · 11-04-2018, 08:27 AM

The new Today's Paper page of the NYT has only three articles in most sections in the HTML; the rest are loaded by JavaScript, so the recipe does not pick them up.

kovidgoyal · 11-05-2018, 02:19 AM

And I just committed some code to duplicate what the JavaScript is doing, so there should be more articles now. https://github.com/kovidgoyal/calibr...ef713c9070937c

bobbysteel · 11-06-2018, 03:55 AM

Thanks Kovid. I'm getting a lot of garbage now in the form of newsletter signups, related items, etc. Their new tag system seems to use class names of the form css-<7 chars> <8 chars>.
Is there a way to add to remove_tags a match where class matches re.compile(/css-.{7}\w.{8}/) or such?

Also, remove_tags_after = [dict(name=['articleBody'])], which should remove all the article signups, seems to be failing for me. Is something wrong with that syntax?

BillD · 11-06-2018, 04:15 AM

Great work - getting lots more articles now - thanks!

kovidgoyal · 11-06-2018, 11:59 PM

BeautifulSoup supports arbitrary Python functions for matching, or even regexps. Something like:

Code:

remove_tags=[dict(attrs={'class':re.compile(r'pattern')})]
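
As a rough illustration of the regexp approach, the sketch below checks a candidate pattern against a few class strings using only the standard `re` module. The class names and the pattern are made up for illustration (they follow the css-<7 chars> <8 chars> shape described earlier in the thread), and it assumes the compiled pattern is applied search-style to the class attribute text, so check it against real NYT markup before relying on it.

```python
import re

# Hypothetical pattern: "css-" plus 7 word characters, whitespace, then 8 word characters.
# The real NYT class names may differ; adjust after inspecting the page source.
GENERATED_CLASS = re.compile(r'css-\w{7}\s\w{8}')

# Made-up class attribute values, for illustration only.
samples = [
    'css-1a2b3c4 e1x2y3z4',   # generated-looking pair, expected to match
    'css-abc',                # too short, expected not to match
    'story-body',             # ordinary class, expected not to match
]

# Keep only the class strings the pattern finds a match in.
matched = [s for s in samples if GENERATED_CLASS.search(s)]

# In a recipe, the same compiled pattern would be placed in remove_tags.
remove_tags = [dict(attrs={'class': GENERATED_CLASS})]
```

A pattern this broad may also match tags that hold the article text itself, so it may need narrowing once the unwanted blocks have been identified.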

bobbysteel · 11-07-2018, 07:03 AM

Quote:

Originally Posted by kovidgoyal

BeautifulSoup supports arbitrary Python functions for matching, or even regexps. Something like:

Code:

remove_tags=[dict(attrs={'class':re.compile(r'pattern')})]

Thanks, that works. But I'm still having a problem with remove_tags_after:

Code:

remove_tags_after = [dict(name=['articleBody'])]

Is something wrong with that, such that it wouldn't drop the sections after <section name='articleBody'>?

kovidgoyal · 11-07-2018, 11:05 PM

IIRC remove_tags_after needs to be a single dictionary, not a list of dictionaries.
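
A minimal sketch of the single-dictionary form, written as it might appear in a recipe class body. It also assumes that the `name` key in these dictionaries selects the tag name (as in the usual remove_tags examples), so matching <section name='articleBody'> would need the HTML `name` attribute placed under `attrs`. This has not been tested against the NYT recipe.

```python
# Single dictionary rather than a list of dictionaries.
# 'name' is the tag name; the HTML attribute name="articleBody" goes under attrs.
remove_tags_after = dict(name='section', attrs={'name': 'articleBody'})
```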

bobbysteel · 11-08-2018, 03:43 AM

Is it just me, or do all the headers now randomly mismatch? Each run I get a different selection of articles under each header, seemingly at random.

bobbysteel · 11-08-2018, 03:56 AM

Yes, retesting with a fresh install on a clean VM, it's definitely:
1) totally random in the order of article placement
2) the headings don't match up with the articles whatsoever

Each subsequent run produces a totally different order of articles. From what I can tell, the articles are all being downloaded, but the logic that assigns the heading to the id from the JSON is off somehow. I can't easily work out the cause by looking at the code, however, or else I'd submit a PR.
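
For anyone trying to track this kind of symptom down, here is a small sketch of one way headings and articles can be kept aligned: walk the sections in the order they are listed and look each article up by id, rather than iterating an unordered mapping. The JSON shape, the field names (`sections`, `article_ids`, `articles`) and the helper `group_articles` are all hypothetical; none of them are taken from the recipe or from the actual NYT data.

```python
import json

# Hypothetical payload; the real NYT JSON is structured differently.
payload = json.loads('''
{
  "sections": [
    {"name": "Front Page", "article_ids": ["a2", "a1"]},
    {"name": "Sports", "article_ids": ["a3"]}
  ],
  "articles": {
    "a1": {"title": "First"},
    "a2": {"title": "Second"},
    "a3": {"title": "Third"}
  }
}
''')


def group_articles(data):
    """Return [(section_name, [titles])] in the order given by data['sections']."""
    feeds = []
    for section in data['sections']:  # a list, so the order is stable between runs
        # Resolve each id through the lookup table so a title stays with its section.
        titles = [data['articles'][i]['title'] for i in section['article_ids']]
        feeds.append((section['name'], titles))
    return feeds


feeds = group_articles(payload)
```

Because the section order comes from a list and each article is resolved by id, repeated runs over the same payload give the same headings and the same article order.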

kovidgoyal · 11-08-2018, 05:23 AM

That should take care of it; I don't have the time to actually run it and test, however.

https://github.com/kovidgoyal/calibr...cbfcdfe23707e2

bobbysteel · 11-08-2018, 03:06 PM

Passes the bobbysteel regressions with flying colours

thanks for this Kovid!


