05-03-2018, 02:17 PM | #1 |
Connoisseur
Posts: 82
Karma: 10
Join Date: Dec 2015
Device: Kindle
|
Request for new recipe: The Federalist
Attached are two versions of a new recipe for The Federalist. I would appreciate help with the issues below.
The base version, with auto cleanup turned off, returns all the articles with clean text, including inline images, but no author or date. Images are difficult to test because the feed has a short list of articles, and on some days no article has any inline images, just the picture at the top of the heading. In the second version, called Test, I tried keep tags and remove tags to get the author and date. The date appears for all articles, but the output format is odd [dd yyyy ^p mmm]. The author appears alongside the date in the mobi for articles where the author byline appears inline with the article title and text on the web page. Some articles have the author in a left-hand sidebar, and for that type I could not figure out how to specify the tags. Is it possible, from either of these recipes, to get the author, the date and possibly the images? Thanks in advance. |
05-03-2018, 10:22 PM | #2 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
You seem to have forgotten to attach the recipes.
|
05-04-2018, 08:52 AM | #3 |
Connoisseur
Posts: 82
Karma: 10
Join Date: Dec 2015
Device: Kindle
|
Recipes now attached.
|
05-04-2018, 11:16 AM | #4 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Your first recipe is getting the content from the RSS feed, so it will only contain whatever content is in the RSS feed. The second recipe gets it from the actual web pages. I'm not sure what you are trying to do with the huge number of keep/remove tag specifications. You shouldn't need more than 3-4 in keep_only_tags. Does this website have different formatting for different article pages? If so, it would help if you posted links to a few of these pages.
|
05-04-2018, 11:24 AM | #5 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I took a quick look, and this is what I came up with:
Code:
from calibre.web.feeds.news import BasicNewsRecipe


def classes(classes):
    # Build a matcher for elements carrying any of the space-separated class names
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


class AdvancedUserRecipe1502348373(BasicNewsRecipe):
    title = 'The Federalist'
    oldest_article = 7
    max_articles_per_feed = 100
    no_stylesheets = True
    encoding = 'utf-8'
    # Download the linked web pages instead of using the text embedded in the feed
    use_embedded_content = False
    remove_attributes = ['xmlns', 'lang', 'style', 'width', 'height']

    keep_only_tags = [
        classes('entry-header'),
        classes('wp-post-image post-categories entry-content shortbio'),
    ]

    feeds = [
        ('All', 'http://thefederalist.com/feed/'),
    ]
|
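For anyone reading along, here is a small standalone sketch of what the classes() helper in the recipe above does. It runs without calibre. The sample class strings 'entry-content clearfix' and 'sidebar widget' are made up for illustration and are not taken from the site.

```python
def classes(classes):
    # Same helper as in the recipe: match any of the space-separated class names
    q = frozenset(classes.split(' '))
    return dict(
        attrs={'class': lambda x: x and frozenset(x.split()).intersection(q)}
    )


# The value stored under attrs['class'] is a predicate over an element's
# raw class attribute string (or None when the element has no class).
matcher = classes('wp-post-image post-categories entry-content shortbio')['attrs']['class']

print(bool(matcher('entry-content clearfix')))  # shares 'entry-content' with the query
print(bool(matcher('sidebar widget')))          # shares no class name with the query
print(bool(matcher(None)))                      # element without a class attribute
```

So an element is kept when it has at least one of the listed class names, regardless of what other classes it carries, which is why a short list in keep_only_tags is usually enough.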
05-06-2018, 03:17 PM | #6 |
Connoisseur
Posts: 82
Karma: 10
Join Date: Dec 2015
Device: Kindle
|
I tweaked your version, and now the byline (author and date) seems to appear for all articles. There are two different article styles when I open the articles from the links on the RSS feed page, and the byline is coded and displayed differently in each.
Revised recipe and mobi output attached (two files). Please note that for some articles the article picture appears in the middle of the byline in the mobi. This is not a big issue, as the info is all there. One of the two article formats has a right-hand sidebar with 'Most Popular' and 'Related Posts' sections. Your version eliminates the text, but the photos in those boxes still appear in the mobi output even though they do not belong to the article itself. I tried to get them out with remove tags but could not. Could those images be eliminated? An example of that article format is linked below - http://thefederalist.com/2018/05/04/...f-destruction/ Thanks! |
05-06-2018, 09:52 PM | #7 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
05-09-2018, 10:00 AM | #8 |
Connoisseur
Posts: 82
Karma: 10
Join Date: Dec 2015
Device: Kindle
|
Kovid, that revision worked.
Now just one more item, please. Some articles include images inline with the content text, and these are not being captured in the mobi output. On some days the feed does not have any such articles. Here is an example article in today's feed - http://thefederalist.com/2018/05/09/...ng-oil-prices/ And here is what the HTML looks like for one of the images - <img class="aligncenter wp-image-182063" src="http://thefederalist.com/wp-content/uploads/2018/05/Lima5.8.c.jpg" alt="" data-portal-copyright="The Federalist" srcset="http://thefederalist.com/wp-content/uploads/2018/05/Lima5.8.c.jpg 955w, http://thefederalist.com/wp-content/....c-300x218.jpg 300w, http://thefederalist.com/wp-content/....c-768x558.jpg 768w, http://thefederalist.com/wp-content/....c-372x270.jpg 372w, http://thefederalist.com/wp-content/....c-200x145.jpg 200w, http://thefederalist.com/wp-content/....c-294x214.jpg 294w" sizes="(max-width: 600px) 100vw, 600px" style="display: block;" data-lazy-loaded="true" width="600" height="436"> Thanks. |
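As a side note on the srcset attribute quoted above: the thread does not record which change fixed the missing images, so the following is only an illustrative sketch of how such an attribute can be read, not the method used in the final recipe. The function names parse_srcset and widest are hypothetical, and the example.com URLs are placeholders standing in for the truncated ones in the post.

```python
def parse_srcset(srcset):
    """Split a srcset value into (url, width) pairs.

    Simplification: candidates are split on commas, so URLs that
    themselves contain commas are not handled. Candidates without
    a width descriptor such as '300w' are skipped.
    """
    candidates = []
    for part in srcset.split(','):
        fields = part.split()
        if len(fields) == 2 and fields[1].endswith('w') and fields[1][:-1].isdigit():
            candidates.append((fields[0], int(fields[1][:-1])))
    return candidates


def widest(srcset):
    """Return the URL of the widest candidate, or None if there is none."""
    candidates = parse_srcset(srcset)
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]


# Placeholder value shaped like the attribute in the post
sample = ('http://example.com/img.jpg 955w, '
          'http://example.com/img-300x218.jpg 300w, '
          'http://example.com/img-768x558.jpg 768w')

print(parse_srcset(sample))
print(widest(sample))
```

Each srcset entry is a URL followed by the width of that rendition, so the entry with the largest width is the full-size image.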
05-09-2018, 10:24 AM | #9 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
05-11-2018, 12:37 PM | #10 |
Connoisseur
Posts: 82
Karma: 10
Join Date: Dec 2015
Device: Kindle
|
Perfect. Thanks.
|