09-13-2023, 08:47 AM | #1 |
Connoisseur
Posts: 90
Karma: 50742
Join Date: Jan 2011
Device: PW5
|
Support for HTTP 308 redirects
The recipe fails when the URL responds with an HTTP 308.
Sample recipe below.
Code:
from calibre.web.feeds.news import BasicNewsRecipe


class Http308RedirectRecipe(BasicNewsRecipe):
    title = "Http 308 Redirect Recipe"
    language = "en"

    def parse_index(self):
        return [
            (
                "Example",
                [
                    {
                        "url": "https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530",
                        "title": "This url responds with a HTTP 308 redirect",
                    }
                ],
            ),
        ]
Code:
Fetching https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530
Could not fetch link https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530
Traceback (most recent call last):
  File "calibre/web/fetch/simple.py", line 278, in fetch_url
  File "mechanize/_mechanize.py", line 241, in open_novisit
  File "mechanize/_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 308: Permanent Redirect
Code:
$ curl -A 'Mozilla/5.0' -Ii 'https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530'
HTTP/2 308
content-length: 0
location: https://www.wsj.com/politics/mccarthy-biden-impeachment-inquiry-b9cc6530
date: Wed, 13 Sep 2023 12:46:09 GMT
x-proxy-cache: BYPASS
x-cache: Miss from cloudfront
via: 1.1 53b2bbb13e5db590d598ee4e9aa9bd80.cloudfront.net (CloudFront)
x-amz-cf-pop: HKG62-C2
x-amz-cf-id: mtOFJ7gK44Weycg7uPApuysAAakLzPg2kojmOCpiNpVi_Yk1bP7qZg==
|
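For illustration, here is a minimal standard-library sketch of what following the redirect in the curl output above involves: read the status code, and if it is a redirect (308 included), resolve the `location` header against the requested URL. The helper names `parse_head_response` and `redirect_target` are hypothetical and are not part of calibre or mechanize; this only shows the idea, not how either project implements it.

```python
from urllib.parse import urljoin

# Status codes that tell a client to retry at the URL in the Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def parse_head_response(raw):
    """Split `curl -I` style output into (status_code, lowercase-keyed headers)."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    status = int(lines[0].split()[1])  # "HTTP/2 308" -> 308
    headers = {}
    for ln in lines[1:]:
        name, _, value = ln.partition(":")  # split on the first colon only
        headers[name.strip().lower()] = value.strip()
    return status, headers


def redirect_target(request_url, status, headers):
    """Return the URL to follow, or None if the response is not a redirect."""
    if status not in REDIRECT_CODES or "location" not in headers:
        return None
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(request_url, headers["location"])


# Abbreviated copy of the response headers from the post above.
RAW = """HTTP/2 308
content-length: 0
location: https://www.wsj.com/politics/mccarthy-biden-impeachment-inquiry-b9cc6530
"""

REQUESTED = "https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530"

if __name__ == "__main__":
    status, headers = parse_head_response(RAW)
    print(status, redirect_target(REQUESTED, status, headers))
```

A client that does not list 308 among its redirect codes would instead surface the response as an error, which matches the traceback above.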
09-13-2023, 08:55 AM | #2 |
creator of calibre
Posts: 43,966
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Your recipe worked fine for me in current calibre, and the recipe system most definitely handles redirects. From the use of curl, I am guessing you are on Linux. Don't use whatever calibre package your distro ships; use the official binaries.
|
09-13-2023, 09:15 AM | #3 |
Connoisseur
Posts: 90
Karma: 50742
Join Date: Jan 2011
Device: PW5
|
Apologies for cutting too much of the log. I'm running the official 6.26.0 on macOS.
I dug around, and it looks like this happens because calibre is pinned to mechanize v0.4.7, but support for HTTP 308 redirects is only available from v0.4.8 (commit).
Code:
$ ebook-convert 'http308.recipe' .epub --test --debug-pipeline debug -vv
Conversion options changed from defaults:
  test: (2, 2)
  debug_pipeline: 'debug'
  verbose: 2
Resolved conversion options
calibre version: 6.26.0
{'asciiize': False,
 'author_sort': None,
 'authors': None,
 'base_font_size': 0,
 'book_producer': None,
 'change_justification': 'original',
 'chapter': None,
 'chapter_mark': 'pagebreak',
 'comments': None,
 'cover': None,
 'debug_pipeline': 'debug',
 'dehyphenate': True,
 'delete_blank_paragraphs': True,
 'disable_font_rescaling': False,
 'dont_download_recipe': False,
 'dont_split_on_page_breaks': True,
 'duplicate_links_in_toc': False,
 'embed_all_fonts': False,
 'embed_font_family': None,
 'enable_heuristics': False,
 'epub_flatten': False,
 'epub_inline_toc': False,
 'epub_max_image_size': 'none',
 'epub_toc_at_end': False,
 'epub_version': '2',
 'expand_css': False,
 'extra_css': None,
 'extract_to': None,
 'filter_css': None,
 'fix_indents': True,
 'flow_size': 260,
 'font_size_mapping': None,
 'format_scene_breaks': True,
 'html_unwrap_factor': 0.4,
 'input_encoding': None,
 'input_profile': <calibre.customize.profiles.InputProfile object at 0x118f1fbe0>,
 'insert_blank_line': False,
 'insert_blank_line_size': 0.5,
 'insert_metadata': False,
 'isbn': None,
 'italicize_common_cases': True,
 'keep_ligatures': False,
 'language': None,
 'level1_toc': None,
 'level2_toc': None,
 'level3_toc': None,
 'line_height': 0,
 'linearize_tables': False,
 'lrf': False,
 'margin_bottom': 5.0,
 'margin_left': 5.0,
 'margin_right': 5.0,
 'margin_top': 5.0,
 'markup_chapter_headings': True,
 'max_toc_links': 50,
 'minimum_line_height': 120.0,
 'no_chapters_in_toc': False,
 'no_default_epub_cover': False,
 'no_inline_navbars': False,
 'no_svg_cover': False,
 'output_profile': <calibre.customize.profiles.OutputProfile object at 0x118f1ead0>,
 'page_breaks_before': None,
 'prefer_metadata_cover': False,
 'preserve_cover_aspect_ratio': False,
 'pretty_print': True,
 'pubdate': None,
 'publisher': None,
 'rating': None,
 'read_metadata_from_opf': None,
 'remove_fake_margins': True,
 'remove_first_image': False,
 'remove_paragraph_spacing': False,
 'remove_paragraph_spacing_indent_size': 1.5,
 'renumber_headings': True,
 'replace_scene_breaks': '',
 'search_replace': None,
 'series': None,
 'series_index': None,
 'smarten_punctuation': False,
 'sr1_replace': '',
 'sr1_search': '',
 'sr2_replace': '',
 'sr2_search': '',
 'sr3_replace': '',
 'sr3_search': '',
 'start_reading_at': None,
 'subset_embedded_fonts': False,
 'tags': None,
 'test': (2, 2),
 'timestamp': None,
 'title': None,
 'title_sort': None,
 'toc_filter': None,
 'toc_threshold': 6,
 'toc_title': None,
 'transform_css_rules': None,
 'transform_html_rules': None,
 'unsmarten_punctuation': False,
 'unwrap_lines': True,
 'use_auto_toc': False,
 'verbose': 2}
1% Converting input to HTML...
InputFormatPlugin: Recipe Input running
Using custom recipe
Using user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36
1% Fetching feeds...
1% Got feeds from index page
1% Trying to download cover...
1% Generating masthead...
Synthesizing mastheadImage
1% Starting download [4 threads]...
Fetching https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530
Could not fetch link https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530
Traceback (most recent call last):
  File "calibre/web/fetch/simple.py", line 278, in fetch_url
  File "mechanize/_mechanize.py", line 241, in open_novisit
  File "mechanize/_mechanize.py", line 313, in _mech_open
mechanize._response.get_seek_wrapper_class.<locals>.httperror_seek_wrapper: HTTP Error 308: Permanent Redirect

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "calibre/web/fetch/simple.py", line 536, in process_links
  File "calibre/web/fetch/simple.py", line 283, in fetch_url
calibre.web.fetch.simple.FetchError: Permanent Redirect
https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530 saved to
Failed to download article: HTTP 308 Direct from https://www.wsj.com/articles/mccarthy-biden-impeachment-inquiry-b9cc6530
Traceback (most recent call last):
  File "calibre/utils/threadpool.py", line 99, in run
  File "calibre/web/feeds/news.py", line 1195, in fetch_article
  File "calibre/web/feeds/news.py", line 1190, in _fetch_article
Exception: Could not fetch article. The debug traceback is available earlier in this log
|
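The version reasoning in this post (pinned at 0.4.7, support from 0.4.8) can be sketched as a small check. This is only an illustration: the threshold comes from the post above, the helper names are hypothetical, and `importlib.metadata` may not see the mechanize copy bundled inside calibre's official binaries, in which case the lookup simply returns `None`.

```python
import re
from importlib import metadata

# Per the post above: HTTP 308 handling is available from mechanize 0.4.8.
MIN_308_VERSION = (0, 4, 8)


def version_tuple(text):
    """Turn '0.4.7' into (0, 4, 7), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def handles_308(version_text):
    """True if this mechanize version is at or past the 0.4.8 threshold."""
    return version_tuple(version_text) >= MIN_308_VERSION


def installed_mechanize_version():
    """Return the installed mechanize version string, or None if not found."""
    try:
        return metadata.version("mechanize")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_mechanize_version()
    if found is None:
        print("mechanize not visible to this Python environment")
    else:
        print(found, "handles 308:", handles_308(found))
```

Comparing tuples rather than strings matters here, since a plain string comparison would rank "0.4.10" below "0.4.8".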
09-13-2023, 09:19 AM | #4 |
creator of calibre
Posts: 43,966
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
09-15-2023, 11:38 AM | #5 |
Connoisseur
Posts: 50
Karma: 10
Join Date: Oct 2018
Device: kindle
|
I'm having the same issue. I use the latest calibre binary (6.26.0) on Linux but still get the same error, "HTTP Error 308: Permanent Redirect", when converting the WSJ recipe.
|
09-15-2023, 12:13 PM | #6 |
creator of calibre
Posts: 43,966
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yes, why wouldn't you, since the fix has not yet been released?
|
09-16-2023, 02:22 AM | #7 |
Evangelist
Posts: 459
Karma: 82692
Join Date: May 2021
Device: kindle
|
Looks like WSJ articles won't be loading text even with the redirect fix. The AMP versions of the links have stopped loading content.
Maybe we are back to trying this: https://www.mobileread.com/forums/sh...4&postcount=17
EDIT: Oh wait, is it going to start working from the next update, after the redirect is fixed? https://github.com/unkn0w7n/calibre/...f7c539a973af9f
Last edited by unkn0wn; 09-16-2023 at 02:59 AM.
|
09-16-2023, 03:29 AM | #8 |
creator of calibre
Posts: 43,966
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I am still getting some content in WSJ free. I don't subscribe, so I can't try the main recipe. See https://github.com/kovidgoyal/calibr...5305f9a1b4e721
|
09-16-2023, 03:30 AM | #9 | |
creator of calibre
Posts: 43,966
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Quote:
|
|
09-17-2023, 06:08 PM | #10 |
Junior Member
Posts: 5
Karma: 10
Join Date: Feb 2017
Device: Kindle Voyage
|
|
09-17-2023, 09:49 PM | #11 |
creator of calibre
Posts: 43,966
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
No, no one has sent me the creds. PM them to me.
|
09-18-2023, 01:46 AM | #12 |
Connoisseur
Posts: 50
Karma: 10
Join Date: Oct 2018
Device: kindle
|
|
09-18-2023, 02:39 AM | #13 |
creator of calibre
Posts: 43,966
Karma: 22669822
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
I tested it, and sadly only the headlines, hero image and first paragraph are present. As I said before, the rest of the content is transmitted encrypted and decrypted on the client. Looks like the AMP loophole @unknown found no longer works.
|
|