Blank pages (empty articles) in custom recipe

xiatian · 10-19-2018, 10:21 PM

Hi, guys
I've written a recipe (inherited from BasicNewsRecipe) to fetch some articles online, but when I converted the recipe to an ebook, I only got titles and links, with no content at all. After searching for a while, it seemed that I should set user_agent in "get_browser". That partly solved the problem, but some articles are still empty. Any ideas?
Thank you!

kovidgoyal · 10-20-2018, 12:06 AM

are you using auto_cleanup? If so try turning it off.

xiatian · 10-20-2018, 10:41 AM

No, I didn't use auto_cleanup. My test custom recipe is in the attachment.
It will ask you to input an article link. Please use this one: http://www.theworldin.com/edition/20...endulum-swings. You will probably get an empty article, although links from other sites seem to work.

kovidgoyal · 10-20-2018, 09:36 PM

The server you are contacting is failing, probably because it needs some cookies set or something similar. Add this to your recipe to check:

Code:

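    def preprocess_raw_html(self, html, url):
        # Debugging aid: save the HTML exactly as received from the server.
        # Each article overwrites the file, so test with a single article.
        with open('/t/raw.html', 'wb') as f:
            f.write(html.encode('utf-8'))
        return html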

change the '/t/raw.html' above to some path on your computer and open the resulting raw.html after the download to see what actual html the server is sending.

xiatian · 10-21-2018, 01:30 AM

I got this raw html:

Quote:

What happened?

kovidgoyal · 10-21-2018, 01:32 AM

only the person running the server can tell you that.

kovidgoyal · 10-21-2018, 01:33 AM

most likely it is using javascript to load content
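
One rough way to test that guess against the dumped raw.html is to count the script tags and the amount of visible text in the file. The sketch below is only a heuristic, not part of calibre: the names PageStats and page_stats are made up for illustration, and it is meant to be run as a standalone Python 3 script outside the recipe.

```python
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Count <script> tags and visible text characters in an HTML page."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self.script_tags += 1
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore JavaScript and CSS source; count only text a reader would see.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def page_stats(html):
    """Return (number of script tags, number of visible text characters)."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    return parser.script_tags, parser.text_chars
```

Calling page_stats(open('/t/raw.html', encoding='utf-8').read()) on the dump gives the two counts. Several script tags with almost no visible text would be consistent with the content being loaded by javascript, though it does not prove it.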

xiatian · 10-21-2018, 01:37 AM

If so, is there no way to work around this?

kovidgoyal · 10-21-2018, 01:37 AM

no easy way. you would basically need to figure out what requests the javascript is making to load the actual content and make those requests manually in the recipe.
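
As an illustration of the idea only: if the browser's network inspector shows that the page's javascript fetches the article as JSON, the recipe could request that JSON itself and turn it into HTML. The helper below is a minimal sketch of the conversion step. The function name and the 'title' and 'paragraphs' keys are hypothetical, since the real endpoint and payload layout depend entirely on the site.

```python
import json
from html import escape


def article_html_from_json(raw_json, title_key='title', paragraphs_key='paragraphs'):
    """Build a simple HTML page from a JSON article payload.

    The key names are hypothetical; use whatever the site's API returns.
    """
    data = json.loads(raw_json)
    title = escape(str(data.get(title_key, '')))
    paragraphs = data.get(paragraphs_key) or []
    # Escape each paragraph so stray '<' or '&' cannot break the markup.
    body = ''.join('<p>{}</p>'.format(escape(str(p))) for p in paragraphs)
    return (
        '<html><head><title>{0}</title></head>'
        '<body><h1>{0}</h1>{1}</body></html>'
    ).format(title, body)
```

Inside a recipe, one option would be to fetch the JSON from preprocess_raw_html with something like self.browser.open(api_url).read(), where api_url is whatever address the javascript was seen requesting, and return the HTML built from it in place of the empty page.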

xiatian · 10-21-2018, 01:48 AM

Can calibre support Selenium to fetch web pages so that I can work around js?

xiatian · 10-21-2018, 02:22 AM

I think it would be great if get_browser supports selenium. Is this possible?

kovidgoyal · 10-21-2018, 02:22 AM

no, I'm afraid not.
