Download web page and turn it into EPUB?

Shohreh · 10-11-2024, 02:10 AM

Hello,

I'd like to download a web page that's too long to read on a computer, and have it turned into an EPUB file.

Neither Pandoc, Calibre, nor mutool work, either because of wrong layout, wrong characters (ligatures at least), or even "Couldn't render this page".

I assumed turning HTML into EPUB (where pages are actually HTML) would be a breeze… but it looks more involved than expected.

Does someone know of a reliable, no-brainer solution (for Windows, CLI and/or GUI)?

Thank you.

Karellen · 10-11-2024, 02:36 AM

You should be able to right-click on the start of the text and select Inspect.
Then scroll through the <div> elements until you reach the one that highlights all the text in blue.
Then context menu > Copy > Inner HTML.

Shohreh · 10-11-2024, 02:38 AM

Thanks. Is there no easier solution?

Google didn't help me find the right options to tell wget to download a web page along with everything required to turn it into a readable EPUB file, even for a no-frills, single-column web page.

Karellen · 10-11-2024, 03:10 AM

There probably is, but I am not aware of one.

Turtle91 · 10-11-2024, 05:24 AM

It’s not necessarily easier in the short run, but you can learn Python (a very easy, straightforward programming language) and write a short spider/scraper that grabs the HTML from all the pages and puts it in a text file. Then you just need to do some simple massaging to format it as an EPUB.

In the long run you will know python and be the ruler of your universe!!

In the medium run make sure you follow copyright restrictions and/or have the website/author’s permission before you scrape those pages.

Shohreh · 10-11-2024, 05:28 AM

I know Python, but surely applications like wget etc. can download a web page and its resources (CSS, JPG/PNG) before feeding them into Pandoc etc. to get an EPUB?

JSWolf · 10-11-2024, 05:43 AM

What is the URL, so that maybe someone can have a go at it?

Turtle91 · 10-11-2024, 05:48 AM

The issue is parsing the html to get just the book and not all the fluff/ads. Those ads are likely what is causing the issues. Soup and the script can do all the scraping and 99% of the massaging to output a text file with all the book contents and associated html tags. Then just copy/paste the contents of the output file into pandoc/sigil/calibre for final epub massaging.
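For illustration only, here is a rough sketch of the "keep just the content" step. It is not the program described above: it swaps Beautiful Soup for Python's standard-library html.parser so that it runs without extra installs, and it assumes the text you want sits inside a single container element (an <article> tag here, which is purely an assumption; many sites use a <div> with some class instead). The names ContainerExtractor and extract_container are made up for this example.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ContainerExtractor(HTMLParser):
    """Collect the inner HTML of the first element with the given tag name."""

    def __init__(self, container="article"):
        # Keep entities such as &amp; untouched so the output stays valid markup.
        super().__init__(convert_charrefs=False)
        self.container = container
        self.depth = 0       # 0 means "outside the container"
        self.done = False    # True once the container has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if tag == self.container:
                self.depth = 1
            return
        if tag not in VOID:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag in VOID:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&#{name};")


def extract_container(html, container="article"):
    """Return the inner HTML of the first <container> element, or ''."""
    parser = ContainerExtractor(container)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Calling extract_container(page_html) on a saved page returns only the markup inside the container, which is then what you would paste into Sigil or Calibre. Real pages with unbalanced tags may confuse the depth counting, which is one reason a more forgiving parser like Beautiful Soup is the usual choice.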

I wrote a program to do all that as a project to learn python and made a gui for it. That was fun! However, there aren’t any websites that I’m aware of which allow its use. You are pretty much restricted to converting your own webpage to an epub.

Shohreh · 10-11-2024, 05:52 AM

Any web page will do, it's not a specific page.

Ads are not the problem. The problem is 1) getting the CSS and pictures, and 2) turning those into a working EPUB.

For different reasons (chopped page, wrong characters, "Couldn't render this page" in one section), neither pandoc, Calibre nor Sigil worked.

Surely, I'm not the first person to want to turn a long web article into an EPUB file to read on an e-reader.
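For anyone wondering why this is more involved than it looks: an EPUB is a ZIP archive with a few mandatory files around the XHTML. The sketch below builds a bare-bones single-chapter EPUB 3 with only the Python standard library. The function name build_epub and the file names inside the archive (OEBPS/content.opf, chapter.xhtml, nav.xhtml) are choices made for this example, not anything the tools above use, and the body you pass in must already be well-formed XHTML, which raw web pages often are not.

```python
import uuid
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">{book_id}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>{lang}</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter" href="chapter.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="chapter"/>
  </spine>
</package>
"""

NAV = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>{title}</title></head>
<body><nav epub:type="toc"><ol><li><a href="chapter.xhtml">{title}</a></li></ol></nav></body>
</html>
"""

CHAPTER = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title></head>
<body>{body}</body>
</html>
"""


def build_epub(target, title, body_xhtml, lang="en"):
    """Write a one-chapter EPUB 3 to `target` (a path or a binary file object)."""
    fields = {
        "title": escape(title),
        "lang": lang,
        "book_id": "urn:uuid:" + str(uuid.uuid4()),
        "modified": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "body": body_xhtml,  # assumed to be well-formed XHTML already
    }
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER)
        zf.writestr("OEBPS/content.opf", OPF.format(**fields))
        zf.writestr("OEBPS/nav.xhtml", NAV.format(**fields))
        zf.writestr("OEBPS/chapter.xhtml", CHAPTER.format(**fields))
```

Something like build_epub("article.epub", "Long article", "<h1>Long article</h1><p>Hello.</p>") would write the file. Images and CSS would each need their own manifest entry, which this sketch leaves out.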

Turtle91 · 10-11-2024, 06:17 AM

You can parse/soup to get the src url for images/css and download those files separately. If there are only a few files it’d be faster to do them manually via the inspector as Karellen mentioned.
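A minimal sketch of that idea, again using the standard-library html.parser rather than Beautiful Soup so it runs as-is. It only looks at <img src> and <link rel="stylesheet" href>, and the names ResourceCollector and find_resources are invented for this example; the acme.com URL is just the placeholder used elsewhere in this thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of images and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self._add(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self._add(attrs["href"])

    def _add(self, ref):
        # Resolve relative references against the page URL; skip repeats.
        url = urljoin(self.base_url, ref)
        if url not in self.urls:
            self.urls.append(url)


def find_resources(html, base_url):
    """Return the image and stylesheet URLs found in `html`, in page order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls
```

Each URL returned by find_resources(page_html, "https://acme.com/2024/09/22/blah.html") could then be fetched one by one, for example with urllib.request.urlretrieve. Images loaded by JavaScript or referenced only from inside CSS files will not show up this way.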

You’re not the first to want that. As I mentioned, the program is not difficult to make, but since there are very few opportunities for legal use there isn’t a big incentive for a company to make one publicly available.

Some browsers let you use Save As if you don’t mind getting the whole page, including all the junk. But most of that junk doesn’t work in an EPUB and needs to be cleaned out.

Comfy.n · 10-11-2024, 06:24 AM

This Firefox add-on works fine in most cases; I love it.

https://addons.mozilla.org/en-US/fir...n/saveasebook/

There's also EpubPress: https://addons.mozilla.org/en-US/fir...e-web-offline/

patrik · 10-11-2024, 07:02 AM

I miss bloxp.... I typically use Pocket. Works most of the time, but a bit too often not.

msel · 10-11-2024, 09:20 AM

Hello,

1. With Firefox: use the add-on "Readability based Reader View".
https://addons.mozilla.org/en-US/fir...ed-reader-view
Open (and edit) the page in Reader View, then save it with the save button.
Edit the saved HTML file with the Calibre e-book editor. Missing pictures can be downloaded in the editor with Tools > External Links > Download external resources.
Another option is SingleFile (https://addons.mozilla.org/en-US/fir...n/single-file/). It saves the whole page, or just the selection, to one (big) HTML file with the images.
2. For Chrome-based browsers: there is an add-on called rePub. It is aimed at the reMarkable, but you can also create a simple EPUB without one.
https://chromewebstore.google.com/de...cgdapmikoaolpb
3. If you use the Pale Moon browser and have installed the Classic Add-on Archive, you can install the add-on GrabMyBooks. This is the solution I use.

Greetings, Maria

Shohreh · 10-11-2024, 10:22 AM

Thanks much. rePub in Chrome is perfect.

FWIW, the following wget command comes pretty close to downloading a web page and its resources, but the URLs still need post-editing to remove the garbage added after picture filenames (e.g. .jpeg becomes .jpeg?blah, causing errors):

Code:

wget --restrict-file-names=ascii,windows --convert-links  --random-wait -U mozilla -e robots=off --span-hosts --domains=acme.com,cdn.acme.com --page-requisites --no-parent --directory-prefix=.\mydir https://acme.com/2024/09/22/blah.html

---
Edit: I also noted that the URLs of some pictures were not converted to point to a local file so won't be displayed in the EPUB. Also, SumatraPDF didn't like some useless <div> section in the EPUB created by Pandoc ("Couldn't render the page"; didn't try to see if it worked in the e-reader). Bottom line: First try one of the browser extensions before trying pandoc (or wget + Sigil/Calibre).
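The post-editing step for the ".jpeg?blah" references can be scripted. The following is only a sketch: it strips the query string and fragment from src attributes that point at common image types, using a simple regular expression that assumes double-quoted attributes. The names strip_query and clean_image_refs and the list of extensions are assumptions made for this example, and the files wget saved to disk may need renaming so that they match the cleaned references.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Extensions treated as pictures; extend as needed.
IMAGE_EXT = (".jpg", ".jpeg", ".png", ".gif", ".webp")

# Matches src="..." with double quotes only (a simplification).
SRC_RE = re.compile(r'(\bsrc\s*=\s*")([^"]*)(")', re.IGNORECASE)


def strip_query(url):
    """Drop ?query and #fragment from a URL whose path looks like an image."""
    parts = urlsplit(url)
    if parts.path.lower().endswith(IMAGE_EXT):
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return url


def clean_image_refs(html):
    """Rewrite every double-quoted src attribute through strip_query()."""
    return SRC_RE.sub(
        lambda m: m.group(1) + strip_query(m.group(2)) + m.group(3), html
    )
```

For example, clean_image_refs('<img src="pics/photo.jpeg?blah">') gives '<img src="pics/photo.jpeg">', while links to non-image files are left alone.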

PeterT · 10-11-2024, 10:32 AM

Quote:

Originally Posted by msel

Hello,

1. With Firefox: use the add-on "Readability based Reader View".
https://addons.mozilla.org/en-US/fir...ed-reader-view
Open (and edit) the page in Reader View, then save it with the save button.
Edit the saved HTML file with the Calibre e-book editor. Missing pictures can be downloaded in the editor with Tools > External Links > Download external resources.
Another option is SingleFile (https://addons.mozilla.org/en-US/fir...n/single-file/). It saves the whole page, or just the selection, to one (big) HTML file with the images.
2. For Chrome-based browsers: there is an add-on called rePub. It is aimed at the reMarkable, but you can also create a simple EPUB without one.
https://chromewebstore.google.com/de...cgdapmikoaolpb
3. If you use the Pale Moon browser and have installed the Classic Add-on Archive, you can install the add-on GrabMyBooks. This is the solution I use.

Greetings, Maria

Missing a character from the URL

It's actually https://chromewebstore.google.com/de...gdapmikoaolpbl

