10-11-2024, 03:10 AM | #1 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
[SOLVED] Download web page and turn it into EPUB?
Hello,
I'd like to download a web page that's too long to read on a computer, and have it turned into an EPUB file. None of Pandoc, Calibre, or mutool works, either because of wrong layout, wrong characters (ligatures at least), or even "Couldn't render this page". I assumed turning HTML into EPUB (where pages are actually HTML) would be a breeze… but it looks more involved than expected. Does someone know of a reliable, no-brainer solution (for Windows, CLI and/or GUI)? Thank you. Last edited by Shohreh; 10-11-2024 at 11:23 AM. |
10-11-2024, 03:36 AM | #2 |
Wizard
Posts: 1,353
Karma: 6794938
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
You should be able to right-click on the start of the text and select Inspect.
Then scroll down through the <div> elements until you hit the one that makes all of the text turn blue. Then use the context menu > Copy > Inner HTML. |
10-11-2024, 03:38 AM | #3 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Thanks. Is there no easier solution?
Google didn't help me find the right wget options to download a web page along with everything required to then turn it into a readable EPUB file, even for a no-frills, single-column web page. |
10-11-2024, 04:10 AM | #4 |
Wizard
Posts: 1,353
Karma: 6794938
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
|
There probably is, but I am not aware of one.
|
10-11-2024, 06:24 AM | #5 |
A Hairy Wizard
Posts: 3,220
Karma: 19000635
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
It’s not necessarily easier in the short run, but you can learn Python (a very easy/straightforward programming language) and make a short spider scraper that will grab all the HTML from all the pages and put it in a text file. Then you just need to do some simple massaging to format it as EPUB.
In the long run you will know python and be the ruler of your universe!! In the medium run make sure you follow copyright restrictions and/or have the website/author’s permission before you scrape those pages. |
10-11-2024, 06:28 AM | #6 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
I know Python, but surely applications like wget etc. can download a web page and its resources (CSS, JPG/PNG) before feeding them into Pandoc etc. to get an EPUB?
|
10-11-2024, 06:43 AM | #7 |
Resident Curmudgeon
Posts: 76,358
Karma: 136006198
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
What is the URL, so that maybe someone can have a go at it?
|
10-11-2024, 06:48 AM | #8 |
A Hairy Wizard
Posts: 3,220
Karma: 19000635
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
The issue is parsing the html to get just the book and not all the fluff/ads. Those ads are likely what is causing the issues. Soup and the script can do all the scraping and 99% of the massaging to output a text file with all the book contents and associated html tags. Then just copy/paste the contents of the output file into pandoc/sigil/calibre for final epub massaging.
I wrote a program to do all that as a project to learn python and made a gui for it. That was fun! However, there aren’t any websites that I’m aware of which allow its use. You are pretty much restricted to converting your own webpage to an epub. Last edited by Turtle91; 10-11-2024 at 06:52 AM. |
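The parsing step described in this post (keep the article, drop the fluff) can be sketched roughly as follows. This is not the program mentioned above; "Soup" presumably refers to Beautiful Soup, but to stay dependency-free the sketch swaps in Python's standard-library html.parser. The element id "article" used in the usage note is purely hypothetical, and the sketch assumes reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect the inner HTML of the first element with a given id."""

    def __init__(self, target_id):
        # Keep entities such as &eacute; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # 0 means "outside the target element"
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.parts.append(self.get_starttag_text())
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1  # entered the target; its own tag is not emitted

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are copied but never change the depth.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID_TAGS:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # closing tag of the target itself
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


def extract_inner_html(page_html, target_id):
    """Return the inner HTML of the element whose id is target_id."""
    parser = ContentExtractor(target_id)
    parser.feed(page_html)
    parser.close()
    return "".join(parser.parts).strip()
```

Calling extract_inner_html(page_html, "article") would return roughly what the inspector's Copy > Inner HTML gives for that element. Pages with badly mismatched tags would need a more forgiving parser.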
10-11-2024, 06:52 AM | #9 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Any web page will do, it's not a specific page.
Ads are not the problem. The problem is 1) getting the CSS and pictures, and 2) turning those into a working EPUB. For different reasons (chopped page, wrong characters, "Couldn't render this page" in one section), neither pandoc, Calibre nor Sigil worked. Surely, I'm not the first person to want to turn a long web article into an EPUB file to read on an e-reader. |
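For step 2, it may help to see how little an EPUB container actually needs. The following is a minimal sketch, not what Pandoc, Calibre or Sigil do internally: it assumes the page body has already been cleaned into well-formed XHTML, handles a single chapter with no images or CSS, and the function name build_epub and the file names inside the archive are my own choices.

```python
import uuid
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(path, title, body_xhtml, language="en"):
    """Write a one-chapter EPUB 3 file; body_xhtml must be well-formed XHTML."""
    title = escape(title)
    modified = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:{uuid.uuid4()}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>{language}</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter" href="chapter.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="chapter"/>
  </spine>
</package>
"""
    nav = f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>{title}</title></head>
<body><nav epub:type="toc"><ol><li><a href="chapter.xhtml">{title}</a></li></ol></nav></body>
</html>
"""
    chapter = f"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>{title}</title></head>
<body>
{body_xhtml}
</body>
</html>
"""
    with zipfile.ZipFile(path, "w") as zf:
        # The mimetype entry must come first and must be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        for name, data in (("META-INF/container.xml", CONTAINER_XML),
                           ("OEBPS/content.opf", opf),
                           ("OEBPS/nav.xhtml", nav),
                           ("OEBPS/chapter.xhtml", chapter)):
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)
```

Pictures and stylesheets would each need their own manifest entry in content.opf. Raw HTML saved from the web is often not well-formed XML, which is one plausible reason for "Couldn't render this page" errors.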
10-11-2024, 07:17 AM | #10 |
A Hairy Wizard
Posts: 3,220
Karma: 19000635
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
You can parse/soup to get the src url for images/css and download those files separately. If there are only a few files it’d be faster to do them manually via the inspector as Karellen mentioned.
You’re not the first to want that. As I mentioned, the program is not difficult to make, but since there are very few opportunities for legal use there isn’t a big incentive for a company to make one publicly available. Some browsers allow you to use Save As if you don’t mind getting the whole page including all the junk. But most of that junk doesn’t work in an EPUB and needs to be cleaned out. Last edited by Turtle91; 10-11-2024 at 07:22 AM. |
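Finding the src/href URLs mentioned at the top of this post can be sketched as below. It is a rough illustration under a few assumptions: it uses the standard-library html.parser rather than Beautiful Soup, it only looks at img tags and rel="stylesheet" links, and the names ResourceCollector and collect_resources are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Gather absolute URLs of images and stylesheets referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.stylesheets.append(urljoin(self.base_url, attrs["href"]))


def collect_resources(page_html, base_url):
    """Return (image_urls, stylesheet_urls), resolved against base_url."""
    collector = ResourceCollector(base_url)
    collector.feed(page_html)
    collector.close()
    return collector.images, collector.stylesheets
```

Each returned URL could then be fetched with urllib.request.urlretrieve, subject to the same copyright and permission caveats raised earlier in the thread. Images referenced only from CSS or loaded by JavaScript would not be found this way.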
10-11-2024, 07:24 AM | #11 |
want to learn what I want
Posts: 1,252
Karma: 6426810
Join Date: Sep 2020
Device: Calibre E-book viewer
|
This Firefox add-on works fine in most cases, I love it.
https://addons.mozilla.org/en-US/fir...n/saveasebook/
There's also epubpress: https://addons.mozilla.org/en-US/fir...e-web-offline/ |
10-11-2024, 08:02 AM | #12 |
Guru
Posts: 674
Karma: 4568205
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
|
I miss bloxp.... I typically use Pocket. Works most of the time, but a bit too often not.
|
10-11-2024, 10:20 AM | #13 |
Connoisseur
Posts: 67
Karma: 143132
Join Date: Sep 2010
Device: Kindle Keyboard 3G
|
Three Suggestions
Hello,
1. With Firefox: use the add-on Readability based Reader View https://addons.mozilla.org/en-US/fir...ed-reader-view
Open (and edit) the page with the Reader View and save the web page with the save button. Edit the saved HTML file with the Calibre e-book editor. The missing pictures can be downloaded in the editor with Tools > External Links > Download external resources.
Another solution would be to use SingleFile (https://addons.mozilla.org/en-US/fir...n/single-file/). It saves the whole page, or just the selection, to one (big) HTML file with the images included.
2. For Google Chrome-based browsers: there is an add-on, rePub, made especially for the Remarkable, but you can also create a simple EPUB without a Remarkable. https://chromewebstore.google.com/de...cgdapmikoaolpb
3. If you use the Pale Moon browser and have installed the Classic Add-on Archive, you can install the add-on GrabMyBooks. This is the solution I use.
Greetings, Maria |
10-11-2024, 11:22 AM | #14 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Thanks much. rePub in Chrome is perfect.
FWIW, the following wget command comes pretty close to downloading a web page and its resources, but the URLs still need to be post-edited to remove the garbage added after picture filenames (e.g. .jpeg becomes .jpeg?blah, causing errors):
Code:
wget --restrict-file-names=ascii,windows --convert-links --random-wait -U mozilla -e robots=off --span-hosts --domains=acme.com,cdn.acme.com --page-requisites --no-parent --directory-prefix=.\mydir https://acme.com/2024/09/22/blah.html
Edit: I also noted that the URLs of some pictures were not converted to point to a local file, so they won't be displayed in the EPUB. Also, SumatraPDF didn't like some useless <div> section in the EPUB created by Pandoc ("Couldn't render the page"; I didn't try to see if it worked in the e-reader).
Bottom line: First try one of the browser extensions before trying pandoc (or wget + Sigil/Calibre). Last edited by Shohreh; 10-12-2024 at 04:16 AM. |
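The post-editing of picture names mentioned above could be automated along these lines. This is a hedged sketch: it assumes the unwanted suffix follows a common image extension and begins with "?", its escaped form "%3F", or "@" (which, as far as I know, wget's windows mode substitutes for "?" in local file names), and that the saved pages are UTF-8. The function names and the extension list are my own.

```python
import re
from pathlib import Path

# An image extension followed by a query-style suffix ("?", "%3F" or "@").
_SUFFIX = re.compile(
    r"""(\.(?:jpe?g|png|gif|webp|svg))(?:\?|%3F|@)[^"'\s>)]*""",
    re.IGNORECASE,
)


def strip_query_suffix(text):
    """Remove the suffix after image extensions in a name or a block of HTML."""
    return _SUFFIX.sub(r"\1", text)


def clean_download_dir(root):
    """Rename downloaded pictures and fix the references in saved pages."""
    root = Path(root)
    # Materialise the listing first so renaming does not disturb the walk.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            new_name = strip_query_suffix(path.name)
            target = path.with_name(new_name)
            if new_name != path.name and not target.exists():
                path.rename(target)
    for page in root.rglob("*.htm*"):
        # surrogateescape keeps any non-UTF-8 bytes intact on the round trip.
        text = page.read_text(encoding="utf-8", errors="surrogateescape")
        page.write_text(strip_query_suffix(text), encoding="utf-8",
                        errors="surrogateescape")
```

Running clean_download_dir on the directory passed to --directory-prefix (mydir in the command above) should be done on a copy first, since two different suffixes on the same picture name would collapse to a single file.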
10-11-2024, 11:32 AM | #15 | |
Grand Sorcerer
Posts: 12,738
Karma: 75000000
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
Quote:
It's actually https://chromewebstore.google.com/de...gdapmikoaolpbl |
|