Quick and easy way to turn a website into a book?

bounce · 06-16-2019, 01:38 PM

Say I’ve used a tool to download my whole website and I have a bunch of .html files. I then want to turn these files into a pdf book with each page of the website linked so I can read it offline. These files don’t necessarily need to be in any order, though it would be nice if the file structure was the same. What’s the easiest way to do this in Calibre? Or is there another (OSX or online) tool appropriate for the job?

skb · 06-16-2019, 06:30 PM

I've done this to a lesser degree.

In theory, you could download HTML files from a site (using something like SiteSucker).

However, depending on the site, there's usually LOTS going on - ads etc. And a lot of (most?) sites these days aren't static HTML but are generated with a CMS (like this site).

Anyway, if the HTML is vanilla enough, you could download the site. Then, I would create a new library* (especially if there's gazillions of pages) and import the files into the blank library. Then convert them into epubs. Then, using the EpubMerge plugin, merge them. Once you've got a merged ebook, you can move it into your "normal" library (if you wish).

Having said all that, there is lots that can go wrong. I would convert one HTML file first, view it and check that it's actually readable.

To be honest, I post-process any HTML I import into Calibre: to remove menus, ads, images, formatting, styles etc etc. So, I try not to do it often.

That's how I'd do it - there may well be a scripted or easier way but my programming skills are waaaaay out of date.

Good luck!

* I create a new/temp library because I don't want to miss/overlook a file etc and it's a way of quarantining. I usually delete my temp library after this sort of thing. Your mileage may vary.
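For anyone who prefers a scripted route, the convert step above can be sketched in Python. This is only a sketch under assumptions: it assumes calibre's command-line tools are installed and `ebook-convert` is on the PATH, and the function names and the folder names `site` and `epubs` are made up for illustration. It mirrors the downloaded folder structure and can convert a single page first as a readability check.

```python
import subprocess
from pathlib import Path


def build_commands(src_dir, out_dir):
    """Return one ebook-convert command per .html file, mirroring the folder layout."""
    src, out = Path(src_dir), Path(out_dir)
    commands = []
    for html in sorted(src.rglob("*.html")):
        # Same relative path as the source page, but with an .epub extension.
        target = (out / html.relative_to(src)).with_suffix(".epub")
        commands.append(["ebook-convert", str(html), str(target)])
    return commands


def convert_all(src_dir, out_dir, limit=None):
    """Run the conversions; limit=1 converts one page only, as a test run."""
    for cmd in build_commands(src_dir, out_dir)[:limit]:
        Path(cmd[2]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)  # requires calibre's ebook-convert


if __name__ == "__main__":
    # "site" and "epubs" are hypothetical folder names.
    convert_all("site", "epubs", limit=1)
```

`ebook-convert` picks the output format from the file extension, so changing `.epub` to `.pdf` should give per-page PDFs instead. The merging itself would still be done with EpubMerge inside Calibre.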

bounce · 06-18-2019, 06:16 PM

I had used SiteSucker and was thinking of importing the HTML into BBEdit or TextWrangler, then stripping the HTML out automatically, then combining everything and turning it into a PDF. It wouldn't keep the formatting and wouldn't be pretty.
Thanks for the pointers on merging with EpubMerge, will try that later.
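The strip-and-combine idea can also be done without a text editor. Below is a minimal sketch using only the Python standard library; the class and function names and the `site` folder are assumptions for illustration, not part of any tool mentioned in this thread. It drops every tag, skips script and style contents, keeps one line per block element, and joins all pages into a single text with the relative file path as a heading.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextOnly(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # block elements start a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(markup):
    """Return the plain text of an HTML string, one line per block element."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def combine(src_dir):
    """Join the stripped text of every .html file, titled by relative path."""
    src = Path(src_dir)
    chunks = []
    for html in sorted(src.rglob("*.html")):
        text = strip_html(html.read_text(encoding="utf-8", errors="replace"))
        chunks.append(f"== {html.relative_to(src).as_posix()} ==\n{text}")
    return "\n\n".join(chunks)


if __name__ == "__main__":
    # "site" is a hypothetical folder holding the downloaded pages.
    Path("book.txt").write_text(combine("site"), encoding="utf-8")
```

As expected, the result has no formatting at all, and menus and ads that are plain text in the page will still come through.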

skb · 06-18-2019, 06:22 PM

I'm sure I had an app (retired, so is brain) that removed the fluffy HTML but left headings, bold, italic. I may have imagined it.

However, for removing ALL HTML, I use Clean Text. I'm not sure it does batch jobs though. Clean Text is excellent and does what it says on the tin.

I wonder if something like Brackets would clean up the HTML but not remove it completely?

Sorry I can't be more help. I feel your pain.

Edit: doesn't look like Brackets is helpful in this case...
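For the middle ground described above (removing the fluffy HTML but leaving headings, bold and italic), a small whitelist filter is one possible approach. This is a sketch using Python's standard library only, not the app I was trying to remember; the tag lists and the names `KeepBasics` and `simplify` are assumptions, and treating everything inside `nav` as a menu to be dropped is a guess that won't fit every site.

```python
from html import escape
from html.parser import HTMLParser

# Tags that survive (without attributes); everything else is unwrapped.
KEEP = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "b", "strong", "i", "em"}
# Tags whose whole content is thrown away (assumed to be scripts, styles, menus).
DROP_WITH_CONTENT = {"script", "style", "nav"}


class KeepBasics(HTMLParser):
    """Rebuild the page keeping only whitelisted tags and the text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._drop_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._drop_depth += 1
        elif tag in KEEP and not self._drop_depth:
            self.out.append(f"<{tag}>")  # class, style and other attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._drop_depth = max(0, self._drop_depth - 1)
        elif tag in KEEP and not self._drop_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._drop_depth:
            # Re-escape &, < and > so the output is still valid HTML text.
            self.out.append(escape(data, quote=False))


def simplify(markup):
    """Return markup reduced to headings, paragraphs, bold and italic."""
    parser = KeepBasics()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(simplify('<div class="post"><h1 style="color:red">Title</h1>'
                   '<p>Some <b>bold</b> text</p></div>'))
```

Note that an unclosed `nav` or `script` tag in messy HTML would make this drop the rest of the page, so it's worth checking one file by eye before running it over a whole site.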

Last edited by skb; 06-18-2019 at 06:24 PM. Reason: Brain snap
