Converting website to epub

cptnemo · 02-05-2012, 01:06 AM

Hello,

I have been trying for a few days to convert a website that I downloaded to epub, but with no success.

This is what I have tried.

1) I mirrored the site with HTTrack. Job done, now my site is happily sitting on my hard drive, divided into one subfolder for every HTML page: 1,300 pages in total, 99.99% pure text with very few images.

2) I tried to import the website into Calibre following the instructions given in Calibre's guide (import the content page, then convert to epub).

PROBLEM: Calibre does its job by moving all the HTML pages into one folder (the Text folder) and renaming the pages, but without updating the internal links. That is, all the links connecting one page to another within the site/epub are broken.

What steps should I follow to maintain my internal links? Doing the work manually is out of the question (we are talking about 1,300 pages). Do you know of any software that will help me move and rename the pages (they are all named index.html) from the subfolders to a root folder without breaking the links?

Thanks!

KenJackson · 02-05-2012, 01:14 AM

I haven't tried it yet, but pandoc claims to be able to read HTML and write EPUB, among many other formats.

cptnemo · 02-05-2012, 06:20 AM

Quote:

Originally Posted by KenJackson

I haven't tried it yet, but pandoc claims to be able to read HTML and write EPUB, among many other formats.

Mmm, I don't think that is what I need. I don't want to bypass Calibre, I just need to arrive at Calibre with a website contained all in one folder. What I need is to move all the pages from the subdirectories to the parent directory. And of course, in doing this I need to preserve all the internal links.
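The move-and-relink step asked for here can also be scripted. What follows is a minimal Python sketch, not a finished tool. It assumes the mirror is UTF-8 HTML, that pages reference each other through plain relative href attributes, and that a flat name such as a_index.html for a/index.html is acceptable. The function name flatten_site and the naming scheme are illustrative and are not part of HTTrack or Calibre. Image src attributes are not handled.

```python
import os
import re
from pathlib import Path
from urllib.parse import urldefrag, unquote

# Matches href="..." or href='...' and captures prefix, quote and value.
HREF_RE = re.compile(r'(href\s*=\s*)(["\'])(.*?)\2', re.IGNORECASE)


def flat_name(rel, taken):
    """Turn a relative path like a/index.html into a unique flat name."""
    base = "_".join(rel.parts)
    name, n = base, 1
    while name in taken:  # rare clash, e.g. a_b/x.html vs a/b_x.html
        name = f"{n:03d}_{base}"
        n += 1
    taken.add(name)
    return name


def flatten_site(src_root, dst_root):
    """Copy every .html file under src_root into dst_root (one folder),
    rewriting relative links between those pages to the new names."""
    src_root = Path(src_root).resolve()
    dst_root = Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)

    pages = sorted(p for p in src_root.rglob("*.html") if p.is_file())
    taken = set()
    mapping = {p: flat_name(p.relative_to(src_root), taken) for p in pages}

    for page, new_name in mapping.items():
        def relink(m, page=page):
            url, frag = urldefrag(m.group(3))
            # Leave anchors-only, external and mail links untouched.
            if not url or "://" in url or url.startswith("mailto:"):
                return m.group(0)
            target = Path(os.path.normpath(page.parent / unquote(url)))
            if target not in mapping:
                return m.group(0)  # not one of our pages
            new_url = mapping[target] + (f"#{frag}" if frag else "")
            return f"{m.group(1)}{m.group(2)}{new_url}{m.group(2)}"

        # surrogateescape keeps stray non-UTF-8 bytes intact on round trip.
        text = page.read_text(encoding="utf-8", errors="surrogateescape")
        (dst_root / new_name).write_text(
            HREF_RE.sub(relink, text), encoding="utf-8", errors="surrogateescape"
        )

    # Old relative path -> new flat file name, for reference.
    return {p.relative_to(src_root).as_posix(): n for p, n in mapping.items()}
```

A call such as flatten_site("mirror", "flat") (both folder names hypothetical) writes the flattened copy to a separate folder and leaves the original mirror untouched, so the result can be checked before importing it into Calibre.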

mrmikel · 02-05-2012, 08:26 AM

You need to use Sigil to examine how the links are broken. They are likely all broken in each section in a similar way which you can use search and replace to fix, so it might be many fewer fixes. Also, spaces in file names or capitalization may not match.
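To see how the links are broken before deciding on a search-and-replace pattern, the broken targets can be counted with a short script. This is a rough Python sketch under some assumptions: the pages have already been extracted into a single folder, they are readable as UTF-8, and links are plain href attributes. The function name broken_links is made up for this example.

```python
import re
from collections import Counter
from pathlib import Path
from urllib.parse import urldefrag, unquote

HREF_RE = re.compile(r'href\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)


def broken_links(folder):
    """Count local href targets that do not match any file name in folder.

    The comparison is exact, so a link to Index.html is reported when
    the file on disk is index.html (the capitalization case above).
    """
    folder = Path(folder)
    names = {p.name for p in folder.iterdir() if p.is_file()}
    broken = Counter()
    for page in sorted(folder.glob("*.html")):
        text = page.read_text(encoding="utf-8", errors="replace")
        for match in HREF_RE.finditer(text):
            url, _fragment = urldefrag(match.group(2))
            # Skip same-page anchors, external links and mail links.
            if not url or "://" in url or url.startswith("mailto:"):
                continue
            if unquote(url) not in names:
                broken[url] += 1
    return broken
```

Printing broken_links("Text").most_common(20) (the folder name is an assumption) shows the most frequent broken targets, which is usually enough to spot a shared prefix such as ../ that one search and replace could remove.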

Your text folder in epub will need to be broken down in sections anyway because most readers get unhappy if individual sections are too large, in the case of my Sony over 300k.
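A quick way to find the sections that would exceed such a limit is to list the files by size. This is a small Python sketch. The 300 KiB default is only one reading of the "300k" figure quoted for the Sony reader, and the function name oversized_pages is hypothetical.

```python
from pathlib import Path


def oversized_pages(folder, limit_bytes=300 * 1024):
    """Return (file name, size in bytes) for .html files above the limit."""
    results = []
    for page in Path(folder).glob("*.html"):
        size = page.stat().st_size
        if size > limit_bytes:
            results.append((page.name, size))
    return sorted(results)
```

Any file it reports is a candidate for splitting into smaller sections before or after conversion.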

If you use Sigil to perform this chore, be sure to keep backup copies right along. You might make an error and Sigil will offer to fix it automatically....don't let it. Sigil is often right in its fixing, but when it is wrong, half of what you are working on might disappear. Either fix it if you understand the code view well or just load your last saved version.

cptnemo · 02-06-2012, 09:49 AM

Quote:

Originally Posted by mrmikel

You need to use Sigil to examine how the links are broken. They are likely all broken in each section in a similar way which you can use search and replace to fix, so it might be many fewer fixes. Also, spaces in file names or capitalization may not match.

Your text folder in epub will need to be broken down in sections anyway because most readers get unhappy if individual sections are too large, in the case of my Sony over 300k.

If you use Sigil to perform this chore, be sure to keep backup copies right along. You might make an error and Sigil will offer to fix it automatically....don't let it. Sigil is often right in its fixing, but when it is wrong, half of what you are working on might disappear. Either fix it if you understand the code view well or just load your last saved version.

Yes, but I have 1,300 pages. I can't fix manually every single broken link. So I need some software to do it for me: move the pages and links from subfolder to one parent folder.

andyh2000 · 02-06-2012, 10:22 AM

Quote:

Originally Posted by cptnemo

Yes, but I have 1,300 pages. I can't fix manually every single broken link. So I need some software to do it for me: move the pages and links from subfolder to one parent folder.

I know you've already mirrored the site so might not want to do it again but I use "wget" to mirror web sites and it has a "no directories" option ("-nd") which stops it creating sub-directories. It also fixes up the internal links so nothing is broken.

The actual command I use is:

Code:

wget.exe -p -k -nd -q -E -R js,txt,css -nc %pg%

which translates as:

-p: get all images, etc. needed to display the HTML page
-k: make links in downloaded HTML point to local files
-nd: don't create directories
-q: quiet
-E: save HTML documents with a `.html' extension
-R js,txt,css: comma-separated list of rejected extensions
-nc: skip downloads that would download to existing files

and you'll need to add "-r" for recursive download (thanks frostschutz for pointing that out below - my usage was for turning a single page into an epub)

You can get it from here: http://gnuwin32.sourceforge.net/packages/wget.htm

Hope this helps

Andrew
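For anyone who wants to run this from a script, the same options can be assembled in Python. The sketch below is only an illustration: it assumes a wget executable is on the PATH, the function names build_wget_command and mirror_flat are invented here, -r is added for a whole site, -q is off by default so progress stays visible, and -nc is left out because a later reply in this thread reports wget complaining about it.

```python
import subprocess


def build_wget_command(url, recursive=True, quiet=False):
    """Build the wget argument list for a flat, link-converted download."""
    args = ["wget"]
    if recursive:
        args.append("-r")  # follow links to the other pages of the site
    # -p page requisites, -k convert links, -nd no directories,
    # -E save as .html, -R reject the listed extensions
    args += ["-p", "-k", "-nd", "-E", "-R", "js,txt,css"]
    if quiet:
        args.append("-q")  # no progress output at all
    args.append(url)
    return args


def mirror_flat(url, dest_dir):
    """Run wget with dest_dir as working directory; with -nd the files
    are written directly into that directory."""
    subprocess.run(build_wget_command(url), cwd=dest_dir, check=True)
```

Since -nd writes everything into the working directory, passing an empty folder as dest_dir keeps the download separate from other files.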

frostschutz · 02-06-2012, 10:22 AM

Try pavuk. Pavuk is a monster when it comes to HTTP downloading.

Pavuk can flatten an HTML structure to a degree; unfortunately, its options are hard to understand!

For testing I created a web site structure like this:

Code:

./a
./a/a.html
./b
./b/b.html
./c
./c/c.html
./index.html

with HTML source code looking like this

Code:

<html>
<head>
</head>
<body>
a is <a href="../a/a.html">a.html</a>
index is <a href="../index.html">index.html</a>
c is <a href="../c/c.html">c.html</a>
</body>
</html>

The following command tries to flatten up to 8 subdir levels to a single directory structure:

Code:

pavuk -mode mirror -base_level 8 -sel_to_local http://localhost/pavuk/index.html

Result looks like this:

Code:

./index.html
./a.html
./c.html
./b.html

with HTML code like this

Code:

<html>
<head>
</head>
<body>
a is <a href="a.html">a.html</a>
index is <a href="index.html">index.html</a>
c is <a href="c.html">c.html</a>
</body>
</html>

So it's flattened properly into a single directory.

For identical filenames it also renames those files (e.g. a/index.html becomes 001index.html) and fixes the linking properly.

Unfortunately it's not clear from the name what its original location was, so the only thing you have to work with is the files linking to each other; so you will probably need some kind of "main index" file that helps Calibre/Sigil get the content into the right order.
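If the site has no such page, a "main index" file can be generated from the flattened folder. This is a minimal Python sketch under stated assumptions: the pages are UTF-8, alphabetical order by file name is good enough as a starting point (it may well not be the right reading order), and toc.html is an arbitrary name chosen for this example, as is the function name build_main_index.

```python
import html
import re
from pathlib import Path

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def build_main_index(folder, index_name="toc.html"):
    """Write an HTML page that links to every .html file in folder."""
    folder = Path(folder)
    items = []
    for page in sorted(folder.glob("*.html")):
        if page.name == index_name:
            continue  # do not list the index inside itself
        text = page.read_text(encoding="utf-8", errors="replace")
        match = TITLE_RE.search(text)
        # Use the page title as link text, falling back to the file name.
        title = " ".join(match.group(1).split()) if match else ""
        label = html.escape(html.unescape(title)) if title else html.escape(page.name)
        items.append(f'<li><a href="{html.escape(page.name)}">{label}</a></li>')
    doc = (
        "<html>\n<head><title>Contents</title></head>\n<body>\n<ul>\n"
        + "\n".join(items)
        + "\n</ul>\n</body>\n</html>\n"
    )
    index_path = folder / index_name
    index_path.write_text(doc, encoding="utf-8")
    return index_path
```

The list items in the generated file can then be reordered by hand before it is used as the page to import.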

@Andrew: you ninja'd me. But I didn't know that wget had such an option too. wget may be easier to understand than pavuk.

You forgot the -r option though, your cmdline does not download subdirs for me. And my wget complains about -nc in conjunction with -nd and uses -nd only. Otherwise it works just as well.

cptnemo · 02-08-2012, 03:13 AM

Quote:

Originally Posted by andyh2000

I know you've already mirrored the site so might not want to do it again but I use "wget" to mirror web sites and it has a "no directories" option ("-nd") which stops it creating sub-directories. It also fixes up the internal links so nothing is broken.

The actual command I use is:

Code:

wget.exe -p -k -nd -q -E -R js,txt,css -nc %pg%

It seems to be what I need. I launched the command in my Windows terminal (I also added -r) and, now what?

The terminal is saying:

SYSTEM_wcetrc = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files\GnuWin32/etc/wgetrc

With the cursor blinking...

Is it downloading something? Where?

Thanks!

gennaro · 04-18-2013, 09:26 AM

Just tested: WinHTTrack has an option to save all files in the same folder, and it fixes the links:
set options>build>local structure type (the options are self-explanatory)


