Old 02-06-2012, 09:22 AM   #7
frostschutz
Linux User
Try pavuk. Pavuk is a monster when it comes to HTTP downloading.

pavuk can flatten an HTML directory structure to a degree. Unfortunately, its options are hard to understand.

For testing I created a web site structure like this:
Code:
./a
./a/a.html
./b
./b/b.html
./c
./c/c.html
./index.html
with HTML source code looking like this:

Code:
<html>
<head>
</head>
<body>
a is <a href="../a/a.html">a.html</a>
index is <a href="../index.html">index.html</a>
c is <a href="../c/c.html">c.html</a>
</body>
</html>
The following command tries to flatten up to 8 subdirectory levels into a single directory:

Code:
pavuk -mode mirror -base_level 8 -sel_to_local http://localhost/pavuk/index.html
The result looks like this:

Code:
./index.html
./a.html
./c.html
./b.html
with HTML code like this:

Code:
<html>
<head>
</head>
<body>
a is <a href="a.html">a.html</a>
index is <a href="index.html">index.html</a>
c is <a href="c.html">c.html</a>
</body>
</html>
So it's flattened properly into a single directory.

When filenames are identical, it also renames the files (e.g. a/index.html becomes 001index.html) and fixes the links to match.
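To make the renaming idea concrete, here is a minimal Python sketch of flattening with collision handling. It is not pavuk's actual algorithm: the function name `flatten_names` is made up for illustration, and the three-digit prefix only imitates the 001index.html example above.

```python
from pathlib import PurePosixPath


def flatten_names(paths):
    """Map each relative path to a unique file name in one flat directory.

    Hypothetical helper for illustration; pavuk's real naming rules may differ.
    """
    used = set()
    mapping = {}
    for path in paths:
        name = PurePosixPath(path).name
        candidate = name
        counter = 0
        # On a clash, try 001name, 002name, ... until the name is free.
        while candidate in used:
            counter += 1
            candidate = f"{counter:03d}{name}"
        used.add(candidate)
        mapping[path] = candidate
    return mapping


if __name__ == "__main__":
    sample = ["index.html", "a/index.html", "b/b.html"]
    for src, dst in flatten_names(sample).items():
        print(src, "->", dst)
```

A real tool also has to rewrite every href that pointed at the old path, which this sketch leaves out.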

Unfortunately, it's not clear from the new name what the file's original location was, so the only thing you have to work with is how the files link to each other. You will probably need some kind of "main index" file that helps Calibre/Sigil get the content into the right order.
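One way to build such a "main index" is to follow the links breadth-first from the start page and write the files out in the order they are reached. The following is only a rough sketch using the Python standard library. The names `reading_order`, `write_main_index` and the output file `toc.html` are my own inventions, and it assumes the flattened files link to each other by plain file name, as in the result above.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def reading_order(directory, start="index.html"):
    """Return local file names in breadth-first link order from start."""
    directory = Path(directory)
    order, seen, queue = [], {start}, deque([start])
    while queue:
        name = queue.popleft()
        page = directory / name
        if not page.is_file():
            continue  # link target was not downloaded
        order.append(name)
        parser = LinkCollector()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for href in parser.links:
            target = href.split("#")[0]  # drop in-page anchors
            # Follow only plain local file names; skip URLs and subpaths.
            if target and "/" not in target and ":" not in target and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def write_main_index(directory, output="toc.html", start="index.html"):
    """Write a simple page that links to every reachable file in order."""
    items = "\n".join(
        f'<li><a href="{name}">{name}</a></li>'
        for name in reading_order(directory, start)
    )
    html = f"<html>\n<body>\n<ol>\n{items}\n</ol>\n</body>\n</html>\n"
    (Path(directory) / output).write_text(html, encoding="utf-8")
```

Calling `write_main_index(".")` in the flattened directory would produce a toc.html to feed to Calibre/Sigil. Breadth-first order is just a guess at a sensible reading order; links with query strings or URL-encoded names are not handled.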


@Andrew: you ninja'd me. I didn't know that wget had such an option too, and wget may be easier to understand than pavuk. You forgot the -r option though; without it, your command line does not download subdirectories for me. Also, my wget complains about -nc in conjunction with -nd and uses -nd only. Otherwise it works just as well.

Last edited by frostschutz; 02-06-2012 at 09:26 AM.