Old 02-05-2012, 12:06 AM   #1
cptnemo
Enthusiast
cptnemo began at the beginning.
 
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
Converting website to epub

Hello,

I have spent the last few days trying to convert a website that I downloaded to EPUB, but without success.

This is what I have tried.

1) I mirrored the site with HTTrack. Job done: the site is now sitting on my hard drive, with one subfolder for every HTML page. In total there are 1,300 pages, 99.99% pure text with very few images.

2) I tried to import the website into Calibre following the instructions in Calibre's guide (import the contents page, then convert to EPUB).
PROBLEM: Calibre does its job by moving all the HTML pages into one folder (the Text folder) and renaming them, but without updating the internal links. That is, all the links connecting one page to another within the site/EPUB are broken.
What steps should I follow to keep my internal links working? Doing the work manually is out of the question (we are talking about 1,300 pages). Do you know of any software that will help me move and rename the pages (they are all called index.html) from the subfolders to a root folder without breaking the links?

Thanks!
Old 02-05-2012, 12:14 AM   #2
KenJackson
Addict
KenJackson goes to infinity... and beyond!
 
Posts: 256
Karma: 112042
Join Date: Oct 2010
Location: Maryland, USA
Device: Sony PRS-650
I haven't tried it yet, but pandoc claims to be able to read HTML and write EPUB, among many other formats.
Old 02-05-2012, 05:20 AM   #3
cptnemo
Enthusiast
cptnemo began at the beginning.
 
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
Quote:
Originally Posted by KenJackson
I haven't tried it yet, but pandoc claims to be able to read HTML and write EPUB, among many other formats.
Mmm, I don't think that's what I need. I don't want to bypass Calibre; I just need to arrive at Calibre with a website contained in a single folder. What I need is to move all the pages from the subdirectories to the parent directory and, of course, to preserve all the internal links while doing so.
Old 02-05-2012, 07:26 AM   #4
mrmikel
Color me gone
mrmikel ought to be getting tired of karma fortunes by now.
 
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
Use Sigil to examine how the links are broken. They are likely all broken in a similar way within each section, which you can fix with search and replace, so far fewer fixes may be needed than you expect. Spaces in file names or mismatched capitalization may also be the cause.
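
Outside Sigil, the same inspection can be sketched with a short script that lists every local link whose target file is missing. This is only a rough Python sketch, not a Sigil or Calibre feature; the name broken_links is made up for illustration, and it assumes the pages are ordinary HTML files on disk. On a case-insensitive file system such as the Windows default, it will not report capitalization mismatches.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlsplit, unquote


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def broken_links(root):
    """Return sorted (file, href) pairs whose local target does not exist.

    Hypothetical helper: 'root' is the folder holding the mirrored pages.
    """
    broken = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".html", ".htm", ".xhtml")):
                continue
            path = os.path.join(folder, name)
            collector = HrefCollector()
            with open(path, encoding="utf-8", errors="replace") as handle:
                collector.feed(handle.read())
            for href in collector.hrefs:
                parts = urlsplit(href)
                # Skip external links (http:, mailto:, ...) and same-page anchors.
                if parts.scheme or parts.netloc or not parts.path:
                    continue
                target = os.path.normpath(os.path.join(folder, unquote(parts.path)))
                if not os.path.exists(target):
                    broken.append((os.path.relpath(path, root), href))
    return sorted(broken)
```

Printing the result of broken_links("path/to/Text") should show whether the breakage follows one pattern that a single search and replace can fix.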

The Text folder in your EPUB will need to be broken down into sections anyway, because most readers struggle if individual sections are too large: in the case of my Sony, over 300k.
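
To see which pages would exceed such a limit, a few lines of Python are enough. This is a sketch under assumptions: the function name oversized_files is invented here, and "300k" is read as 300 × 1024 bytes, which may not be exactly how any given reader counts.

```python
import os


def oversized_files(root, limit_bytes=300 * 1024):
    """Return (relative_path, size) for HTML files above limit_bytes, largest first.

    The default limit is an assumption based on the 300k figure quoted above.
    """
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith((".html", ".htm", ".xhtml")):
                path = os.path.join(folder, name)
                size = os.path.getsize(path)
                if size > limit_bytes:
                    found.append((os.path.relpath(path, root), size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Any file it reports is a candidate for splitting before or after conversion.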

If you use Sigil to perform this chore, be sure to keep backup copies as you go. You might make an error and Sigil will offer to fix it automatically... don't let it. Sigil is often right in its fixing, but when it is wrong, half of what you are working on might disappear. Either fix it yourself, if you understand the code view well, or just load your last saved version.
Old 02-06-2012, 08:49 AM   #5
cptnemo
Enthusiast
cptnemo began at the beginning.
 
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
Yes, but I have 1,300 pages. I can't manually fix every single broken link, so I need some software to do it for me: move the pages from the subfolders into one parent folder and update the links.
Old 02-06-2012, 09:22 AM   #6
andyh2000
Avid reader
andyh2000 ought to be getting tired of karma fortunes by now.
 
Posts: 825
Karma: 6377682
Join Date: Apr 2009
Location: UK
Device: Samsung Galaxy Z Flip 4 / Kindle Paperwhite
Quote:
Originally Posted by cptnemo
Yes, but I have 1,300 pages. I can't fix manually every single broken link. So I need some software to do it for me: move the pages and links from subfolder to one parent folder.
I know you've already mirrored the site so might not want to do it again but I use "wget" to mirror web sites and it has a "no directories" option ("-nd") which stops it creating sub-directories. It also fixes up the internal links so nothing is broken.

The actual command I use is:
Code:
wget.exe -p -k -nd -q -E -R js,txt,css -nc %pg%
which translates as:

-p: get all images, etc. needed to display the HTML page
-k: make links in downloaded HTML point to local files
-nd: don't create directories
-q: quiet
-E: save HTML documents with the `.html' extension
-R js,txt,css: comma-separated list of rejected extensions
-nc: skip downloads that would download to existing files

and you'll need to add "-r" for recursive download (thanks frostschutz for pointing that out below - my usage was for turning a single page into an epub)
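
For anyone who prefers to script the download, the command can be wrapped roughly as follows. This is a sketch with assumptions: wget is on the PATH, the names build_wget_command and mirror are made up, and the default folder name "mirror" is arbitrary. It deliberately differs from the command above in three ways: it adds -r, it leaves out -q so that progress is visible, and it leaves out -nc, which frostschutz reports wget complaining about. It also adds -P (wget's directory-prefix option) so that it is clear where the files end up.

```python
import subprocess


def build_wget_command(url, output_dir="mirror", recursive=True):
    """Build the wget argument list for a flattened, link-converted mirror.

    Hypothetical helper; the flags are the ones discussed in this post
    plus -P to choose the output folder.
    """
    command = ["wget", "-p", "-k", "-nd", "-E", "-R", "js,txt,css", "-P", output_dir]
    if recursive:
        command.insert(1, "-r")  # follow links so the whole site is fetched
    command.append(url)
    return command


def mirror(url, output_dir="mirror"):
    """Run wget and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_wget_command(url, output_dir), check=True)
```

Calling mirror with the site's start URL should leave every page in the chosen folder, with wget's own progress output shown while it runs.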

You can get it from here: http://gnuwin32.sourceforge.net/packages/wget.htm

Hope this helps

Andrew

Last edited by andyh2000; 02-06-2012 at 09:58 AM. Reason: Oops - forgot the option "-r" for recursive
Old 02-06-2012, 09:22 AM   #7
frostschutz
Linux User
frostschutz ought to be getting tired of karma fortunes by now.
 
Posts: 2,279
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
Try pavuk. Pavuk is a monster when it comes to HTTP downloading.

pavuk can flatten the HTML directory structure to a degree; unfortunately, its options are hard to understand.

For testing I created a web site structure like this:
Code:
./a
./a/a.html
./b
./b/b.html
./c
./c/c.html
./index.html
with HTML source code looking like this

Code:
<html>
<head>
</head>
<body>
a is <a href="../a/a.html">a.html</a>
index is <a href="../index.html">index.html</a>
c is <a href="../c/c.html">c.html</a>
</body>
</html>
The following command tries to flatten up to 8 subdir levels to a single directory structure:

Code:
pavuk -mode mirror -base_level 8 -sel_to_local http://localhost/pavuk/index.html
Result looks like this:

Code:
./index.html
./a.html
./c.html
./b.html
with HTML code like this

Code:
<html>
<head>
</head>
<body>
a is <a href="a.html">a.html</a>
index is <a href="index.html">index.html</a>
c is <a href="c.html">c.html</a>
</body>
</html>
So it's flattened properly into a single directory.

For identical filenames it also renames those files (e.g. a/index.html becomes 001index.html) and fixes the linking properly.

Unfortunately, it's not clear from a file's new name what its original location was, so the only thing you have to work with is the files linking to each other. You will probably need some kind of "main index" file that helps Calibre/Sigil get the content into the right order.
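
If the site is already mirrored into subfolders and a second download is not wanted, the same flattening can be approximated offline. The following is a minimal Python sketch, not what pavuk or wget do internally: flatten_site, plan_names and rewrite_links are made-up names, links are found with a simple regular expression rather than a full HTML parser, and each new file name is built from the original path (a/index.html becomes a_index.html) so the original location stays visible. The output folder should be outside the mirrored folder.

```python
import os
import posixpath
import re
import shutil
from urllib.parse import urlsplit, unquote

HTML_EXTENSIONS = (".html", ".htm", ".xhtml")
ATTRIBUTE = re.compile(r"""\b(href|src)\s*=\s*(["'])(.*?)\2""", re.IGNORECASE | re.DOTALL)


def plan_names(root):
    """Map each file path relative to root (using /) to a unique flat name."""
    mapping, used = {}, set()
    for folder, dirs, files in os.walk(root):
        dirs.sort()  # deterministic traversal order
        for name in sorted(files):
            rel = os.path.relpath(os.path.join(folder, name), root).replace(os.sep, "/")
            base = rel.replace("/", "_")
            flat, counter = base, 1
            while flat.lower() in used:  # rare clash: add a numeric prefix
                flat = f"{counter:03d}_{base}"
                counter += 1
            used.add(flat.lower())
            mapping[rel] = flat
    return mapping


def rewrite_links(text, rel, mapping):
    """Point href/src values that resolve to a mirrored file at its flat name."""
    base_dir = posixpath.dirname(rel)

    def replace(match):
        attr, quote, value = match.groups()
        parts = urlsplit(value)
        if parts.scheme or parts.netloc or not parts.path:
            return match.group(0)  # external link or same-page anchor
        target = posixpath.normpath(posixpath.join(base_dir, unquote(parts.path)))
        flat = mapping.get(target)
        if flat is None:  # a link to a folder may mean its index.html
            flat = mapping.get(posixpath.normpath(posixpath.join(target, "index.html")))
        if flat is None:
            return match.group(0)
        suffix = ("?" + parts.query if parts.query else "")
        suffix += ("#" + parts.fragment if parts.fragment else "")
        return f"{attr}={quote}{flat}{suffix}{quote}"

    return ATTRIBUTE.sub(replace, text)


def flatten_site(root, output_dir):
    """Copy every file under root into output_dir, fixing links in HTML files."""
    mapping = plan_names(root)
    os.makedirs(output_dir, exist_ok=True)
    for rel, flat in mapping.items():
        source = os.path.join(root, *rel.split("/"))
        destination = os.path.join(output_dir, flat)
        if rel.lower().endswith(HTML_EXTENSIONS):
            # surrogateescape keeps bytes that are not valid UTF-8 unchanged.
            with open(source, encoding="utf-8", errors="surrogateescape", newline="") as handle:
                text = handle.read()
            with open(destination, "w", encoding="utf-8", errors="surrogateescape", newline="") as handle:
                handle.write(rewrite_links(text, rel, mapping))
        else:
            shutil.copy2(source, destination)
    return mapping
```

The returned mapping records where each flattened file came from, which could serve as the starting point for the "main index" file mentioned above.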


@Andrew: you ninja'd me. But I didn't know that wget had such an option too. wget may be easier to understand than pavuk. You forgot the -r option though, your cmdline does not download subdirs for me. And my wget complains about -nc in conjunction with -nd and uses -nd only. Otherwise it works just as well.

Last edited by frostschutz; 02-06-2012 at 09:26 AM.
Old 02-08-2012, 02:13 AM   #8
cptnemo
Enthusiast
cptnemo began at the beginning.
 
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
Quote:
Originally Posted by andyh2000
I know you've already mirrored the site so might not want to do it again but I use "wget" to mirror web sites and it has a "no directories" option ("-nd") which stops it creating sub-directories. It also fixes up the internal links so nothing is broken.

The actual command I use is:
Code:
wget.exe -p -k -nd -q -E -R js,txt,css -nc %pg%
It seems to be what I need. I ran the command in my Windows terminal (I also added -r) and, now what?

The terminal is saying:
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files\GnuWin32/etc/wgetrc
With the cursor blinking...

Is it downloading something? Where?

Thanks!
Old 04-18-2013, 08:26 AM   #9
gennaro
Junior Member
gennaro began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Apr 2013
Device: kindle
Just tested: WinHTTrack has an option to save all files in the same folder, and it fixes the links:
Set options > Build > Local Structure Type. The options are self-explanatory.
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.