#1 |
Enthusiast
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
|
Converting website to epub
Hello,
I have been trying for a few days to convert a website that I downloaded to epub, without success. This is what I have tried:
1) I mirrored the site with HTTrack. Job done: the site is now happily sitting on my hard drive, divided into one subfolder for every HTML page. In total there are 1300 pages, 99.99% pure text with very few images.
2) I tried to import the website into Calibre following the instructions given in Calibre's guide (import the contents page, then convert to epub).
PROBLEM: Calibre does its job by moving all the HTML pages into one folder (the Text folder) and renaming the pages, but without updating the internal links. That is, all the links connecting one page to another within the site/epub are broken.
What steps should I follow to maintain my internal links? Doing the work manually is out of the question (we are talking about 1300 pages). Do you know of any software that will help me move and rename the pages (they are all named index.html) from the subfolders into a root folder without breaking the links?
Thanks!
#3 | |
Enthusiast
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
|
Quote:
|
|
#4 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
You need to use Sigil to examine how the links are broken. They are likely all broken in a similar way within each section, which you can fix with search and replace, so it might take far fewer fixes than you expect. Spaces in file names or mismatched capitalization may also be to blame.
The text folder in your epub will need to be broken down into sections anyway, because most readers get unhappy if individual sections are too large (in the case of my Sony, over 300k). If you use Sigil to perform this chore, be sure to keep backup copies as you go. You might make an error and Sigil will offer to fix it automatically... don't let it. Sigil is often right in its fixing, but when it is wrong, half of what you are working on might disappear. Either fix the error yourself, if you understand the code view well, or just load your last saved version.
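To get an overview of how the links are broken before opening anything in Sigil, something along these lines might help. It is only a rough Python sketch, not a Sigil or Calibre feature: the function names `find_broken_links` and `oversized_files` are made up for illustration, the pages are assumed to be UTF-8, and the 300k limit is simply the Sony figure mentioned above. Note that on Windows and macOS the file system usually ignores case, so capitalization mismatches will not show up there.

```python
import os
import re
from urllib.parse import urldefrag, unquote

# Matches href="..." or href='...' (a simple regex, not a full HTML parser).
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
# Matches a URL scheme such as http:, https:, mailto:
SCHEME_RE = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*:')


def find_broken_links(root):
    """Return (html_file, href) pairs whose local target does not exist."""
    broken = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.lower().endswith((".html", ".htm", ".xhtml")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            for href in HREF_RE.findall(text):
                target, _fragment = urldefrag(href)
                # Skip same-page anchors and external links.
                if not target or SCHEME_RE.match(target) or target.startswith("//"):
                    continue
                resolved = os.path.normpath(os.path.join(dirpath, unquote(target)))
                if not os.path.exists(resolved):
                    broken.append((path, href))
    return broken


def oversized_files(root, limit_bytes=300 * 1024):
    """Return (path, size) pairs for files larger than limit_bytes."""
    result = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                result.append((path, size))
    return result
```

Running `find_broken_links` on the extracted Text folder and looking at the first few hundred results should show whether the links all fail in the same way, which is what makes a single search and replace feasible.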
#5 | |
Enthusiast
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
|
Quote:
|
|
#6 | |
Avid reader
Posts: 880
Karma: 6399168
Join Date: Apr 2009
Location: UK
Device: Samsung Galaxy Z Flip 4 / Kindle Paperwhite / TCL Nxtpaper 14
|
Quote:
The actual command I use is:
Code:
wget.exe -p -k -nd -q -E -R js,txt,css -nc %pg%
-p: get all images, etc. needed to display HTML page
-k: make links in downloaded HTML point to local files
-nd: don't create directories
-q: quiet
-E: save HTML documents with `.html' extension
-R: comma-separated list of rejected extensions: js,txt,css
-nc: skip downloads that would download to existing files
and you'll need to add "-r" for recursive download (thanks frostschutz for pointing that out below - my usage was for turning a single page into an epub)
You can get it from here: http://gnuwin32.sourceforge.net/packages/wget.htm
Hope this helps
Andrew
Last edited by andyh2000; 02-06-2012 at 09:58 AM. Reason: Oops - forgot the option "-r" for recursive
|
#7 |
Linux User
Posts: 2,282
Karma: 6123806
Join Date: Sep 2010
Location: Heidelberg, Germany
Device: none
|
Try pavuk. Pavuk is a monster when it comes to HTTP downloading.
pavuk can flatten the HTML structure to a degree; unfortunately, its options are hard to understand! For testing I created a web site structure like this:
Code:
./a
./a/a.html
./b
./b/b.html
./c
./c/c.html
./index.html
Code:
<html>
<head>
</head>
<body>
a is <a href="../a/a.html">a.html</a>
index is <a href="../index.html">index.html</a>
c is <a href="../c/c.html">c.html</a>
</body>
</html>
Code:
pavuk -mode mirror -base_level 8 -sel_to_local http://localhost/pavuk/index.html
Code:
./index.html
./a.html
./c.html
./b.html
Code:
<html>
<head>
</head>
<body>
a is <a href="a.html">a.html</a>
index is <a href="index.html">index.html</a>
c is <a href="c.html">c.html</a>
</body>
</html>
For identical filenames it also renames those files (e.g. a/index.html becomes 001index.html) and fixes the linking properly. Unfortunately it's not clear from the new name what the original location was, so the only thing you have to work with is the files linking to each other. You will probably need some kind of "main index" file that helps Calibre/Sigil get the content into the right order.
@Andrew: you ninja'd me. But I didn't know that wget had such an option too. wget may be easier to understand than pavuk.
Last edited by frostschutz; 02-06-2012 at 09:26 AM.
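For anyone who already has the mirror on disk and would rather not download it again, the same idea (flatten the tree, give each page a unique name, rewrite the links) can be sketched in a few lines of Python. This is not how pavuk or wget do it internally. The function name `flatten_site` and the naming scheme (`a/index.html` becomes `a_index.html`, which at least keeps the original location visible) are made up for illustration. It assumes the pages are UTF-8, that the output folder lies outside the mirror, and it only handles links between HTML pages: images and stylesheets are neither copied nor relinked.

```python
import os
import re
from urllib.parse import urldefrag, unquote, quote

# Captures: 'href=' prefix, the quote character, and the link itself.
HREF_RE = re.compile(r'(href\s*=\s*)(["\'])([^"\']+)\2', re.IGNORECASE)
SCHEME_RE = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*:')
HTML_EXT = (".html", ".htm")


def flatten_site(src_root, dst_dir):
    """Copy every HTML page under src_root into dst_dir under a unique
    flat name and rewrite links between pages to use the new names.
    Returns the mapping {original absolute path: new file name}."""
    src_root = os.path.abspath(src_root)
    os.makedirs(dst_dir, exist_ok=True)

    # Pass 1: pick a unique flat name for every page.
    mapping, used = {}, set()
    for dirpath, dirs, files in os.walk(src_root):
        dirs.sort()
        for name in sorted(files):
            if not name.lower().endswith(HTML_EXT):
                continue
            src = os.path.join(dirpath, name)
            flat = os.path.relpath(src, src_root).replace(os.sep, "_")
            stem, ext = os.path.splitext(flat)
            n = 1
            while flat in used:  # e.g. a/b.html vs. a_b.html
                flat = f"{stem}_{n}{ext}"
                n += 1
            used.add(flat)
            mapping[src] = flat

    # Pass 2: rewrite the links and write the flattened copies.
    for src, flat in mapping.items():
        src_dir = os.path.dirname(src)
        with open(src, encoding="utf-8", errors="replace") as fh:
            text = fh.read()

        def relink(match):
            prefix, q, href = match.groups()
            target, fragment = urldefrag(href)
            # Leave anchors, external URLs and absolute paths untouched.
            if not target or SCHEME_RE.match(target) or target.startswith("/"):
                return match.group(0)
            resolved = os.path.normpath(os.path.join(src_dir, unquote(target)))
            new_name = mapping.get(resolved)
            if new_name is None:  # not one of the mirrored pages
                return match.group(0)
            new_href = quote(new_name) + ("#" + fragment if fragment else "")
            return f"{prefix}{q}{new_href}{q}"

        with open(os.path.join(dst_dir, flat), "w", encoding="utf-8") as fh:
            fh.write(HREF_RE.sub(relink, text))
    return mapping
```

On the test structure above, calling `flatten_site("pavuk", "flat")` would produce `index.html`, `a_a.html`, `b_b.html` and `c_c.html`, with the links inside each page pointing at the new names. The main index page would still be needed to tell Calibre/Sigil the reading order.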
#8 | |
Enthusiast
Posts: 35
Karma: 10
Join Date: Oct 2011
Device: Kindle 3
|
Quote:
The terminal is saying:
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files\GnuWin32/etc/wgetrc
with the cursor blinking... Is it downloading something? Where to?
Thanks!
|
#9 |
Junior Member
Posts: 6
Karma: 10
Join Date: Apr 2013
Device: kindle
|
Just tested: WinHTTrack has an option to save all files in the same folder, and it fixes the links:
Set Options > Build > Local Structure Type. The available options are self-explanatory.
Thread | Thread Starter | Forum | Replies | Last Post |
Rules of Civil Procedure epub from website | sk1 | Recipes | 3 | 01-31-2012 01:53 PM |
Website > Ebook : ePub converter? | re838uk | ePub | 9 | 07-13-2011 08:24 AM |
Converting entire website to ePub... | sharp21 | Conversion | 4 | 05-31-2011 12:00 PM |
Epub as a website | pittendrigh | Introduce Yourself | 4 | 03-29-2011 06:36 AM |
epub file website downloads | stunev | ePub | 3 | 07-23-2010 12:44 PM |