HTML file doesn't import to ZIP

webdad · 11-05-2011, 05:07 PM

I'm trying to convert a collection of downloaded HTML pages to an ebook. The pages are downloaded to a directory with corresponding subdirectories for the "complete" portion of each page.

page80.html
page81.html
page80_files <--- Subdirectory
page81_files <--- Subdirectory

I have the following "TOC" HTML file in the same main directory:

<html>
<body>
<h1>Table of Contents</h1>
<p style="text-indent:0pt">
<a href="./page80.html">80</a><br/>
<a href="./page81.html">81</a><br/>
</p>
</body>
</html>
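
For a larger set of pages, a TOC file with this shape could be generated rather than typed by hand. The following is only a minimal Python sketch: `build_toc` is a hypothetical helper (not part of Calibre), and it assumes the page files sit in the same directory as the TOC file.

```python
import html
from pathlib import Path


def build_toc(page_names, title="Table of Contents"):
    """Return a TOC page with one relative link per file name in page_names."""
    links = "\n".join(
        # Link text is the file name without its extension (an arbitrary choice).
        f'<a href="./{html.escape(name, quote=True)}">'
        f"{html.escape(Path(name).stem)}</a><br/>"
        for name in page_names
    )
    return (
        "<html>\n<body>\n"
        f"<h1>{html.escape(title)}</h1>\n"
        '<p style="text-indent:0pt">\n'
        f"{links}\n"
        "</p>\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    # Hypothetical usage: list the saved pages in the current directory.
    pages = sorted(p.name for p in Path(".").glob("page*.html"))
    print(build_toc(pages))
```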

When I drag the TOC file into Calibre, it comes in as just an HTML file. No corresponding ZIP file containing the TOC file, the sub pages, and the sub page contents is created.

When I open the TOC file in the Calibre viewer and examine the href links, they resolve to a temp directory in C:\users\...appdata ... yada yada yada\page80.html, which isn't where the files are truly located.

So, for some reason, Calibre isn't pulling the member pages or the contents of the subdirectories into a ZIP file. Could an error in the HTML of the underlying pages cause this? Any ideas on how to find out what the issue is?

I've looked through the messages and threads here but haven't seen this type of issue.

Thanks.

DoctorOhh · 11-05-2011, 08:17 PM

Quote:

Originally Posted by webdad

So, for some reason, Calibre isn't pulling the member pages or the contents of the subdirectories into a ZIP file. Could an error in the HTML of the underlying pages cause this? Any ideas on how to find out what the issue is?

Go to Preferences - Plugins - File Type Plugins and make sure the HTML to ZIP plugin is enabled.

webdad · 11-06-2011, 10:34 AM

The plug-in is green. I toggled it off to gray / back on to green and deleted my original doc.

When I added it back in, same result. Just an HTML file.

This is Calibre portable running from the disk where the docs are stored (which isn't my C drive). The Calibre library for this portable install is in the default location (on the same disk as the docs and Calibre portable).

I also see that 0.8.25 just came out, so I upgraded everything and the behavior didn't change.

Lastly, as a test, I saved two pages of this forum using the same "File | Save page as | complete" function in Firefox, and created a corresponding TOC file.

That TOC file looks like this:
<html>
<body>
<h3>Table of Contents</h3>
<p style="text-indent:0pt">
<a href="./MobiThread1Complete.html">Part One</a><br/>
<a href="./MObiThread2Complete.html">Part Two</a><br/>
</p>
</body>
</html>

I dragged and dropped this MobiThread TOC file into Calibre and that DID create a ZIP file.

So, I'm wondering what the differences are, since the two TOC files are virtually identical in format.

Thanks for the help.

theducks · 11-06-2011, 11:25 AM

@webdad

By any chance, are you NOT running on a case-insensitive OS (Windows)?
Those two example HTML files have really mixed-up case in their file names.
A case-sensitive OS needs the file names to match EXACTLY, including the extension.
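
One way to check this without eyeballing every name is a small script that compares each link in the TOC file against the real directory listing. This is only a sketch using the Python standard library: `check_links` and `HrefCollector` are hypothetical names, not anything Calibre provides, and only the final file name of each link is compared for case.

```python
import os
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def check_links(toc_path):
    """Return (href, status) pairs: 'ok', 'case mismatch' or 'missing'."""
    base = os.path.dirname(os.path.abspath(toc_path))
    collector = HrefCollector()
    with open(toc_path, encoding="utf-8", errors="replace") as f:
        collector.feed(f.read())

    results = []
    for href in collector.hrefs:
        parsed = urlparse(href)
        rel = unquote(parsed.path)
        if parsed.scheme or parsed.netloc or not rel:
            continue  # skip external links and bare #fragments
        folder, name = os.path.split(os.path.normpath(os.path.join(base, rel)))
        try:
            # listdir reports names as stored on disk, even on Windows.
            entries = os.listdir(folder)
        except OSError:
            entries = []
        if name in entries:
            status = "ok"
        elif name.lower() in (entry.lower() for entry in entries):
            status = "case mismatch"
        else:
            status = "missing"
        results.append((href, status))
    return results
```

Calling `check_links` on the TOC file and printing the pairs shows which links would break on a case-sensitive system.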

webdad · 11-06-2011, 09:16 PM

Yeah, that is an artifact of a quick and dirty test. The OS is Windows 7, but I've gone back and checked the case anyway. The names all match, as strange as they are.

The failing file is at least consistent.

kovidgoyal · 11-06-2011, 09:19 PM

Run calibre in debug mode (right-click the Preferences button) and you will get more info about what is going wrong.

webdad · 11-13-2011, 12:03 PM

Thanks for all the assistance and comments.

I ran in debug mode and found that the parser was throwing an error while processing the header information. The files have a large embedded stylesheet along with lots of other code that isn't needed for this conversion (approximately 1200 lines of code/CSS).

So, I found a nice basic data parsing tool and created a simple script to extract just the text of each page.
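
The tool and script used here aren't named in the thread. Purely as an illustration of the idea, the sketch below uses Python's standard `html.parser` to drop everything inside `<head>`, `<style>` and `<script>` and keep only the visible text. `TextExtractor` and `extract_text` are hypothetical names, and the list of block-level tags is an assumption that may need adjusting for real pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; ignore anything inside head, style or script."""

    SKIP = {"head", "style", "script"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # start block-level content on a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(markup):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

The extracted text could then be wrapped in a minimal HTML page and linked from the TOC file in place of the original download.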

Everything looks good now.

Thanks again
