Thread: Wikipedia?
Old 02-23-2011, 08:32 PM   #1
carlb
Member
Posts: 23
Karma: 200001
Join Date: Feb 2011
Device: BPDN, Kobo wifi
Post Wikipedia?

I've been attempting to import the 1900+ Wikipedia articles from the kiwix.org "kiwix-0.5.iso" CD as an .epub but am encountering a few bugs along the way.

Kiwix is a project intended to create an offline reader for Wikipedia. It currently uses .ZIM, a non-standard compressed file format which can be read with the open-source libzim, according to online documentation. The "kiwix-0.5.iso" archive, however, is an old version which contains an /html/... directory tree with plain, uncompressed web pages for this small selection of Wikipedia texts.

(Update: The Zimlib source code package, available from openzim.org, includes a utility called "zimdump". It has been used successfully to convert Kiwix .ZIM archives into CD- or DVD-sized piles of individual *.html and *.png/*.jpg files. Once you have zimlib installed from source, go to zimlib/src/tools and type 'make' to build the optional command-line utilities, which you will need. At that point, 'zimdump -D destination_directory -f first_article_name input_filename.ZIM' should dump everything back to the original format, with articles in destination_directory/A/* and images in destination_directory/I/*. Before importing this mess into Sigil, the articles will need to be renamed: add the '.html' suffix, replace any blank spaces in the name with _ underscores, and fix any URL-encoded accented/Unicode characters.)
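For anyone repeating this, the renaming step could be scripted roughly as follows. This is only a sketch of the three fixes listed above, not a tested part of my workflow: the function names are made up for illustration, I am assuming the dumped names are UTF-8 percent-encoded, and replacing a decoded "/" with an underscore is my own guess at a safe choice. Links inside the articles are not touched, so they may still need fixing separately.

```python
from pathlib import Path
from urllib.parse import unquote


def normalized_name(raw_name: str) -> str:
    """Return an import-friendly file name for one dumped article."""
    name = unquote(raw_name)          # decode %XX sequences (UTF-8 assumed)
    name = name.replace(" ", "_")     # blank spaces become underscores
    name = name.replace("/", "_")     # assumption: a decoded slash is not usable in a file name
    if not name.lower().endswith(".html"):
        name += ".html"               # add the missing suffix
    return name


def rename_articles(article_dir: Path) -> None:
    """Rename every file in article_dir (e.g. destination_directory/A)."""
    for path in sorted(article_dir.iterdir()):
        if not path.is_file():
            continue
        target = path.with_name(normalized_name(path.name))
        if target == path:
            continue
        if target.exists():
            # Never overwrite: two raw names can normalize to the same result.
            print(f"skipping {path.name}: {target.name} already exists")
            continue
        path.rename(target)
```

Calling rename_articles(Path("destination_directory/A")) would process the article directory in place, so try it on a copy first.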

I tried renaming the files (which have names like /html/art/a/w/9.html) to something meaningful and then importing them into a Sigil *.epub document. I have noticed one bug in Sigil: if two or more images have the same base filename but different paths, the auto-rename which Sigil uses to resolve the conflict is unreliable at best. This leaves many images missing from the resulting *.epub files. Subsequent attempts based on the Kiwix-style *.ZIM archives appear to be more successful, as these abandon the oddball three-level, one-character base name file structure of the old 0.5 version of this collection; the *.ZIM approach was used for the schools encyclopedia described in a subsequent post to this thread.
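One way to sidestep the duplicate-name problem might be to give the images unique flat names before Sigil ever sees them. The sketch below only builds the old-path to new-name mapping; unique_image_names is a hypothetical helper, joining the path components with underscores is just one possible naming scheme, and the image references inside the HTML would still have to be rewritten to match.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def unique_image_names(paths):
    """Map each relative image path to a flat file name unique across the set."""
    by_base = defaultdict(list)
    for p in paths:
        by_base[PurePosixPath(p).name].append(p)

    mapping = {}
    for base, group in by_base.items():
        if len(group) == 1:
            # No conflict: keep the original base name.
            mapping[group[0]] = base
        else:
            # Conflict: fold the directory components into the name.
            for p in group:
                mapping[p] = "_".join(PurePosixPath(p).parts)

    # A flattened name could still clash with an existing one; fail loudly.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("flattened names are not unique; pick another scheme")
    return mapping
```

For example, img/a/logo.png and img/b/logo.png would become img_a_logo.png and img_b_logo.png, while a file whose base name is already unique keeps it.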

I also find that there seems to be a practical limit (likely no more than 500 typical encyclopaedia articles) for what can be contained in a single *.epub file without creating problems. The table of contents generation in Sigil is also problematic, insofar as it insists on taking every HTML heading (h1, h2, h3, h4, h5) from within the individual articles and creating a multi-megabyte table of contents which is unusable to the reader due to its sheer size.
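The oversized table of contents could in principle be trimmed outside Sigil as well. The sketch below assumes the EPUB carries a standard EPUB 2 toc.ncx file using the usual NCX namespace; it drops every nested navPoint so that only the top-level entries remain, then renumbers playOrder. flatten_ncx is an illustrative name, and re-serializing this way loses any DOCTYPE line, so treat it as a starting point rather than a finished tool.

```python
import xml.etree.ElementTree as ET

NCX_NS = "http://www.daisy.org/z3986/2005/ncx/"


def flatten_ncx(ncx_text: str) -> str:
    """Return the NCX document with only first-level navPoint entries kept."""
    ET.register_namespace("", NCX_NS)   # keep the default namespace on output
    root = ET.fromstring(ncx_text)
    nav_map = root.find(f"{{{NCX_NS}}}navMap")
    if nav_map is None:
        raise ValueError("no navMap element found")

    tag = f"{{{NCX_NS}}}navPoint"
    # findall() with a plain tag matches direct children only.
    for order, point in enumerate(nav_map.findall(tag), start=1):
        for child in point.findall(tag):
            point.remove(child)          # drop nested heading entries
        point.set("playOrder", str(order))
    return ET.tostring(root, encoding="unicode")
```

The result would be written back over toc.ncx inside the unpacked EPUB before zipping it up again.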

I've split this project into four separate *.epub files (like the alphabetical volumes of a printed encyclopaedia) and removed all but the first-level article names from the table of contents and the result is almost usable. Almost.
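The split itself is simple enough to sketch. The letter ranges below match the four attached volumes; everything else is assumption: volume_for and split_into_volumes are illustrative names, accented initials are filed under their base letter, and titles that do not start with a letter (digits, punctuation) are dumped into the first volume.

```python
import unicodedata

# Letter ranges of the four volumes.
VOLUMES = [("A", "D"), ("E", "K"), ("L", "Q"), ("R", "Z")]


def volume_for(title: str) -> str:
    """Return the volume label (e.g. "E-K") for one article title."""
    # NFKD splits an accented letter into base letter + combining mark,
    # so the first character of "Éire" becomes a plain "E".
    decomposed = unicodedata.normalize("NFKD", title.strip())
    initial = decomposed[:1].upper()
    for first, last in VOLUMES:
        if first <= initial <= last:
            return f"{first}-{last}"
    return "A-D"  # assumption: non-letter initials go in the first volume


def split_into_volumes(titles):
    """Group titles by volume, sorted case-insensitively within each one."""
    volumes = {f"{first}-{last}": [] for first, last in VOLUMES}
    for title in sorted(titles, key=str.casefold):
        volumes[volume_for(title)].append(title)
    return volumes
```

Each list can then be imported into its own Sigil document, one *.epub per volume.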

The handling of large tables (such as the main "Version 0.5" content overview page which appears as the first chapter of these generated *.epubs) appears to be breaking badly on the Kobo wifi. Open the encyclopaedia to the first chapter and, instead of using the menus to skip directly to another chapter via the table of contents, just try paging through Chapter 1 (the huge table listing what's in this selection). At some point (usually on the first page turn) the Kobo will decide that it's taking too long to make sense of such a huge, unwieldy HTML table and reboot itself.

This would appear to be a firmware bug, as the text is entirely readable on PC-based tools such as the document viewer in Calibre.

Is there any fix for this issue?
Attached Files
File Type: epub Encyclopedia A-D.epub (14.35 MB, 235 views)
File Type: epub Encyclopedia E-K.epub (10.31 MB, 169 views)
File Type: epub Encyclopedia L-Q.epub (9.52 MB, 167 views)
File Type: epub Encyclopedia R-Z.epub (9.82 MB, 170 views)

Last edited by carlb; 03-04-2011 at 01:44 PM.