02-23-2011, 07:32 PM | #1 |
Member
Posts: 23
Karma: 200001
Join Date: Feb 2011
Device: BPDN, Kobo wifi
|
Wikipedia?
I've been attempting to import the 1900+ Wikipedia articles from the kiwix.org "kiwix-0.5.iso" CD as an .epub but am encountering a few bugs along the way.
Kiwix is a project intended to create an offline reader for Wikipedia. It currently uses .ZIM, a non-standard compressed file format which, according to the online documentation, can be read with the open-source libzim. The "kiwix-0.5.iso" archive, however, is an old version which contains an /html/... directory tree of plain, uncompressed web pages for this small selection of Wikipedia texts.
(Update: A utility, "zimdump", is provided as part of the source code package for Zimlib, available from openzim.org; it has been used successfully to convert Kiwix .ZIM archives into CD- or DVD-sized piles of individual *.html and *.png/*.jpg files. Once you have zimlib installed from source, go to zimlib/src/tools and type 'make' to build the optional command-line utilities, which you will need. At that point, 'zimdump -D destination_directory -f first_article_name input_filename.ZIM' should dump everything back to the original format, with articles in destination_directory/A/* and images in destination_directory/I/*. Before importing this mess into Sigil, the articles will need to be renamed to add the '.html' suffix, to replace any blank spaces in the name with _ underscores and to fix any URL-encoded accented/Unicode characters.)
I'd tried renaming the files (which have names like /html/art/a/w/9.html) to something meaningful and then importing them into a Sigil *.epub document. I have noticed one bug in Sigil: if two or more images have the same base filename but different paths, the auto-rename which Sigil uses to resolve the conflict works only sporadically. This leaves many images missing from the resulting *.epub files. Subsequent attempts based on the Kiwix-style *.ZIM archives appear to be more successful, as these abandon the oddball three-level, one-character base name file structure of the old 0.5 version of this collection; this approach was used for the schools encyclopedia described in a subsequent post to this thread.
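The renaming step mentioned in the update above (add '.html', turn blanks into underscores, decode URL-encoded characters) could be scripted roughly as follows. This is only a sketch, not the script actually used: the function names tidy_name and rename_articles are made up for illustration, and replacing any decoded "/" with "_" is an extra assumption to keep names valid on disk.

```python
import os
from urllib.parse import unquote


def tidy_name(raw_name):
    """Return a cleaned-up file name for one dumped article."""
    name = unquote(raw_name)        # decode e.g. "%C3%A9" to "é" (UTF-8 assumed)
    name = name.replace(" ", "_")   # blanks become underscores
    name = name.replace("/", "_")   # assumption: a decoded slash must not create a subdirectory
    if not name.lower().endswith(".html"):
        name += ".html"             # add the missing suffix
    return name


def rename_articles(article_dir):
    """Rename every file in article_dir; return a list of (old, new) pairs."""
    renamed = []
    for raw_name in sorted(os.listdir(article_dir)):
        src = os.path.join(article_dir, raw_name)
        if not os.path.isfile(src):
            continue                # ignore subdirectories
        new_name = tidy_name(raw_name)
        dst = os.path.join(article_dir, new_name)
        if new_name == raw_name or os.path.exists(dst):
            continue                # nothing to do, or the name would collide: leave untouched
        os.rename(src, dst)
        renamed.append((raw_name, new_name))
    return renamed
```

It would be pointed at the article directory of the dump, e.g. rename_articles("destination_directory/A"). Links inside the articles that still use the old names would presumably need the same treatment, which this sketch does not attempt.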
I also find that there seems to be a practical limit (likely no more than 500 typical encyclopaedia articles) on what can be contained in a single *.epub file without creating problems. The table of contents generation in Sigil is also problematic, insofar as it insists on taking every HTML heading (h1, h2, h3, h4, h5) from within the individual articles and creating a multi-megabyte table of contents which is unusable to the reader due to its sheer size. I've split this project into four separate *.epub files (like the alphabetical volumes of a printed encyclopaedia) and removed all but the first-level article names from the table of contents, and the result is almost usable. Almost.
The handling of large tables (such as the main "Version 0.5" content overview page which appears as the first chapter of these generated *.epubs) appears to break badly on the Kobo WiFi. Open the encyclopaedia to the first chapter and, instead of using the menus to skip directly to another chapter using the table of contents, just try paging through Chapter 1 (the huge table listing what's in this selection). At some point (usually on the first page turn) the Kobo will decide that it's taking too long to make sense of such a huge, unwieldy HTML table and reboot itself. This would appear to be a firmware bug, as the text is entirely readable in PC-based tools such as the document viewer in Calibre.
Is there any fix for this issue? Last edited by carlb; 03-04-2011 at 12:44 PM. |
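The two workarounds described in the first post (keeping only the first-level article name for the table of contents, and splitting the collection into alphabetical volumes) might be sketched like this. It is an illustration only: article_title and split_into_volumes are hypothetical names, the default of 500 articles per volume is just the rough limit guessed at above, and nothing here touches Sigil or the .epub itself.

```python
from html.parser import HTMLParser


class FirstH1(HTMLParser):
    """Collect the text of the first <h1> element and ignore every other heading."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True        # later h1 elements are skipped

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def article_title(html_text):
    """Return the first-level heading of one article, or '' if it has none."""
    parser = FirstH1()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())   # collapse runs of whitespace


def split_into_volumes(titles, max_per_volume=500):
    """Sort titles alphabetically (ignoring case) and cut them into volumes."""
    ordered = sorted(titles, key=str.casefold)
    return [ordered[i:i + max_per_volume]
            for i in range(0, len(ordered), max_per_volume)]
```

Each inner list returned by split_into_volumes would then correspond to one *.epub volume, with one table-of-contents entry per title.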
02-23-2011, 08:12 PM | #2 |
Wears funny hat (cloth)
Posts: 28
Karma: 26
Join Date: Dec 2010
Location: Limbo
Device: Kobo WiFi, Kobo Touch
|
Kobo + WikiReader
Good luck on this; it's definitely a worthwhile project. I'm trying it out but unfortunately can't help fix bugs.
Alongside my Kobo I have a $100 Openmoko WikiReader, a little touchscreen monochrome LCD gadget (http://thewikireader.com/) which has most Wikipedia articles (no images, no tables or lists), Wiktionary, Wikitravel, Wikiquotes, and ~33,000 Project Gutenberg books (not very usable but a neat try). |
02-24-2011, 12:05 AM | #3 |
Guru
Posts: 815
Karma: 1029784
Join Date: May 2008
Location: Nebraska, USA
Device: PEZ, Color Libre, 2@Sony T1, Onyx i62HD
|
I purchased two WikiReaders, one for myself and one for my granddaughter's school work. It's great for a quick check of information while I'm in front of the TV (which is where I keep it).
Anyway, the site offers the entire Wikipedia content for download, along with lots of open-source files and "how to" guides for getting other wikis. If you can figure it out and translate it, it's there, free for download: https://github.com/wikireader/wikireader/wiki The main site is http://thewikireader.com/ and http://dev.thewikireader.com/language-packs/ is another site, for development. Unfortunately, all the development is beyond my programming skills, which stopped at BASIC (the old DOS days....) Good luck. AJ |
03-04-2011, 12:07 PM | #4 |
Member
Posts: 23
Karma: 200001
Join Date: Feb 2011
Device: BPDN, Kobo wifi
|
Your door-to-door encyclopaedia salesperson strikes again...
I've posted another set of encyclopaedia .epub files, this one with 5500 Wikipedia articles which had been distributed on CD as an encyclopaedia for schools in October 2008.
As most images (in whatever size they appeared on the original Wikipedia pages) are retained in this collection, the size of each individual volume appears to be about fifty megabytes... effectively requiring a CD's worth of space to store the full fifteen-volume set (186 MB of text, 430 MB of images). This makes even this severely-abridged set too large to upload here. As such, I'm back to peddling encyclopaedias door-to-door and am in this fine neighbourhood today: http://epub.wikipedia.cx
I post the first volume as a sample, absolutely free of charge, at the end of this message as a token of thanks for having heard what I have to offer you today (and, perhaps, because it is the only volume which fits under this site's 20MB *.epub file size limit). While this set is not itself a project of the Wikimedia Foundation, it does use content generated by various individual Wikipedia contributors and is licensed under the GNU Free Documentation License for free use.
I would urge you to go to http://epub.wikipedia.cx today and acquire this fine set of encyclopaedia volumes... that way you may sleep tonight with the security of knowing that your goldfish will not flunk out of their school for lack of brain food and that their future will be secure. Certainly this would be a bargain at twice the price... but there's more. This collection will use about two thirds of the memory in a stock Kobo reader, but perhaps I could interest you in this fine microSD card upgrade, which would provide your home with enough shelving space to store the information equivalent of three DVDs for about C$30, using components available at any local computer store. *Some assembly required. Last edited by carlb; 03-05-2011 at 12:59 AM. |
03-04-2011, 09:15 PM | #5 |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Very nice effort! Well worth the download time...
I did something similar with the 2006 version of the SOS Children's Wikipedia CD and posted my experiences and challenges in the thread entitled Creating HUGE ebooks from the 2006 Wikipedia CD Selection. I too had to host my ebooks off-site (see here), due to the max. upload limit here.
I could contain everything within one ebook primarily because I reduced pictures to a max. image size (150x150) and a max. color depth of 16 colors (4-bit). Without doing this, I too would have had to split the ebook to fit internal memory, but then I would lose some of the links pointing into the other split .epub volumes. The resulting pictures were too small, but they allowed the final ebook size to be reduced! Also, reducing the color depth to 16 colors created some "banding" effects, but again drastically reduced the final ebook size. All in all, an acceptable compromise.
Try reducing your images to 4-bit .png (optimized), i.e. 16 colours (or .jpg/.gif if smaller), then recreate the ebook to see the file size savings... Did you try to create a version without ANY images (just temporarily rename your images folder) just to see the min. ebook filesize? Keep up the good work; I just love these gargantuan ebooks. Last edited by nrapallo; 03-04-2011 at 09:19 PM. |
03-05-2011, 03:32 AM | #6 |
Wears funny hat (cloth)
Posts: 28
Karma: 26
Join Date: Dec 2010
Location: Limbo
Device: Kobo WiFi, Kobo Touch
|
Thanks to carlb for pointing to the shorter Wikipedia at the .cx address. (CX is Christmas Island, an Australian territory not that far from Jakarta. That's exotic enough, but then I went up to the homepage, which seems to be the Portuguese Wikipedia site.) With some effort I loaded the 15 files (600MB in total) onto the 2GB SD card, using Calibre. The Kobo accepted them, after some hiccuping.
Wikipedia on the Kobo lacks search, of course, and getting to an individual article takes loading one of the volumes, then clicking in the TOC to see if your topic is there, but it does work. The format is best at smallest type size, but all the illustrations in the files seem intact. I hope the next firmware will facilitate using such reference works on the Kobo, ideally with ability to have at least two books open at the same time. |
03-05-2011, 08:27 AM | #7 | ||
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Quote:
A few things I had to change (since I'm using a Western-world WinXP computer):
P.S. It appears 16-color (4-bit) .png images don't display properly on older versions of ADE (like the PC previewer I use), so I also prepared a 256-color .png version of "Encyclopedia A.epub" and called it "Encyclopedia A_fixed256.epub". This one is about 3MB larger, at 26.4 MB (27,690,600 bytes). Quote:
Last edited by nrapallo; 03-05-2011 at 09:54 AM. Reason: typo |
||
03-07-2011, 04:05 PM | #8 | |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Quote:
Only one for now... Last edited by nrapallo; 10-23-2014 at 12:06 AM. |
|
03-09-2011, 12:12 PM | #9 | |
Member
Posts: 23
Karma: 200001
Join Date: Feb 2011
Device: BPDN, Kobo wifi
|
Quote:
For such a large quantity of images, it may be easier to use http://www.imagemagick.org/script/convert.php as it can be run from a batch file to -resize or change colour -depth of large numbers of images at once. Wikipedia's server-side MediaWiki software invokes ImageMagick's free convert utility as one commonly-used means to generate thumbnails from uploaded photos for use on content pages; it won't convert .SVG to other formats but will do just about anything else. |
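As a rough idea of what such a batch run could look like, here is a sketch that builds one convert command per image and optionally runs it. Assumptions: the helper names are invented, the 150x150 size and 16 colours are the figures nrapallo suggested earlier in the thread, "-colors" is used here for the palette reduction instead of "-depth", and the program is installed as "convert" (newer ImageMagick releases prefer "magick").

```python
import os
import subprocess

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")


def build_convert_command(src, dst, max_size="150x150", colours=16):
    """Return the argument list for one ImageMagick 'convert' call."""
    return [
        "convert", src,
        "-resize", max_size + ">",   # the trailing ">" means: only shrink larger images
        "-colors", str(colours),     # reduce the image to at most this many colours
        dst,
    ]


def batch_convert(src_dir, dst_dir, run=False):
    """Build (and, if run=True, execute) a command for every image in src_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    commands = []
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith(IMAGE_SUFFIXES):
            continue                 # skip anything that is not an image
        cmd = build_convert_command(os.path.join(src_dir, name),
                                    os.path.join(dst_dir, name))
        commands.append(cmd)
        if run:
            # Passing a list (no shell) keeps the ">" from being treated as a redirect.
            subprocess.run(cmd, check=True)
    return commands
```

Calling batch_convert("I", "I_small") only lists the commands; adding run=True would actually invoke ImageMagick, so it is worth trying on a copy of a few images first.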
|
03-09-2011, 12:49 PM | #10 | |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
http://www.irfanview.com/
Quote:
By the way, do you use Linux, or did you compile the zimdump utility for Windows computers? I wouldn't mind having an executable copy of zimdump... Last edited by nrapallo; 03-09-2011 at 12:52 PM. |
|
11-18-2011, 01:45 PM | #11 |
Member
Posts: 22
Karma: 10
Join Date: Oct 2011
Location: RI, USA
Device: Aluratek Libre, Velocity Cruz T301, EZReader, Iview 435TPC, Wikireader
|
It works on the Aluratek Libre Pro!
The Wikipedia downloads that carlb posted work in the Aluratek Libre Pro! This is excellent! Great work, carlb!
|
|