#1
before sleep, read or TV?
Posts: 106
Karma: 10
Join Date: Apr 2008
Location: Australia
Device: Kobo Libra 2
Hello all, this is something I've been looking into since I got my Sony Reader, but no one has been able to do it.

Last night I remembered some projects where people put Wikipedia on their iPods, so I started from there. For those of you who didn't know, Wikipedia has a downloadable version, in all of its languages, in XML or HTML format (I think there are a few more). The thing is, you can convert the HTML files using calibre, from HTML to ePub. Why ePub? Because the conversion is direct and the output file is compressed, saving some more space on the device.

Now, here's the bad part. The Spanish edition alone (my language) is 1.3 GB compressed. It extracts to a little more than 1,300,000 files in different folders, including the user comments/discussion pages for each article, for a nice total of 19 GB of data. Let's say I remove the discussion pages; maybe that gets it down to a mere 2 GB uncompressed. After running the files through calibre and converting/compressing them to ePub, let's say it ends up around 1.3 GB again (these are just guesses I'm testing; remember, this is for the Spanish Wikipedia). How on earth am I going to put 1.3 GB of data on my reader?

I can get a cheap 2 or 4 GB SanDisk Memory Stick Duo (or SD card) and store the files there; the thing is, that would be more than 600,000 files for the reader to read and manage. Is the reader even able to handle such an amount of files? Will the battery drain just from trying to list them in the book collection?

The point of leaving each article as a separate ePub file is that the reader arranges the collection by title, so a "search" function wouldn't really be necessary; the alphabetical listing makes it fairly easy to find an article. That is also the reason you couldn't roll a lot of articles into one single ePub file to decrease the file count on the memory stick/SD card.

- Does anybody have an idea how this could be handled?
- Any idea how to rapidly convert all the files to ePub without having to import them into calibre (a kind of batch process)?
- Does anybody know if there is a lighter downloadable Wikipedia version?
- I know there is another downloadable Wikipedia version for TomeRaider, but that isn't compatible with the Sony Reader; maybe we can convert it.

I'll keep searching and posting my findings, but it would be great to see if somebody has any feedback. Thank you all.
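For the pruning step mentioned above, something along these lines might work. This is a minimal sketch in Python; the dump path and the file-name markers for discussion/user pages are assumptions (static dumps differ between versions), so verify them against your extracted tree before deleting anything:

```python
import os

DUMP_ROOT = "wikipedia-es-html"  # hypothetical path to the extracted dump
# Hypothetical file-name markers for non-article pages in the Spanish dump;
# check these against the real file names before running.
SKIP_MARKERS = ("Discusi", "Usuario")

removed = 0
for dirpath, _dirnames, filenames in os.walk(DUMP_ROOT):
    for fname in filenames:
        if any(marker in fname for marker in SKIP_MARKERS):
            os.remove(os.path.join(dirpath, fname))
            removed += 1

print(f"removed {removed} non-article files")
```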
#2
before sleep, read or TV?
Posts: 106
Karma: 10
Join Date: Apr 2008
Location: Australia
Device: Kobo Libra 2
OK, I downloaded a smaller version of the articles, but it was just ONE big XML file... nothing to be done there.

I'm still extracting the big 19 GB file with all the articles. After that I have to remove the "user comments" and other extra stuff we don't need, and then I'll start converting the HTML files. I still have to see whether I will use html2epub directly (I have to learn to do that from the command prompt) or calibre's GUI. If anybody knows another way to easily convert a bunch of HTML files to another format, please let me know.
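calibre does ship command-line tools alongside the GUI, so a batch run is possible without importing anything into the library. A minimal sketch, assuming calibre's ebook-convert is on the PATH (older releases shipped html2epub with similar input/output arguments); the folder names are placeholders:

```python
import pathlib
import subprocess

SRC = pathlib.Path("articles-html")  # placeholder: extracted article HTML
DST = pathlib.Path("articles-epub")  # placeholder: output folder
DST.mkdir(exist_ok=True)

for html_file in SRC.rglob("*.html"):
    epub_file = DST / (html_file.stem + ".epub")
    if epub_file.exists():
        continue  # skip finished files so an interrupted batch can resume
    # ebook-convert infers the conversion direction from the extensions.
    subprocess.run(["ebook-convert", str(html_file), str(epub_file)],
                   check=True)
```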
#3
reader
Posts: 6,977
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
Is a multi-GB Wikipedia ePub practical? Unfortunately, I think the answer is no.

At one level everything is OK: the ePub is internally a directory tree of files, and for Wikipedia each article would be a file. The ePub's metafiles (.opf and .ncx) would be bigger than usual, but perhaps you could design a reasonable TOC. Adobe Digital Editions (ADE) only processes each "file" (article) when it needs to display it on the screen, so that is OK (although other ePub readers might work differently).

The basic problem is with the container, i.e. the ZIP file that is the .epub. So far as I can tell, ZIP has not been designed to be an efficient replacement for a full filesystem. One approach an ePub reader could take is to uncompress the entire document to a temporary directory tree when the document is opened and then work from that. This is impractical for a large ePub. So assume ADE isn't doing this but is doing on-the-fly decompression instead. What happens when you follow a link to another article? ADE has to find the relevant file in the huge ZIP file without walking the entire archive, and I don't think this can be done efficiently enough.

I am hand-waving a bit in the above, because I don't know all the details of how ZIP and ADE work. However, I am certain that any ebook reader (computer code) designed for books in the MB range will fail on books in the GB range, particularly when running on resource-limited hardware.
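To see that internal directory tree for yourself, here is a minimal sketch using Python's standard zipfile module ("book.epub" is a placeholder for any ePub on disk):

```python
import zipfile

# An .epub is an ordinary ZIP archive; list its internal file tree.
with zipfile.ZipFile("book.epub") as zf:
    for name in zf.namelist():
        print(name)  # e.g. mimetype, META-INF/container.xml, the .opf and .ncx
```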
#4
before sleep, read or TV?
Posts: 106
Karma: 10
Join Date: Apr 2008
Location: Australia
Device: Kobo Libra 2
Yeah, I thought about the GB issue you explain, but the idea is to have each article as a separate file, not everything in one big 19 GB file. Each article is about 29 KB (without images).

Sadly, inter-article links would have to be discarded, as there is no way (that I know of) to link one ebook to another. The problem wouldn't be the size of each file, but rather the number of them. Could a Sony Reader handle 600,000 files? I'll give it a try and tell everybody.

Maybe in the end we can find a "light" Wikipedia version. The DVD version could probably be used, but of course it won't contain ALL of the articles.
#5
creator of calibre
Posts: 45,185
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
ZIP files have indices, so file access time should be constant, independent of the number of files. That said, the calibre ebook viewer, for instance, unzips the ePub (it's more convenient that way). I don't know what Adobe DE does.
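A minimal sketch of that point, using Python's standard zipfile module ("huge.epub" is a placeholder): the archive's central directory is read once up front, after which any single member can be opened without scanning the rest of the file:

```python
import time
import zipfile

with zipfile.ZipFile("huge.epub") as zf:
    names = zf.namelist()            # served from the central directory
    target = names[len(names) // 2]  # an arbitrary member in the middle

    start = time.perf_counter()
    with zf.open(target) as member:  # seeks directly to the member's data
        data =
    print(f"read {len(data)} bytes from {target!r} "
          f"in {time.perf_counter() - start:.4f} s")
```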
#6
Zealot
Posts: 134
Karma: 146
Join Date: Apr 2008
Device: Onyx Boox Poke 2
My guess is that it extracts files on an as-needed basis, judging by how quickly my 505 can open large ePub books. If it were extracting the whole ZIP first, it would take a fair bit longer.
#7
Addict
Posts: 316
Karma: 1021312
Join Date: Jun 2009
Device: Sony PRS-T1
Did anybody try to convert those Mobipocket Wikipedia files to ePub?
http://pinguinburg.de/wpmp |
#8
FT Parent PT Reader
Posts: 322
Karma: 187838
Join Date: Mar 2009
Location: South Alabama
Device: Shocked by how much I've read on an iPod Touch received as a gift!
Quote:
Download the Internet |
#9
calibre2opds guru
Posts: 533
Karma: 8792
Join Date: Aug 2005
Location: Metz, France
Device: iPhone, iPad, PRS-650
#10
Junior Member
Posts: 4
Karma: 10
Join Date: Dec 2007
Device: Palm Treo 680
I, too, would love to have Wikipedia on my 505.
#11
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
It would be saner to try doing this via PDF (though of course pointless unless the formatting could be made tolerably good).

Why? I already have a 10,000+ page PDF version of the Summa Theologica that is perfectly responsive. Maybe I'll try to create a PDF with the chapters duplicated to get a 100,000+ page PDF, and report on whether it crashes my PRS-505.

- Ahi

Last edited by ahi; 08-07-2009 at 03:45 PM.
#12
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
The PDF is willing, but the LaTeX is weak... or so I am finding.

I'm finding myself unable to generate PDF documents larger than about 60,000 pages... although it's not really a page limit but an overall amount-of-text limit that is being hit.

I'll post more on this as I learn more from my attempts.

- Ahi
#13
Addict
Posts: 316
Karma: 1021312
Join Date: Jun 2009
Device: Sony PRS-T1
My hero! I hope you can make it!
#14
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
I have just successfully loaded a PDF with the following attributes:

- 53,432 pages
- 133.2 MB
- at least 15,000 items in the external table of contents (2,540 first-level items, each with 4 to 15+ subitems)
- all pages basically full of text (the content is the complete Summa Theologica repeated five times over)

When loading it from my old-ish SD card, I've observed the following:

- There is a one-time 7-8 minute loading time.
- Page turning is slower than usual... about 3 seconds, regardless of whether one is going to the next page or 30,000 pages further in the text.

When loading it from main memory:

- There is a one-time 4-minute loading time.
- Page turning is still consistently about 3 seconds.

So it seems that, for performance reasons (specific to the Sony PRS-505 at least), putting the full Wikipedia into a single PDF is out of the question. However, since my aforementioned 10,000-page PDF has less than a minute of loading time and pretty snappy page-turning, a reasonably large (in absolute, not relative, terms) subsection of Wikipedia articles could still be rolled into a PDF ebook.

Not good if you really want to use Wikipedia as an encyclopedia (probably a questionable endeavour on any device without text entry), but it might be fine if you instead simply want thousands of interesting Wikipedia articles available to read on your device.

Something like the 2008/9 Wikipedia Selection for Schools might be a good starting point for conversion, though I'd personally want more articles and fewer pictures... does anybody know of other good Wikipedia article vetting websites/agencies?

- Ahi

Last edited by ahi; 08-09-2009 at 10:12 AM.
#15
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Warning: don't ever try to actually use an *external* table of contents that has over a hundred thousand entries. My PRS-505 has been "formatting" said table of contents for over an hour, and is responding to neither soft power-offs nor reset presses on the back.

It's working, though... not frozen or anything... so it will get done eventually.

Update: it's fine now. But yeah... don't open the table of contents if it contains over a hundred thousand entries.

- Ahi

Last edited by ahi; 08-09-2009 at 12:50 PM.
Similar Threads
Thread | Thread Starter | Forum | Replies | Last Post |
Leo's Wikipedia Reader for Hanlin | leobueno | HanLin eBook | 28 | 02-08-2010 12:33 PM |
Wikipedia for the Reader | kichigai | Sony Reader | 5 | 08-15-2009 09:20 AM |
Wikipedia on Sony Reader through Tomeraider? | noxxle | Sony Reader | 0 | 05-16-2008 11:29 PM |
wikipedia for sony reader? | rsdavis9 | BBeB/LRF Books | 2 | 11-26-2007 03:39 PM |
Wikipedia CD on the Reader | hn_88 | Sony Reader | 18 | 02-13-2007 09:11 AM |