Old 10-25-2008, 10:49 AM   #1
gonzule
before sleep, read or TV?
gonzule began at the beginning.
 
Posts: 103
Karma: 10
Join Date: Apr 2008
Location: Australia
Device: Kobo Libra 2
Wikipedia on Sony Reader

Hello all, this is something I've been looking for since I got my Sony Reader, but no one has been able to do it.

Last night I remembered some projects where people were able to put Wikipedia on their iPods, so I started from there. For those of you who didn't know, Wikipedia has a downloadable version in all its languages, and it can be in XML or HTML format (I think there are some more).

The thing is, you can convert HTML files to ePub using calibre. Why ePub? Because the conversion is done directly and the output file is compressed, saving some more space on the device.

Now, here's the bad part. For Spanish alone (my language), the file is 1.3 GB compressed. Extracted, it is a little more than 1,300,000 files in different folders, including the user comments or discussion for each article, for a nice total of 19 GB of data.

Let's say I remove the user comments/discussions; maybe I'll get the data down to a mere 2 GB uncompressed. After running the files through calibre and compressing/converting them to ePub, let's say it stays at 1.3 GB again (these are just guesses). Remember, this is for the Spanish Wikipedia.
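A minimal sketch of the pruning step described above, assuming the extracted dump marks discussion and user pages with a filename prefix. The prefixes and the function name `article_files` are hypothetical, chosen for illustration; the real filenames in the dump should be inspected before anything is deleted.

```python
from pathlib import Path

# Hypothetical prefixes; inspect the extracted dump to find the real ones.
SKIP_PREFIXES = ("Discusión~", "Usuario~")


def article_files(root, skip_prefixes=SKIP_PREFIXES):
    """Yield HTML files under root whose names do not start with a skipped prefix."""
    prefixes = tuple(skip_prefixes)  # str.startswith needs a tuple, not a list
    for path in sorted(Path(root).rglob("*.html")):
        if not path.name.startswith(prefixes):
            yield path
```

Yielding paths rather than deleting files keeps the sketch non-destructive: the result can be counted first, then fed to whatever conversion step comes next.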

How on earth am I going to put 1.3 GB of data on my Reader? I can get a cheap 2 or 4 GB SanDisk Memory Stick Duo (or SD card) and store the files there. The thing is that there will be more than 600,000 files to store, which the Reader will have to read and manage. Is the Reader even able to handle such an amount of files? Will the battery drain just from trying to list the files in the book collection?

The point of leaving each article as a separate ePub file is that the Reader manages the collection easily by letter and all, so it won't really be necessary to have a "search" function; the Reader would arrange them by title, making it kind of easy to search.

That is also the reason why you couldn't put a lot of files into one single ePub file to decrease the file count on the Memory Stick/SD.

-Does anybody have an idea on how this could be handled?
-Any idea on how to rapidly convert all the files to ePub without having to import them into calibre (kind of like a batch process)?
-Does anybody know if there is a lighter downloadable version of Wikipedia?
-I know there is another downloadable Wikipedia version for TomeRaider, but that isn't compatible with the Sony Reader; maybe we can convert it.
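For the batch-conversion question above, one possible approach is a small script that calls calibre's command-line converter once per file. This is only a sketch: it assumes calibre's `ebook-convert` tool is installed and on the PATH, and the function name `convert_tree` and the `run` parameter are made up for illustration.

```python
import subprocess
from pathlib import Path


def convert_tree(src_root, dst_root, run=subprocess.run):
    """Convert every .html file under src_root to an .epub under dst_root.

    Assumes calibre's `ebook-convert` command is on PATH.
    Returns the list of output paths that were requested.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    outputs = []
    for html in sorted(src_root.rglob("*.html")):
        # Mirror the source folder layout under the destination folder.
        out = dst_root / html.relative_to(src_root).with_suffix(".epub")
        out.parent.mkdir(parents=True, exist_ok=True)
        run(["ebook-convert", str(html), str(out)], check=True)
        outputs.append(out)
    return outputs
```

Starting one converter process per article is likely to be slow for hundreds of thousands of files, so it would be worth timing this on a small folder before committing to the full set.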

I'll keep searching and posting my findings, but it would be great to see if somebody has any feedback to give.

Thank you all.
Old 10-25-2008, 02:45 PM   #2
gonzule
OK, I downloaded a smaller version of the articles, but it was just ONE big XML file... nothing to do there.
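A single large XML file can still be read one article at a time without loading it all into memory. The sketch below assumes the file follows the usual MediaWiki export layout (`<page>` elements containing `<title>` and `<text>`); the helper names are made up for illustration. The text it yields is wiki markup rather than HTML, so it would still need rendering before any ePub conversion.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page> in a MediaWiki XML export, one at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory
```

`source` can be a filename or an open binary file, so the same function works on a compressed dump opened through `bz2.open`.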

I'm still extracting the big 19 GB file with all the articles. After that I have to remove the "user comments" and other extra stuff that we don't need, and then I'll start converting the HTML files.

I still have to see whether I will use html2epub directly (I have to learn to do that from the command prompt) or calibre's GUI.

If anybody knows another way to convert a bunch of HTML files to another format easily, please let me know.
Old 10-25-2008, 04:20 PM   #3
wallcraft
reader
wallcraft ought to be getting tired of karma fortunes by now.
 
Posts: 6,975
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
Is a multi-GB wikipedia ePub practical? Unfortunately, I think the answer is no.

At one level everything is OK: the ePub is internally a directory tree of files, and for Wikipedia each article would be a file. The ePub's metafiles (.opf and .ncx) would be bigger than usual, but perhaps you could design a reasonable TOC. Adobe Digital Editions (ADE) only processes each "file" (article) when it needs to display it on the screen, so that is OK (although other ePub readers might work differently).

The basic problem is with the container, i.e. the ZIP file that is the .epub. This has not been designed, so far as I can tell, to be an efficient replacement for a full filesystem. One approach that I can see an ePub reader taking is to uncompress the entire document to a temporary directory tree when the document is opened and then work from that. This is impractical for a large ePub. So, assume ADE isn't doing this but is doing on-the-fly decompression. What happens when you follow a link to another article? ADE has to find the relevant file in the huge ZIP file without walking the entire file. I don't think this can be done efficiently enough.

I am hand waving a bit in the above, because I don't know all the details of how ZIP and ADE work. However, I am certain that any ebook reader (computer code) designed for books in the MB range will fail on books in the GB range, particularly when running on resource limited hardware.
Old 10-25-2008, 06:35 PM   #4
gonzule
Yeah, I thought about the GB issue you explain. But the idea is to have each article as a separate file, and not everything in one big 19 GB file. Each article is about 29 KB (without images).

Sadly, the links would have to be discarded, as there is no way to link one ebook to another (that I know of). The problem wouldn't be the size of the files, but rather the number of them. Could a Sony Reader handle 600,000 files? I'll give it a try and tell everybody. Maybe in the end we can find a "light" Wikipedia version; probably the DVD version could be used, but of course it won't contain ALL of the articles.
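The figures quoted so far in the thread can be cross-checked with a little arithmetic. This sketch uses only the numbers given above (19 GB, 1,300,000 files, 600,000 articles, 29 KB per article) and assumes binary units (1 KB = 1024 bytes).

```python
KB = 1024
GB = 1024 ** 3

total_bytes = 19 * GB        # extracted dump size quoted in the thread
total_files = 1_300_000      # files in the extracted dump
articles = 600_000           # article count quoted in the thread
article_bytes = 29 * KB      # per-article size quoted above

avg_file_kb = total_bytes / total_files / KB
articles_gb = articles * article_bytes / GB

print(f"average file in the dump: {avg_file_kb:.1f} KB")
print(f"600,000 articles at 29 KB each: {articles_gb:.1f} GB uncompressed")
```

The average file in the dump works out to roughly 15 KB, while 600,000 articles at 29 KB each would come to more than 16 GB uncompressed. If both quoted figures are right, the 2 GB guess would only hold if most articles are far smaller than 29 KB, so the per-article size is worth measuring on a sample.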
Old 10-25-2008, 07:47 PM   #5
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
Posts: 43,850
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
ZIP files have indices, so file access time should be constant, independent of the number of files. That said, the calibre ebook viewer, for instance, unzips the ePub (it's more convenient that way). I don't know what Adobe DE does.
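The index lookup can be illustrated with Python's standard `zipfile` module, which reads the archive's central directory and then fetches a single member by name. This is only an illustration of the ZIP format's behaviour, not of how ADE or the Reader firmware work; the helper names and the `articles/` layout are made up.

```python
import io
import zipfile


def build_archive(count):
    """Build an in-memory ZIP with `count` small article files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(count):
            zf.writestr(f"articles/{i}.html", f"<p>Article {i}</p>")
    buf.seek(0)
    return buf


def read_member(archive, name):
    """Read one member by name; the other members are not decompressed."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(name).decode("utf-8")


archive = build_archive(1000)
print(read_member(archive, "articles/742.html"))
```

Opening the archive still means reading the whole central directory, whose size grows with the number of entries, so the cost of opening a 600,000-entry file on a small device is a separate question from the per-article lookup.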
Old 10-29-2008, 08:30 AM   #6
erayd
Zealot
erayd doesn't litter
 
Posts: 134
Karma: 146
Join Date: Apr 2008
Device: Onyx Boox Poke 2
My guess is it extracts files on an as-needed basis, noting how quickly my 505 can open large epub books. If it was extracting the whole zip first it'd take a fair bit longer.
Old 07-13-2009, 08:49 AM   #7
Lbooker
Addict
Lbooker ought to be getting tired of karma fortunes by now.
 
Posts: 316
Karma: 1021312
Join Date: Jun 2009
Device: Sony PRS-T1
Has anybody tried to convert these Mobipocket Wikipedia files to ePub?
http://pinguinburg.de/wpmp
Old 07-13-2009, 09:18 AM   #8
pking36330
FT Parent PT Reader
pking36330 can program the VCR without an owner's manual.
 
Posts: 322
Karma: 187838
Join Date: Mar 2009
Location: South Alabama
Device: Shocked by how much I've read on an iPod Touch received as a gift!
Quote:
Originally Posted by wallcraft
Is a multi-GB wikipedia ePub practical? Unfortunately, I think the answer is no.
Sure it is, W3Schools even wrote a utility to help:

Download the Internet
Old 07-16-2009, 11:32 AM   #9
dpierron
calibre2opds guru
dpierron shines like a glazed doughnut.
 
Posts: 533
Karma: 8792
Join Date: Aug 2005
Location: Metz, France
Device: iPhone, iPad, PRS-650
Old 08-06-2009, 12:33 AM   #10
scythe000
Junior Member
scythe000 began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Dec 2007
Device: Palm Treo 680
I, too, would love to have Wikipedia on my 505.
Old 08-07-2009, 03:38 PM   #11
ahi
Wizard
ahi ought to be getting tired of karma fortunes by now.
 
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
It would be saner to try doing this via PDF. (But of course pointless, unless the formatting could be gotten tolerably good.)

Why? I already have a 10,000+ page PDF version of Summa Theologica that is perfectly responsive. Maybe I'll try to create a PDF with chapters duplicated therein to get a 100,000+ page PDF, and report on whether it crashed my PRS-505.

- Ahi

Last edited by ahi; 08-07-2009 at 03:45 PM.
Old 08-08-2009, 11:17 PM   #12
ahi
The PDF is willing, but the LaTeX is weak... or so I am finding.

I'm finding myself unable to generate PDF documents larger than about 60,000 pages, although it's not really a page limit that is being hit but a limit on the overall amount of text.

I'll post more on this as I learn more from my attempts.

- Ahi
Old 08-09-2009, 05:18 AM   #13
Lbooker
My hero! I hope you can make it!
Old 08-09-2009, 10:09 AM   #14
ahi
I have just successfully loaded a PDF with the following attributes:

- 53,432 pages
- 133.2 MB
- at least 15,000 items in external table of contents (2540 first level items, and each of those has 4 - 15+ subitems)
- all pages basically full of text (content is the complete Summa Theologica repeated 5 times over)

When loading it from my old-ish SD card, I've observed the following:

- There is a one-time 7-8 minute loading time.
- Page turning is slower than usual, about 3 seconds, regardless of whether one is going to the next page or 30,000 pages further into the text.

When loading it from main memory:

- There is a one-time 4 minute loading time.
- Page turning seems consistently about 3 seconds still.

---------------

So then, it seems like for performance reasons (specific to the Sony PRS-505 at least) putting the full wikipedia into a PDF is out of the question.

However, since my aforementioned 10,000 page PDF has less than 1 minute loading time and pretty snappy page-turning, a reasonably large (in absolute, not relative terms) subsection of wikipedia articles could still be rolled into a PDF eBook.

Not good, if you really wanted to use Wikipedia as an encyclopedia (probably a questionable endeavour with any device that does not have text entry), but might be fine if you instead want to simply have thousands of interesting wikipedia articles available to read on your device.

Something like the 2008/9 Wikipedia Selection for Schools might be a good starting point for conversion. Though I'd personally want more articles and fewer pictures... does anybody know of other good Wikipedia article vetting websites/agencies?

- Ahi

Last edited by ahi; 08-09-2009 at 10:12 AM.
Old 08-09-2009, 11:33 AM   #15
ahi
Warning: Don't ever try to actually use an *external* table of contents that has over a hundred thousand entries. My PRS-505 has been "formatting" said table of contents for over an hour, and is responding to neither soft power-offs, nor to reset presses in the back.

It's working though... not frozen or anything... so will get done eventually.

Update: it's fine now. But yeah... don't check the table of contents if it contains over a hundred thousand entries.

- Ahi

Last edited by ahi; 08-09-2009 at 12:50 PM.
All times are GMT -4.