11-09-2009, 06:36 AM   #1
okalyddude
Enthusiast
Device: E600
Wikipedia offline and .tar format

Hey, I know I've asked this elsewhere, but I thought it deserved its own topic:

Is anyone familiar with the .tar format?

Does anyone know a way to convert the Wikipedia collection's .tar format into something usable on the PRS-600 (or other devices)?

You can download an offline version of Wikipedia here: http://schools-wikipedia.org/

It's not the full Wikipedia, but it should be adequate for the kind of quick lookup you'd want to do on your reader (or the always-fun random article). At 5 gigabytes, it easily fits on the 600/700 (or any device that takes a large memory card).

-Can calibre do anything with the format?

-If we loaded it as a single file, is it even conceivable that it would open? (The dictionary opens as one file, but it's nowhere near as large.)

-If we loaded it as multiple files (lots of them) and put them under a single 'collection', is there any way to keep them off the regular books list? If not, we'd be forced to put all books into collections and use those exclusively, since the regular 'books' list would be clogged.

-If you did this and did a 'book search' for the article you wanted, could it search through such a large collection? Could you even MAKE such a large collection?

I know that anything fancy (like supplementing the dictionary lookup option with, say, a triple-click offline Wikipedia search) would need a firmware hack (crossing my fingers that people keep working on this), but I was hoping to figure out a way to get something working under the current format restrictions. Maybe it's just a pipe dream, but it sure seems like it could be possible.
11-09-2009, 06:48 AM   #2
Nate the great
Sir Penguin of Edinburgh
Device: Shake a stick plus 1
Tar is just a compression format. Most any Linux computer will be able to extract the contents. I'm pretty sure WinRAR can also decompress it on a Windows machine.

I have a copy, and the content is just web pages.
11-09-2009, 06:54 AM   #3
Slite
Icanhasdonuts?
Device: Kobo Aura 2nd edition, Kobo Clara HD
Quote:
Originally Posted by Nate the great View Post
Actually, tar by itself doesn't compress anything; it just archives a bunch of files into a single file (a "tarball") while preserving directory structure, permissions and such.

It was originally intended for backup purposes and is short for Tape ARchiver (tar). It has become the de facto file-archiving format on *nix systems, and it is usually used in conjunction with gzip or bzip2 to handle the compression of the tarball.
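
If you'd rather script the unpacking than click around in an archiver, a couple of lines of Python will do it (a minimal sketch; the archive filename here is an assumption):

import tarfile

# "r:*" lets tarfile autodetect plain, gzip or bzip2 compression.
with tarfile.open("wikipedia-schools.tar", "r:*") as archive:
    archive.extractall(path="wikipedia-schools")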
11-09-2009, 09:02 AM   #4
okalyddude
Enthusiast
Device: E600
So in conclusion, we have an archive containing thousands of small files... HTML? My extraction got cancelled partway through, so I'll have to retry (or maybe re-download) before I can take a look.

If it is a bunch of HTML, could I do this? Does the PRS-600 support HTML, or would I have to convert the whole lot of them? And again, could I put them all under one 'collection' and NOT have them show up in 'all books'? Or even if they did, once added, do you think I could easily search and open them?
11-09-2009, 09:07 AM   #5
Slite
Icanhasdonuts?
Device: Kobo Aura 2nd edition, Kobo Clara HD
Quote:
Originally Posted by okalyddude View Post
It doesn't really matter whether it is zipped or just "tarred"; if you open the download in, for instance, 7-Zip, which handles tars without problems, you can see for yourself what format the files are in.

I don't think the 600 supports HTML, though. But if it is HTML, you can most likely convert it to something sensible.
11-09-2009, 01:01 PM   #6
erio
Member
Device: PRS-600
I don't have the PRS-600 yet, so I can't speak from experience... and I don't even know how big that tar file is, so I'm just guessing here... Could we convert all the files inside it to LRF or EPUB, and then use search to find the word or name or whatever we're looking for? It would be really useful to be able to access Wikipedia no matter where or when...
11-15-2009, 09:21 PM   #7
okalyddude
Enthusiast
Device: E600
Quote:
Originally Posted by Slite View Post
Exactly. I unpacked the file, and it comes mostly as separate HTML files in alphabetical folders.

There are also some JPG files.

The HTML files could be converted to EPUB (though can't the 600 read HTML?), but I don't think there is any way to combine the HTML files and the JPGs.

If you simply put ALL the converted HTML files into a single collection, you could search through them. I will try this with a few letters first, to see how the speed holds up.

-----Is there any way to keep added files OUT of the 'all books' list? If you added Wikipedia as separate files, you'd basically be forced to use only 'collections' to find books (or search, or browse by author), but this is not a big deal.

------On that note, does anyone know how to put JPGs in collections?

For instance, when reading the A Game of Thrones / Song of Ice and Fire series, sometimes you'd like to take a look at a map. If you could just go to the collection you made and have all the books and maps together, that'd be fantastic.
11-16-2009, 01:01 AM   #8
quillaja
Connoisseur
Device: sony reader touch (prs-600)
I don't think the Sony Readers can handle plain raw HTML.

Why don't you make the whole thing into one massive EPUB? I assume each Wikipedia article is exactly one HTML file. Those are (possibly) fine as they are, though if the HTML is scraped straight from Wikipedia, there might be a lot of markup that EPUB can't accept. Anyway, just treat each Wikipedia article as a "chapter" when you create the table of contents. You can also make a multi-level TOC for easy browsing.

For example:

A
  AA
  AB
  AC...
B
  BA
  BB...

Of course, even if all this is in one massive EPUB file, you can still search it.

Images can be included in the EPUB itself and shown in the actual article, just like on a real webpage.

Of course, each and every file and image will have to be listed in the manifest.

I would imagine it's a lot of work, especially depending on how the images are organized in the file. However, it might be easier if you write scripts to construct the OPF and NCX files, etc.
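
For instance, the manifest entries could be generated rather than typed by hand. Here is a rough Python sketch (not a complete EPUB builder; the folder name at the bottom is hypothetical):

import os
import html

# Map file extensions to the media types an OPF manifest expects.
MEDIA_TYPES = {".htm": "application/xhtml+xml",
               ".html": "application/xhtml+xml",
               ".xhtml": "application/xhtml+xml",
               ".jpg": "image/jpeg",
               ".png": "image/png"}

def print_manifest_items(root_dir):
    # Walk the unpacked wiki folder and print an OPF <item> entry
    # for every page and image found under it.
    item_id = 0
    for dirpath, _, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            ext = os.path.splitext(name)[1].lower()
            if ext not in MEDIA_TYPES:
                continue
            href = os.path.relpath(os.path.join(dirpath, name), root_dir)
            print('<item id="item%d" href="%s" media-type="%s"/>'
                  % (item_id, html.escape(href.replace(os.sep, "/")),
                     MEDIA_TYPES[ext]))
            item_id += 1

print_manifest_items("wikipedia-schools/wp")  # hypothetical folder name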

Just some ideas. =)

11-16-2009, 05:04 AM   #9
eksor
Connoisseur
Device: prs700, i-mate JAMin, smartq v7, GeeksPhone Zero, iPad 3rd Gen
calibre (calibre.kovidgoyal.net/download) will work fine. Just download the tar, uncompress it, load the top HTML file in calibre, and convert it to LRF or EPUB.

But the file will be huge (note that a 3.5 GB DVD version is available, to give you an idea), and it will take a long time to load onto the device, if it ever finishes. In fact, you will need an SD or Pro Duo card to store it.

Alternatively, you could use calibre's web2disk on the site itself:

web2disk http://schools-wikipedia.org/wp/index/a.htm will download everything linked from a.htm (one level) and leave an index.htm in the download directory.
When it finishes:
ebook-convert index.htm a.lrf (or a.epub) will convert all the linked .htm files into a self-contained LRF or EPUB file. Sadly, even if you repeat the same with b.htm, no content will be cross-linked with a.htm. In other words, you will not be able to navigate freely across the wikipedia if you download/convert a, b, c, etc.
I have tested this approach on the site in question with a few subjects (Performers, for instance, or WWII), and it works fine. Results are better with the EPUB format (some images go missing with LRF).

Both web2disk and ebook-convert are command-line utilities that come with the calibre software.
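
If the per-letter approach pans out, the whole alphabet could be driven by a small Python loop (a sketch only: it assumes both tools are on the PATH, that an index page exists for every letter, and that web2disk writes into the current directory, so check web2disk --help first):

import os
import string
import subprocess

# One EPUB per letter, each built in its own folder so the index.htm
# files from separate runs don't collide.
for letter in string.ascii_lowercase:
    workdir = "wiki-%s" % letter
    os.makedirs(workdir, exist_ok=True)
    url = "http://schools-wikipedia.org/wp/index/%s.htm" % letter
    subprocess.run(["web2disk", "-r", "1", url], cwd=workdir, check=True)
    subprocess.run(["ebook-convert", "index.htm", "%s.epub" % letter],
                   cwd=workdir, check=True)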

Regards.
11-18-2009, 02:34 AM   #10
okalyddude
Enthusiast
Device: E600
Quote:
Originally Posted by eksor View Post
Hmm, OK. The version of Wikipedia I have does not have an index HTML for each letter, merely a folder containing all the HTML and JPG files (and the JPG files are not well labeled).

I will take a look at what you linked and see if I can get it working that way.

I did intend to buy an expansion memory card, but only if I get this working in an efficient manner.

The way you're describing it, I would have a single EPUB file for each letter? And the chapter catalogue would be links to all the HTML pages. How do you include the JPGs in the HTML pages? Are they already in the 'index' HTML? (I'm new and quite clueless about this, and have yet to try it out.)

Then, with the 600's search function, would you have to open the letter you want and then search, with the first result coming from the index?

Would there be any way to index them all in one file? Would the reader be able to handle that? Would it be able to handle the individual letters, which are quite large?

I've been busy, so I haven't had time to play around with the Wikipedia version I have. I will check out the link you provided to see how the indexing works by letter...

The poster above you mentioned what I thought I'd have to do (since what I have is a bunch of randomly named HTML and JPG files), but it would be a ton of work. Still, I could also index them all in one file, by letter, etc...
11-23-2009, 03:19 PM   #11
eksor
Connoisseur
Device: prs700, i-mate JAMin, smartq v7, GeeksPhone Zero, iPad 3rd Gen
Hi again:

I thought I had posted something about this earlier this morning; never mind, I had a bad night and it was full of mistakes. I beg your pardon if you saw it.

Finally, I just tested this again a few minutes ago:

1) web2disk -r 1 http://schools-wikipedia.org/wp/inde...t.Music.Musical_Instruments.htm

This puts subject.Music.Musical_Instruments.xhtml in the working directory.

2) web2disk -r 1 http://schools-wikipedia.org/wp/inde...nstruments.htm

This puts subject.Music.Musical_Instruments.xhtml in the same place.

3) Then I edited test.html with this content:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<TITLE></TITLE>
<META NAME="GENERATOR" CONTENT="OpenOffice.org 2.3 (Unix)">
<META NAME="CREATED" CONTENT="20091123;19424400">
<META NAME="CHANGED" CONTENT="20091123;19471300">
<STYLE TYPE="text/css">
<!--
@page { size: 21cm 29.7cm; margin: 2cm }
P { margin-bottom: 0.21cm }
-->
</STYLE>
</HEAD>
<BODY LANG="en-GB" DIR="LTR">
<P STYLE="margin-bottom: 0cm"><A HREF="subject.Music.Musical_Instruments.xhtml">
Instruments</A></P>
<P STYLE="margin-bottom: 0cm"><A HREF="subject.Music.Musical_Recordings_and_compositions.xhtml">Recordings and compositions</A></P>
</BODY>
</HTML>

4) Finally, ebook-convert test.html test.epub converted the HTML into an EPUB file with images, links and so on, at 3.6 MB. If you open it with Ark or another archiver you can see the structure and the 762 files. Imagine the whole thing.
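
Hand-writing that index page obviously won't scale to the full wiki, but a short Python script could generate it (a sketch only; it assumes the downloaded .xhtml files sit in the current directory):

import glob
import html

# Build a test.html-style index page that links every downloaded
# .xhtml file; the result can be fed to ebook-convert as above.
links = []
for name in sorted(glob.glob("*.xhtml")):
    title = name.replace(".xhtml", "").replace("_", " ")
    links.append('<p><a href="%s">%s</a></p>' % (name, html.escape(title)))

with open("test.html", "w", encoding="utf-8") as out:
    out.write("<html><head><title>Index</title></head><body>\n")
    out.write("\n".join(links))
    out.write("\n</body></html>\n")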

In conclusion: you could probably download the whole Wikipedia for Schools from the top page with web2disk, but:

1) It's probably not exactly polite, since a torrent is offered.
2) The download would take a long time.
3) The conversion would take a long time.
4) The resulting EPUB would be huge.

Regards.

11-24-2009, 04:28 AM   #12
okalyddude
Enthusiast
Device: E600
OK, what about not having a single EPUB file, though? What about categorizing all the individual HTML files or EPUBs, say all in one category, or by letter?

I realize you could not browse between any of the articles this way. But you COULD search for the article you want, read it, and then go back to whatever you were reading before. If I'm going to use offline Wikipedia on my reader, it's most likely just to read a single article I became curious about while reading.

Random question: is there ANY way to tweak the 'recently reading' feature on the PRS to give you a choice of, say, the last 5 books? It already knows when you last read each book and what page you're on...

So far, if you want to switch quickly between 3 or 4 books you're reading, aside from browsing by category, author, or title, the only thing I can think of is to manually bookmark/highlight the page you're on, then go through the notes.