Help on project to create ePub of the online Stanford Encyclopedia of Philosophy


cptnemo
06-29-2012, 06:41 PM
Hello,

I have long been a heavy user of the Stanford Encyclopedia of Philosophy (http://plato.stanford.edu/), an open collection of about 1300 articles on philosophers and philosophical topics.

What I would like to create is an ePub version of the website, so that I can always carry the Encyclopedia in my pocket.

What I have done so far is to download the articles with wget (about 270 MB) with these options: I excluded js, txt and css files; I fetched all the images needed to display the pages; and, importantly, instead of recreating the site's directories I asked wget to put all files in one folder (with no subdirectories) and to rewrite all internal links accordingly.

Basically, I created a functioning single-folder website on my desktop. I thought the next step would be easy: I was wrong.

I tried:

1) to handcraft an ePub by adding the required XML files and folder structure (mimetype, OEBPS, META-INF, etc.) following these instructions (http://www.jedisaber.com/eBooks/formatsource.shtml) (I added over 4000 elements to Content.opf), and then to create my .zip/epub file with this nice AppleScript (http://www.mobileread.com/forums/attachment.php?attachmentid=36026&d=1253223157). But the ePub does not work: I tried to open it with Sigil (it crashed) and with Calibre (it opens, but the internal links of the ePub do not work: they point to files on my desktop, not to files in the ePub).

2) I tried to convert my handcrafted ePub with Calibre into another ePub. Calibre crashed.

3) I tried to create an ePub with Calibre by opening the contents.html file from my desktop. Calibre crashed.
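
For reference, the packaging step of attempt 1 can also be done without AppleScript. The EPUB container format expects the "mimetype" entry to be the first file in the zip and to be stored uncompressed, which generic zip tools do not always guarantee. Here is a minimal Python sketch, assuming the handcrafted folder already contains META-INF and OEBPS; the function name `build_epub` and the folder/file names in the usage line are made up for this example:

```python
import zipfile
from pathlib import Path


def build_epub(source_dir, epub_path):
    """Zip source_dir into epub_path, writing the mimetype entry first and uncompressed."""
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # First entry: "mimetype", stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # Everything else (META-INF, OEBPS, ...) can be deflated.
        for path in sorted(source.rglob("*")):
            arcname = path.relative_to(source).as_posix()
            if path.is_file() and arcname != "mimetype":
                zf.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Usage would be something like `build_epub("sep_book", "sep.epub")`, with the output file kept outside the source folder. This only fixes the container layout; it does not check Content.opf or the links.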

Do you think my file is too big? Should I preprocess my HTML files in some way before converting?

Jellby
06-30-2012, 04:02 AM
and with Calibre (it opens but internal links of the ePub do not work, they point to files on my desktop not in the ePub).

Make sure the internal links in your files are relative links, not absolute.
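
A quick way to find the offending links is to scan each page for href/src values that carry a scheme (file:, http:) or start with a slash. A minimal Python sketch, assuming the pages are ordinary HTML; the names `LinkCollector` and `absolute_links` are made up for this example:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects every href and src attribute value found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def absolute_links(html_text):
    """Return the links that are not relative (they have a scheme or start with '/')."""
    parser = LinkCollector()
    parser.feed(html_text)
    return [link for link in parser.links
            if urlparse(link).scheme or link.startswith("/")]
```

Anything this returns for a page (for example a file:///... path into the Desktop folder) is a link that will break once the page is inside the ePub.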

As for other problems, my advice is to start with a small portion of the pages, so it is easier to test and debug.

mmat1
06-30-2012, 05:44 AM
... (about 270 mb) ...
...(I add over 4000 elements to Content.opf)...


... maybe a bit too large :)

mrmikel
06-30-2012, 06:50 AM
Even if you get it to work, it will be VERY SLOW, and the Table of Contents may be larger than most entire books, especially when displayed on a small screen.

Certainly there are topic areas that lend themselves to being books of their own. On the Sony Readers, at least, you can create Collections, which may be helpful for this sort of thing.

Make sure the links don't point outside of the device, as some devices are not connected to the internet all the time or do not accept references to the web.

The Harvard Classics, which your project sort of reminds me of, fills 50 paper books, and the ereader edition keeps the same number of volumes.

Links are case-sensitive in epubs, so you will have to watch for that too.
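
Case problems are easy to miss on a desktop file system that ignores case. A small Python sketch of a check, assuming you already have the list of links from a page and the list of file names in the folder; `case_mismatches` is a made-up name:

```python
from urllib.parse import urldefrag, unquote


def case_mismatches(links, filenames):
    """Return (link, real_name) pairs where the link differs from a real file only by case."""
    exact = set(filenames)
    by_lower = {name.lower(): name for name in filenames}
    problems = []
    for link in links:
        # Drop any "#fragment" and decode %xx escapes before comparing.
        target = unquote(urldefrag(link).url)
        if target and target not in exact and target.lower() in by_lower:
            problems.append((link, by_lower[target.lower()]))
    return problems
```

For example, a link to `Kant.html#life` is reported if the folder only contains `kant.html`. Links to files that do not exist at all are not reported by this check.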

You might try just copying and pasting articles one at a time into Sigil, making sure to insert chapter breaks so that some readers will not fail because the sections are over about 250k.

You could use something similar to HTTrack to get the whole web site on your computer, so you have access to all the bits and pieces as you construct your book.

AlPe
06-30-2012, 08:18 AM
Hi cptnemo,

The previous observations are correct: besides technical things such as internal links (have you tried validating your EPUB after creating it?), the size of the file, the TOC and the OPF content might be a problem for most readers. I second the previous suggestion to try splitting the files into smaller EPUBs, e.g., by starting letter.

I did a similar, but smaller, project by fetching the Divina Commedia with several commentaries from the Dartmouth Dante Project and packing it into a single EPUB (you can get a proof-of-concept here (http://forum.simplicissimus.it/consigli-di-lettura/divina-commedia/msg93692/#msg93692)). The resulting EPUB is roughly 3 MB, with 100 XHTML pages and 10,000 notes/comments.

I am quite busy these days, but in the middle of next week I could take a look at it, if you would like to work on this 'project'.

Zeno_
08-25-2013, 03:50 PM
I had the same idea as the TS (topic starter).
Because I just convert the HTML files to ePub, the file won't be that big.

So it can be done, but there are some problems.

As of now:
- Some logic symbols aren't rendered correctly.
- There is a lot of gibberish before the article starts. I would be able to remove it if I found a way to add extra CSS to every HTML file. This could be a setting in a mass downloader that appends it to every HTML file it downloads, or maybe someone knows a way to add something to a lot of HTML files at once.
- The TOC is HUGE.
- The title of every subject appears twice: once as the webpage title (with Stanford etc. etc. added) and once as the title of the topic.
- Internal links don't work, only an index file does (can be solved manually, I can convert words that contain an exact title of another topic automatically).

Solvable problems, not even big problems, but if someone can help me with the second one in particular, that would be great.
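
For the second problem, one possible approach is a small script that inserts a stylesheet link before the closing head tag of every file. This is only a sketch in Python, assuming the files are UTF-8 and each has a `</head>` tag; the function names and the `extra.css` file name are made up, and the CSS rules that hide the unwanted parts still have to be written by hand:

```python
import re
from pathlib import Path

HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def add_stylesheet(html_text, href="extra.css"):
    """Insert a <link> to href just before </head>; leave the text alone if already present."""
    link = f'<link rel="stylesheet" type="text/css" href="{href}" />'
    if link in html_text:
        return html_text
    # Only the first </head> is touched; pages without one are returned unchanged.
    return HEAD_CLOSE.sub(lambda m: link + "\n" + m.group(0), html_text, count=1)


def add_stylesheet_to_folder(folder, href="extra.css"):
    """Apply add_stylesheet to every .html file directly inside folder (assumes UTF-8)."""
    for path in Path(folder).glob("*.html"):
        text = path.read_text(encoding="utf-8")
        path.write_text(add_stylesheet(text, href), encoding="utf-8")
```

Running it twice is harmless, because a file that already contains the link is skipped. Work on a copy of the folder, since the files are rewritten in place.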

AlPe
08-25-2013, 03:58 PM
If you look at the source code of one article page, you will notice that the useful stuff is contained between two comments:


<div id="aueditable"><!--DO NOT MODIFY THIS LINE AND ABOVE-->
<h1>Abduction</h1><div id="pubinfo"><em>First published Wed Mar 9, 2011</em></div>
...
</div><!-- #aueditable --><!--DO NOT MODIFY THIS LINE AND BELOW-->


My suggestion is to download every article, grab the part of the source between those comments, and create your own header/footer, plus a CSS file that uses the classes defined in the original CSS of the SEoP articles.
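
Grabbing the part between the two comments can be sketched in a few lines of Python. This assumes every article page contains both marker comments exactly as quoted above, which I have not checked for all entries; `extract_article` is a made-up name:

```python
START_MARK = "<!--DO NOT MODIFY THIS LINE AND ABOVE-->"
END_MARK = "<!--DO NOT MODIFY THIS LINE AND BELOW-->"


def extract_article(html_text):
    """Return the article body between the two marker comments, or None if a marker is missing."""
    start = html_text.find(START_MARK)
    if start == -1:
        return None
    body_start = start + len(START_MARK)
    end = html_text.find(END_MARK, body_start)
    if end == -1:
        return None
    # The opening <div id="aueditable"> sits just before the first marker, while its
    # closing </div> is inside the extracted part, so re-add the opening tag to keep
    # the fragment balanced.
    return '<div id="aueditable">' + html_text[body_start:end]
```

The returned fragment can then be placed between your own header and footer. A page for which the function returns None should be inspected by hand.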

AlPe
08-26-2013, 03:05 PM
A quick Bash script told me that currently there are 1385 articles linked from the online TOC ( http://plato.stanford.edu/contents.html ), and that the uncompressed size of all the HTML pages is ~127 MB.
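
The count can be reproduced along these lines. This is a Python sketch, not the Bash script mentioned above, and it assumes that entry links in contents.html have an href starting with "entries/"; that prefix and the names `EntryLinks` and `count_entries` are assumptions of this example:

```python
from html.parser import HTMLParser


class EntryLinks(HTMLParser):
    """Collects the distinct hrefs of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.entries = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith(self.prefix):
            # Ignore "#fragment" parts so each entry is counted once.
            self.entries.add(href.split("#")[0])


def count_entries(toc_html, prefix="entries/"):
    """Count the distinct entry pages linked from the table-of-contents HTML."""
    parser = EntryLinks(prefix)
    parser.feed(toc_html)
    return len(parser.entries)
```

Feed it the text of a saved copy of contents.html. If the real links use a different prefix, pass it as the second argument.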

If you are going to create a single EPUB, I doubt eReaders will open it, due to its size. Tablet apps might be ok.

In that case, do not put every entry in the TOC. Just put "Letter A", "Letter B", etc., each linking to an XHTML page that contains links to the entries. Probably a 2-level TOC ("A" > "AB", "AC", ... | "B" > "BA", "BE", etc.) would still be manageable by most reading systems.
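
The grouping for such a 2-level TOC can be sketched like this in Python. It only builds the nested structure from a list of entry titles; writing the NCX/XHTML pages from it is left out, and `two_level_toc` is a made-up name:

```python
from collections import defaultdict


def two_level_toc(titles):
    """Group titles as {first letter: {first two letters: [titles]}}."""
    toc = defaultdict(lambda: defaultdict(list))
    for title in sorted(titles, key=str.casefold):
        # Use only the letters of the title, so "a priori" is filed under "AP".
        letters = [ch for ch in title.upper() if ch.isalpha()]
        if not letters:
            continue  # titles without any letter are skipped in this sketch
        toc[letters[0]]["".join(letters[:2])].append(title)
    return {letter: dict(groups) for letter, groups in toc.items()}
```

With about 1400 entries this gives at most 26 top-level items, each with a handful of two-letter groups, instead of one flat list.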

EDIT: from a quick test, the EPUB containing all the articles is around 40 MB. My iPad opens it, but it is very slow to open from the library and to navigate the TOC. My Kobo Glo is simply not able to open it at all (or, more precisely, it still seems to be trying after ~5 minutes).

EDIT: with a bit of labor, one can remove the junk and get a decent EPUB. It is still slow to load from the library, but at least the navigation is not intolerably slow. See attached screenshots.

Zeno_
08-31-2013, 01:10 PM
I underestimated how many entries there were :eek:
Now I understand the problem with the file size.
Kinda defeats the purpose of trying for me if I can't open it with my PRS-T1.

For iPad, I'd recommend an app like Offline pages (https://itunes.apple.com/nl/app/offline-pages-offline-web/id364859644?mt=8) or something similar.
Then you can just download a page every now and then if you need it offline.
For iPhone there is also an app from Stanford (http://plato.stanford.edu/tools/iPhoneReader/index.html), but it requires an internet connection to browse.

AlPe
09-01-2013, 05:47 AM
I underestimated how many entries there were :eek:
Now I understand the problem with the file size.
Kinda defeats the purpose of trying for me if I can't open it with my PRS-T1.


But you can either:

1) select only the articles you want to read and bundle them into a single EPUB file (like Wikipedia allows) or

2) split the entire encyclopedia in multiple EPUB files.

I do not have my T1 with me, but my Kobo Glo has just accepted a 5 MB EPUB with a subset of the articles. Fewer than 10 such chunks should be enough for the whole SEoP. Clearly you lose the chance to click internal links if you are on an A-article and you want to jump to a Z-article, but you gain the possibility of reading it on e-ink, which is still very nice.
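
Option 2 can be sketched as a simple greedy split: walk the articles in alphabetical order and start a new chunk whenever the size budget would be exceeded. This is a Python sketch with a made-up function name; note that it budgets the uncompressed HTML size, so the resulting EPUB files will be smaller than the budget, and the right budget for a given device is a matter of trial:

```python
def split_into_chunks(sizes, max_bytes):
    """Split (name, size_in_bytes) pairs, in order, into consecutive chunks under max_bytes.

    A single article larger than max_bytes still gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for name, size in sizes:
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Because the chunks are consecutive in alphabetical order, each EPUB can be labelled by its first and last entry, much like the volumes of a paper encyclopedia.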

Actually, I was thinking of writing an email to the SEoP guys, asking whether they could provide an "official" EPUB file (or multiple chunks). I always find it odd that universities/foundations put data online without protection from screen scraping, yet neither provide access to the raw data nor produce reasonable output formats. Unlike advertising-based sites, they do not gain anything from having people visit their Web pages.

mrmikel
09-01-2013, 06:47 AM
Since it is an academic document, it is probably littered with footnotes, etc. Since it is for private use, why not just dump them all overboard? It might reduce the book to a reasonable size. You might also get rid of indexes, etc., if the articles are all old friends.

You can always look up things you have questions about when you get to a computer.