Old 12-03-2008, 03:06 PM   #1
mikecook
Enthusiast
mikecook began at the beginning.
 
Posts: 35
Karma: 10
Join Date: Jun 2007
Location: United Kingdom
Device: iPad Mini, Nexus 7, Sony Reader, Kindle, and others.
ePubBooks.com: Gulliver's Travels...with images

I have just made available an EPUB version of Gulliver's Travels by Jonathan Swift over on the ePub Books Blog. I chose this title because it contains lots of footnotes and images.

For the last few months I've been creating some conversion scripts to convert the Project Gutenberg TXT files into the epub format. I have now finished those and have made this title available for everyone to try out while I'm working on building the new website.

http://www.epubbooks.com/blog/200812...ub-ebook-test/

I would love to hear your feedback, on everything from the frontend formatting to the underlying XML coding.
Old 12-03-2008, 03:34 PM   #2
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
Posts: 45,199
Karma: 27110894
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Since you're using utf-8 encoding anyway, I suggest replacing the numeric entities with utf-8 characters. Makes for smaller file sizes and easier parsing.
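
As a rough illustration of this suggestion, here is a minimal Python sketch (not part of the poster's Perl/XSL pipeline). The function name `decode_numeric_entities` and the sample string are assumptions made for the example. It replaces numeric character references with the characters themselves, but leaves the references for `&`, `<`, `>`, `"` and `'` alone, since writing those out literally could break the XML.

```python
import re

# Code points that must stay escaped in XML/XHTML: & < > " '
XML_SPECIAL = {0x26, 0x3C, 0x3E, 0x22, 0x27}

# Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_entities(markup: str) -> str:
    """Replace numeric character references with literal characters."""

    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep markup-significant or out-of-range references untouched.
        if code_point in XML_SPECIAL or code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return NUMERIC_ENTITY.sub(replace, markup)


if __name__ == "__main__":
    sample = "<p>&#8220;Lilliput&#8221; caf&#233; &#38; bar</p>"
    print(decode_numeric_entities(sample))
```

The result should then be written out as UTF-8 so that it matches the encoding declared in the file.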

Are you planning to release your scripts to convert the gutenberg txt markup to HTML? I've found that gutenberg books tend to have a lot of variation in their markup. How well does your script handle that?
Old 12-03-2008, 04:32 PM   #3
mikecook
Quote:
Originally Posted by kovidgoyal View Post
Since you're using utf-8 encoding anyway, I suggest replacing the numeric entities with utf-8 characters. Makes for smaller file sizes and easier parsing.
Thanks Kovid, I've put that on the list of updates to be made.

Quote:
Originally Posted by kovidgoyal View Post
Are you planning to release your scripts to convert the gutenberg txt markup to HTML?
I actually use two scripts. The first is written in Perl and converts from PG TXT to TEI; I then use XSL to convert the TEI into EPUB. My first XSL stylesheet actually produced HTML, but it has moved on quite a lot since then. That said, it shouldn't take too long to update the current stylesheet to output HTML.

The answer to your question is yes... and no. I do hope to release this in the future, but at the moment it is not really very user-friendly or robust. I currently have 90 files converted (my test base), but once my site is live I will start churning out more titles, which means I can start to improve the scripts.

Quote:
Originally Posted by kovidgoyal View Post
I've found that gutenberg books tend to have a lot of variation in their markup. How well does your script handle that?
My pg2tei.pl script does a pretty good job but, as you say, there are a lot of variations. The PG chapter headings are often inconsistent, so sometimes I need to do a little preprocessing: a regex search to find the headings, then adding extra blank lines after each one so the script can recognise them as chapter headings.
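
The kind of preprocessing described here could look something like the following Python sketch. It is only an illustration: the real pg2tei.pl is Perl and has not been published, and the `HEADING` pattern, the function name and the choice of two blank lines are all assumptions.

```python
import re

# Hypothetical heading pattern: "CHAPTER" or "Chapter" followed by a
# Roman or Arabic numeral. Real PG files vary far more than this.
HEADING = re.compile(r"^\s*(?:CHAPTER|Chapter)\s+(?:[IVXLCDM]+|\d+)\b.*$")


def pad_chapter_headings(text: str, blank_lines: int = 2) -> str:
    """Make sure each heading line is followed by `blank_lines` blank lines."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        out.append(line)
        i += 1
        if HEADING.match(line):
            # Drop whatever blank lines already follow the heading...
            while i < len(lines) and not lines[i].strip():
                i += 1
            # ...and replace them with a fixed, recognisable number.
            out.extend([""] * blank_lines)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "CHAPTER I.\nThe author gives some account of himself.\n"
    print(pad_chapter_headings(sample))
```

Normalising to a fixed number of blank lines gives a later parsing stage one consistent signal to look for, instead of a separate rule for each book's heading style.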

At the moment my footnote routines only do half their job. Thankfully, most files have only a few footnotes, so it hasn't been such a big problem. Still, I will improve this soon.

I catch most quotes, but some are missing from the source, and I still confuse single quotes with the apostrophes used for word contractions, e.g. 'nothin' could help 'im save the world'. It shouldn't be too hard to fix most, if not all, of these.
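
One possible heuristic for this problem, sketched in Python rather than the Perl the script actually uses: treat a small list of known clipped words ('im, 'em, 'tis and so on) as apostrophes, treat any remaining single quote at the start of a word as an opening quote, and treat everything else as a closing quote or apostrophe. The word list and function name are assumptions, and the sketch only picks typographic characters; it does not pair opening and closing quotes for TEI markup.

```python
import re

# Hypothetical, deliberately short list of words that lose leading letters.
LEADING_ELISIONS = ("im", "em", "er", "tis", "twas", "cause")

# A straight quote before a known clipped word: really an apostrophe.
ELISION = re.compile(
    r"(?<![A-Za-z])'(?=(?:%s)\b)" % "|".join(LEADING_ELISIONS),
    re.IGNORECASE,
)
# A straight quote at the start of the text or after whitespace or an
# opening bracket: an opening quotation mark.
OPENING = re.compile(r"(?:^|(?<=[\s(\[]))'")


def curl_single_quotes(text: str) -> str:
    """Convert straight single quotes to curly quotes and apostrophes."""
    text = ELISION.sub("\u2019", text)   # 'im  -> apostrophe
    text = OPENING.sub("\u2018", text)   # 'word -> opening quote
    return text.replace("'", "\u2019")   # the rest: closing quote or apostrophe


if __name__ == "__main__":
    print(curl_single_quotes("'nothin' could help 'im save the world'"))
```

A list like this will still misfire on real sentences that begin with one of the listed words inside quotes, so flagged cases would still need a manual check.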

There are currently two really big areas that need improvement to speed up conversion.

Frontmatter: I process as much as I can but I still need to include the original frontmatter (between the TEI header and first chapter) so I can double check everything and add any missing info into the teiHeader and front sections.

Images: I can automatically mark up the TEI for images, but the PG txt files don't contain any filename information. For images I need to go through the HTML version and manually add the filenames to the TEI. I should be able to add functionality to the script to read the files from disk and populate the TEI file, but this is still prone to errors.
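
To illustrate how the filenames might be collected automatically, here is a small Python sketch that lists the `src` and `alt` of every `<img>` in a PG HTML file, in document order, using the standard library's `html.parser`. The class and function names and the sample filenames are made up for the example; matching each image to the right place in the TEI would still need checking by hand.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect (src, alt) pairs for every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; self-closing <img ... /> tags are
        # also routed here by the default handle_startendtag.
        if tag == "img":
            attributes = dict(attrs)
            if attributes.get("src"):
                self.images.append(
                    (attributes["src"], attributes.get("alt") or "")
                )


def list_images(html_text: str):
    collector = ImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.images


if __name__ == "__main__":
    sample = '<p><img src="images/p012.jpg" alt="Gulliver bound"/></p>'
    for src, alt in list_images(sample):
        print(src, "|", alt)
```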

Basically, there is always going to be some manual work needed, but I hope to reduce this to a minimum pretty quickly.
Old 12-03-2008, 04:57 PM   #4
kovidgoyal
Since you parse the HTML versions for images anyway, why not use those as the source, since they are less inconsistent than the txt files, and only fall back to the txt when no HTML is present?
Old 12-03-2008, 05:06 PM   #5
mikecook
Actually I parse the TXT version! ...does this make me crazy?

At the time it seemed like the HTML versions would create more problems than they solved. Yes, chapters and paragraphs were already marked up, but there are often a lot of variations in other aspects. I don't believe this would have made things any easier. Plus there was potential for messy mark-up... I really wanted to keep things ultra clean.

Whether that was the right decision or not, I won't change things now.
Old 12-03-2008, 05:16 PM   #6
kovidgoyal
If you're going for a manual conversion approach, then the TXT files make sense, since they will, as you say, yield cleaner epub files.
Old 12-03-2008, 05:26 PM   #7
mikecook
Quote:
Originally Posted by kovidgoyal View Post
If you're going for a manual conversion approach, then the TXT files make sense, since they will, as you say, yield cleaner epub files.
For sure, although I'm hoping to reduce the 'manual' labour to a minimum. It's as much about producing clean TEI/epub files as it is converting the PG catalogue. It will take longer to build up the ePub book catalogue, but I think the results will be well worth it.
Old 12-03-2008, 05:45 PM   #8
kovidgoyal
Quote:
Originally Posted by mikecook View Post
For sure, although I'm hoping to reduce the 'manual' labour to a minimum. It's as much about producing clean TEI/epub files as it is converting the PG catalogue. It will take longer to build up the ePub book catalogue, but I think the results will be well worth it.
It's certainly a worthy goal; I wish you all the best.
Old 12-03-2008, 05:50 PM   #9
kovidgoyal
Another suggestion: Add class="chapter_heading" to the chapter headings.

Also, if you are able to identify image captions, you should put the caption into the alt attribute of the img tags instead of the generic "Illustration", and add class="image_caption" to the image captions in the text itself.
Old 12-03-2008, 06:12 PM   #10
mikecook
Quote:
Originally Posted by kovidgoyal View Post
Another suggestion: Add class="chapter_heading" to the chapter headings.

Also, if you are able to identify image captions, you should put the caption into the alt attribute of the img tags instead of the generic "Illustration", and add class="image_caption" to the image captions in the text itself.
Can I ask why you recommend adding these classes? Am I right in thinking that it is only for descriptive purposes?

Some images in the PG files have both a caption and description so the TEI is marked up in this way. I can't now remember the reasoning for taking the alt attribute from the <figDesc> tag but perhaps this needs rethinking.
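
For reference, one way the caption/description choice could be handled is sketched below in Python. The actual conversion is done in XSL, which is not shown in this thread, so this is only an illustration. It assumes a TEI `<figure>` containing `<graphic url="...">`, `<head>` (caption) and `<figDesc>` (description), without namespaces. The fallback order (description, then caption, then the generic "Illustration") and the `figure` wrapper class are assumptions; the `image_caption` class is the one suggested above.

```python
import xml.etree.ElementTree as ET


def figure_to_xhtml(figure: ET.Element) -> ET.Element:
    """Turn a (namespace-free) TEI <figure> into an XHTML <div> fragment."""
    graphic = figure.find("graphic")
    caption = (figure.findtext("head") or "").strip()
    description = (figure.findtext("figDesc") or "").strip()

    wrapper = ET.Element("div", {"class": "figure"})
    ET.SubElement(wrapper, "img", {
        "src": graphic.get("url", "") if graphic is not None else "",
        # Assumed fallback order: description, then caption, then generic.
        "alt": description or caption or "Illustration",
    })
    if caption:
        caption_p = ET.SubElement(wrapper, "p", {"class": "image_caption"})
        caption_p.text = caption
    return wrapper


if __name__ == "__main__":
    tei = (
        '<figure><graphic url="images/p012.jpg"/>'
        "<head>Gulliver in Lilliput</head>"
        "<figDesc>Gulliver tied to the ground</figDesc></figure>"
    )
    print(ET.tostring(figure_to_xhtml(ET.fromstring(tei)), encoding="unicode"))
```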
Old 12-03-2008, 07:53 PM   #11
kovidgoyal
To make the HTML more semantic, so that it is easier for someone to further process or convert the epub files, or for a user to view them with a custom CSS stylesheet (calibre's epub viewer allows this).
Old 12-06-2008, 07:09 AM   #12
mikecook
Okay thanks Kovid, I will certainly give that some thought.