ePubBooks.com: Gulliver's Travels...with images

mikecook · 12-03-2008, 03:06 PM

I have just made available an EPUB version of Gulliver's Travels by Jonathan Swift over on the ePub Books Blog. I'm releasing this title as it contains lots of footnotes and images.

For the last few months I've been creating some conversion scripts to convert the Project Gutenberg TXT files into the epub format. I have now finished those and have made this title available for everyone to try out while I'm working on building the new website.

http://www.epubbooks.com/blog/200812...ub-ebook-test/

I would love to hear your feedback, on everything from the frontend formatting to the underlying XML coding.

kovidgoyal · 12-03-2008, 03:34 PM

Since you're using utf-8 encoding anyway, I suggest replacing the numeric entities with utf-8 characters. Makes for smaller file sizes and easier parsing.
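A minimal sketch of that entity replacement, in Python rather than the Perl/XSL used for the actual conversion. The function name `decode_numeric_entities` is made up for illustration. It assumes the files are saved as UTF-8 and leaves the references for `&`, `<`, `>` and the two quote characters alone, since decoding those could break the XML.

```python
import re

# Code points that must stay escaped so the XML remains well-formed.
_KEEP_ESCAPED = {ord(c) for c in "&<>\"'"}

# Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_entities(text: str) -> str:
    """Replace numeric character references with the literal characters."""

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        is_surrogate = 0xD800 <= code_point <= 0xDFFF
        if (
            code_point in _KEEP_ESCAPED
            or code_point == 0
            or code_point > 0x10FFFF
            or is_surrogate
        ):
            # Leave markup-significant or invalid references untouched.
            return match.group(0)
        return chr(code_point)

    return _NUMERIC_REF.sub(_replace, text)


if __name__ == "__main__":
    print(decode_numeric_entities("caf&#233; &#8220;hi&#8221; &#38; &#x3c;"))
```

Named entities such as `&amp;` are not touched at all, because the pattern only matches references that start with `&#`.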

Are you planning to release your scripts to convert the gutenberg txt markup to HTML? I've found that gutenberg books tend to have a lot of variation in their markup. How well does your script handle that?

mikecook · 12-03-2008, 04:32 PM

Quote:

Originally Posted by kovidgoyal

Since you're using utf-8 encoding anyway, I suggest replacing the numeric entities with utf-8 characters. Makes for smaller file sizes and easier parsing.

Thanks Kovid, I've put that on the list of updates to be made.

Quote:

Originally Posted by kovidgoyal

Are you planning to release your scripts to convert the gutenberg txt markup to HTML?

I actually use two scripts. The first is written in Perl and converts from PG TXT to TEI; then I use XSL to convert the TEI into EPUB. My first XSL stylesheet actually produced HTML, but it has moved on quite a lot since then. That said, it shouldn't take too long to update the current stylesheet to output HTML.

The answer to your question is yes....and no. I do hope to release this in the future but at the moment it is not really very user friendly/robust. I currently have 90 files converted (my test base) but once my site is live I will start churning out more titles, which means I can start to improve the scripts.

Quote:

Originally Posted by kovidgoyal

I've found that gutenberg books tend to have a lot of variation in their markup. How well does your script handle that?

My pg2tei.pl script does a pretty good job but, like you say, there are a lot of variations. The PG chapter headings are often inconsistent, so sometimes I need to do a little preprocessing: a regex lookup that finds the headings and adds extra blank lines after each one, to help mark them out as chapter headings.
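To give an idea of the kind of preprocessing meant here, a rough Python sketch follows. It is not the pg2tei.pl code: the `HEADING` pattern, the function name `pad_chapter_headings` and the choice of two blank lines are all assumptions for illustration, and real PG files would need a pattern tuned per book.

```python
import re

# Hypothetical heading pattern: "CHAPTER" or "Chapter" followed by a Roman
# or Arabic number. This is only a heuristic and will miss other styles.
HEADING = re.compile(r"^(?:CHAPTER|Chapter)\s+(?:[IVXLCDM]+|\d+)\b")


def pad_chapter_headings(text: str, blank_after: int = 2) -> str:
    """Follow every chapter heading with exactly `blank_after` blank lines."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        out.append(line)
        i += 1
        if HEADING.match(line.strip()):
            # Skip any blank lines already present after the heading,
            # then emit a fixed number so later stages see a stable marker.
            while i < len(lines) and not lines[i].strip():
                i += 1
            out.extend([""] * blank_after)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "CHAPTER I.\nThe author gives some account.\n\nChapter 2\n\n\n\nMore text."
    print(pad_chapter_headings(sample))
```

Normalising to a fixed number of blank lines, rather than just adding more, keeps the output the same if the script is run twice over the same file.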

At the moment my footnote routines only do half their job...thankfully most files only have a few so it hasn't been such a big problem. Still, I will improve this soon.

I catch most quotes, but some are missing/not included in the source, and I still confuse single quotes with the apostrophes used in word contractions, e.g. 'nothin' could help 'im save the world'. It shouldn't be too hard to fix most, if not all, of these.
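One possible heuristic for telling quote marks from apostrophes is sketched below in Python. It is only an illustration under stated assumptions: the `LEADING_ELISIONS` word list is a tiny made-up sample, and the "dropped g" rule (a word longer than four letters ending in "in", followed by a lowercase word) will misfire on words such as "begin".

```python
import re

# Hypothetical, deliberately small list of words that begin with an elision.
LEADING_ELISIONS = {"im", "em", "er", "tis", "twas", "cause"}


def classify_single_quotes(text: str) -> list[tuple[int, str]]:
    """Label each straight single quote as 'open', 'close' or 'apostrophe'."""
    labels = []
    quote_open = False
    for match in re.finditer("'", text):
        pos = match.start()
        before = text[pos - 1] if pos > 0 else " "
        after = text[pos + 1] if pos + 1 < len(text) else " "
        if before.isalpha() and after.isalpha():
            kind = "apostrophe"  # inside a word: don't, it's
        elif after.isalpha():
            # Quote at the start of a word: elision ('im) or opening quote.
            word = re.match(r"[^\W\d_]+", text[pos + 1:]).group(0).lower()
            kind = "apostrophe" if word in LEADING_ELISIONS else "open"
        else:
            # Quote at the end of a word: dropped letter (nothin') or close.
            word = re.search(r"[^\W\d_]*$", text[:pos]).group(0).lower()
            rest = text[pos + 1:].lstrip()
            dropped_g = (
                len(word) > 4 and word.endswith("in") and rest[:1].islower()
            )
            kind = "apostrophe" if dropped_g or not quote_open else "close"
        if kind == "open":
            quote_open = True
        elif kind == "close":
            quote_open = False
        labels.append((pos, kind))
    return labels


if __name__ == "__main__":
    for pos, kind in classify_single_quotes("'nothin' could help 'im save the world'"):
        print(pos, kind)
```

Anything the heuristic labels wrongly would still need a manual check, so it is probably most useful for flagging suspicious lines rather than rewriting them silently.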

There are currently two really big areas that need improvement to speed up conversion.

Frontmatter: I process as much as I can but I still need to include the original frontmatter (between the TEI header and first chapter) so I can double check everything and add any missing info into the teiHeader and front sections.

Images: I can automatically mark up the TEI for images, but the PG txt files don't actually have any filename information. For images I need to go through the HTML version and manually add the filenames into the TEI. I should be able to add functionality into the script to read in the files from disk and populate the TEI file, but this is still prone to errors.
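As a rough idea of what reading the files from disk could look like, here is a Python sketch. It rests on assumptions that are not in the scripts described above: that the TEI has namespace-free `<figure>` elements, that a `<graphic url="...">` child is the target markup, that the images live in an `images/` folder, and that sorting the filenames gives document order. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif"}


def list_images(directory: str) -> list[str]:
    """Return image filenames in sorted order (assumed to be document order)."""
    return sorted(
        p.name for p in Path(directory).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )


def attach_images(tei_root: ET.Element, image_names: list[str]) -> list[str]:
    """Pair <figure> elements with filenames in order; return any warnings."""
    figures = list(tei_root.iter("figure"))
    warnings = []
    if len(figures) != len(image_names):
        # A count mismatch is the main sign that the pairing is unreliable.
        warnings.append(
            f"{len(figures)} figures but {len(image_names)} image files"
        )
    for figure, name in zip(figures, image_names):
        # Assumed target markup: <graphic url="images/..."/> as first child.
        figure.insert(0, ET.Element("graphic", {"url": f"images/{name}"}))
    return warnings


if __name__ == "__main__":
    root = ET.fromstring(
        "<text><figure><head>Caption</head></figure><figure/></text>"
    )
    print(attach_images(root, ["map01.jpg"]))
    print(ET.tostring(root, encoding="unicode"))
```

The warning list is the important part: if the number of figures and files differs, the pairing is almost certainly off by one somewhere and needs checking by hand.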

Basically, there is always going to be some manual work needed, but I hope to reduce this to a minimum pretty quickly.

kovidgoyal · 12-03-2008, 04:57 PM

Since you parse the HTML versions for images anyway, why not use those as the source, since they are less inconsistent than the txt files, and only fall back to the txt when no HTML is present?

mikecook · 12-03-2008, 05:06 PM

Actually I parse the TXT version! ...does this make me crazy?

At the time it actually seemed like the HTML versions would present more problems than they solved. Yes, chapters and paragraphs were already done, but there's often a lot of variation in other aspects. I don't believe this would have made things any easier. Plus there was potential for messy mark-up...I really wanted to keep things ultra clean.

Whether that was the right decision or not, I won't change things now.

kovidgoyal · 12-03-2008, 05:16 PM

If you're going for a manual conversion approach, then the TXT files make sense, since they will, as you say, yield cleaner epub files.

mikecook · 12-03-2008, 05:26 PM

Quote:

Originally Posted by kovidgoyal

If you're going for a manual conversion approach, then the TXT files make sense, since they will, as you say, yield cleaner epub files.

For sure, although I'm hoping to reduce the 'manual' labour to a minimum. It's as much about producing clean TEI/epub files as it is converting the PG catalogue. It will take longer to build up the ePub book catalogue, but I think the results will be well worth it.

kovidgoyal · 12-03-2008, 05:45 PM

Quote:

Originally Posted by mikecook

For sure, although I'm hoping to reduce the 'manual' labour to a minimum. It's as much about producing clean TEI/epub files as it is converting the PG catalogue. It will take longer to build up the ePub book catalogue, but I think the results will be well worth it.

It's certainly a worthy goal, wish you all the best.

kovidgoyal · 12-03-2008, 05:50 PM

Another suggestion: Add class="chapter_heading" to the chapter headings.

Also, if you are able to identify image captions, you should put that into the alt attribute of img tags instead of the generic Illustration and add class="image_caption" to the image captions in the text itself.

mikecook · 12-03-2008, 06:12 PM

Quote:

Originally Posted by kovidgoyal

Another suggestion: Add class="chapter_heading" to the chapter headings.

Also, if you are able to identify image captions, you should put that into the alt attribute of img tags instead of the generic Illustration and add class="image_caption" to the image captions in the text itself.

Can I ask why you recommend adding these classes? Am I right in thinking that it is only for descriptive purposes?

Some images in the PG files have both a caption and description so the TEI is marked up in this way. I can't now remember the reasoning for taking the alt attribute from the <figDesc> tag but perhaps this needs rethinking.

kovidgoyal · 12-03-2008, 07:53 PM

To make the HTML more semantic so that if someone wants to further process/convert the epub files or if a user wants to use a custom CSS stylesheet to view them (calibre's epub viewer allows this), it will be easier.
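As a small illustration of that kind of further processing, the Python sketch below pulls out the text of every element carrying a given class, using only the standard library's `html.parser`. The class names `chapter_heading` and `image_caption` are the ones suggested earlier in the thread; the `ClassTextCollector` helper and the sample markup are made up, and the sketch assumes XHTML-style input where empty elements are self-closed.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text content of every element with a given class."""

    def __init__(self, wanted_class: str):
        super().__init__()
        self.wanted_class = wanted_class
        self.found = []
        self._depth = 0  # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested tag inside a matching element
        elif self.wanted_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.found.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


if __name__ == "__main__":
    page = (
        '<h2 class="chapter_heading">Chapter <em>I</em></h2>'
        "<p>Some text.</p>"
        '<p class="image_caption">Gulliver in Lilliput</p>'
    )
    collector = ClassTextCollector("chapter_heading")
    collector.feed(page)
    print(collector.found)
```

Without the class, a tool would have to guess which `h2` or `p` elements are headings or captions, which is exactly the guesswork the class names remove.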

mikecook · 12-06-2008, 07:09 AM

Okay thanks Kovid, I will certainly give that some thought.
