09-30-2007, 06:21 PM | #31 |
Feedbooks.com Co-Founder
Posts: 2,263
Karma: 145123
Join Date: Nov 2006
Location: Paris, France
Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad
Here's a screenshot in FBReader. CSS and TOC are not yet supported in FBReader but overall, it works fine (I love the fact that hyphenation in FBReader is software-based).
For those of you using an iLiad, this should be sweet: you'll be able to download our ePub files directly using our iLiad software and open them thanks to the next port of FBReader.
09-30-2007, 08:35 PM | #32 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Dale
10-01-2007, 12:07 PM | #33 |
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
hadrien, your books look very nice! congratulations!
on average, how long does it take you to work up a book, say from project gutenberg, to put into your database? 5-10 minutes, 15-30 minutes, 1-2 hours, 2-4 hours?

-bowerbird

Last edited by bowerbird; 10-01-2007 at 12:09 PM.
10-01-2007, 01:30 PM | #34 |
Feedbooks.com Co-Founder
Posts: 2,263
Karma: 145123
Join Date: Nov 2006
Location: Paris, France
Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad
Quote:
The good thing is that, unlike fully manually created books, as soon as we add a new output it's available on ALL of our books (and we still get a full TOC, footnotes, etc.). We also make advanced use of the metadata: you can browse the website in many different ways, we've got an API that makes it possible for any application or website to interact with Feedbooks (our iLiad application, for example), and a personal recommendation system. Anyone can contribute to adding books on Feedbooks: making the process easier will be one of our goals in the upcoming months.

The next output will be something totally different, not e-paper related, and it should appeal to another crowd too.
10-01-2007, 02:30 PM | #35 |
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
hadrien-
thanks... i did notice that, on the older project gutenberg e-texts which used all-upper-case to indicate italics, you haven't fixed that...

where can i get information on your a.p.i. for external apps?

-bowerbird
10-01-2007, 02:38 PM | #36 |
Groupie
Posts: 189
Karma: 793
Join Date: Oct 2006
Out of interest (I've just been spending way too much time restoring the accents in the PG text of Nostromo): do you have dictionary software that will restore accents automatically?
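[Editor's note: a minimal sketch of the dictionary-based approach andym is asking about. This is a hypothetical illustration, not Feedbooks' actual tooling; the tiny lexicon is invented, and a real one would come from a spell-check word list.]

```python
# Sketch of dictionary-based accent restoration: map de-accented words
# back to their accented forms via a lookup table built from a word list.
import re
import unicodedata

def strip_accents(word):
    """Remove combining accents: 'Gould' stays, 'señor' -> 'senor'."""
    nfd = unicodedata.normalize("NFD", word)
    return "".join(c for c in nfd if not unicodedata.combining(c))

# A tiny sample lexicon; a real one would come from a spell-check dictionary.
LEXICON = ["señor", "café", "Gould", "cordillera"]
RESTORE = {strip_accents(w).lower(): w for w in LEXICON if strip_accents(w) != w}

def restore_accents(text):
    def fix(match):
        word = match.group(0)
        accented = RESTORE.get(word.lower())
        if accented is None:
            return word
        # Preserve capitalisation of the original token.
        return accented.capitalize() if word[0].isupper() else accented
    return re.sub(r"[A-Za-z]+", fix, text)

print(restore_accents("The senor drank his cafe."))
```

The obvious limitation, and presumably why manual work is still needed: words whose de-accented form is itself a valid word (French "ou" vs "où") are ambiguous and can't be restored by lookup alone.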
10-01-2007, 04:08 PM | #37 |
Feedbooks.com Co-Founder
Posts: 2,263
Karma: 145123
Join Date: Nov 2006
Location: Paris, France
Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad
Quote:
bowerbird: On Project Gutenberg, italics are indicated with _, not all caps. I'll take a look at what all caps is used for exactly; I guess that's another thing we could add to our preprocessing.
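[Editor's note: a sketch of the sort of preprocessing pass Hadrien alludes to, turning PG's conventions into HTML emphasis. The rules here are assumptions for illustration, not Feedbooks' actual pipeline, and the all-caps rule is deliberately naive: acronyms like "USA" would be caught too, which is exactly the ambiguity discussed later in this thread.]

```python
# Plausible preprocessing pass: convert PG's _underscore_ italics and
# ALL-CAPS emphasis into HTML <em> tags.
import re

def pg_to_html(text):
    # _word or phrase_ -> <em>word or phrase</em>
    text = re.sub(r"_([^_]+)_", r"<em>\1</em>", text)
    # Runs of one or more all-caps words (2+ letters each) -> lower-cased <em>.
    # Naive: acronyms and small-caps usage are swept up as well.
    def demote(match):
        return "<em>" + match.group(0).lower() + "</em>"
    return re.sub(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b", demote, text)

print(pg_to_html("He was _very_ sure it was NOT SO."))
```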
10-01-2007, 05:52 PM | #38 |
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
actually, hadrien, i am extremely familiar with project gutenberg e-texts.
and the one thing i can tell you is that they're _consistently_ inconsistent. so yes, some early books used all-caps for italics, rather than underscores. and along the way, a variety of characters were used besides underscores... and up until 2003 or so, when i became a severe pain-in-the-neck to them on these issues, they didn't even feel any need to mark italics consistently... even worse, they used all-caps for bold as well, and likewise felt no need to be consistent with that either. (sometimes they didn't mark bold at all.)

i know all this because i have been working for some time now on means of interpreting the p.g. e-texts in a way that restores the structural information. the same type of work you do when you put texts into your database, except i leave them as text. (so ordinary humans can continue to work with them...)

i've invented a form of non-markup markup -- i call it "zen markup language", or z.m.l. (it's two steps more advanced than x.m.l.) -- where such structural information is represented by a simple set of unobtrusive light-markup rules. for instance, a regular chapter-header is preceded by 4 blank lines and followed by 2 blank lines, thus allowing a viewer-application (which i've also programmed) to automatically form a table of contents that is auto-hot-linked to the chapters... other simple rules -- easy enough to be understood by a fourth-grader -- underlie all of the other structures that are commonly found in books...

you can see work that i've done, in action, by visiting this web-page:
> http://z-m-l.com/go/vl3.pl
you'll be particularly interested in the "test-suite" and "rules" examples...

i believe intelligent viewer-programs interpreting plain-ascii input e-texts and presenting them in typographically-sophisticated ways is _the_ future.
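[Editor's note: the one concrete z.m.l. rule stated in this post (a chapter header is preceded by 4 blank lines and followed by 2) can be sketched as code. This is a guess at a single rule based only on this post, not real z.m.l. tooling.]

```python
# Detect chapter headers by bowerbird's stated blank-line rule:
# a non-blank line preceded by 4 blank lines and followed by 2.
# A viewer could build an auto-linked TOC from the result.
def find_headers(text):
    lines = text.split("\n")
    headers = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        before = lines[i - 4:i]
        after = lines[i + 1:i + 3]
        if (i >= 4 and all(not l.strip() for l in before)
                and len(after) == 2 and all(not l.strip() for l in after)):
            headers.append((i, line.strip()))  # (line number, TOC entry)
    return headers

sample = "title\n" + "\n" * 4 + "CHAPTER ONE\n\n\nIt was a dark night.\n"
for lineno, heading in find_headers(sample):
    print(lineno, heading)
```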
the publishing companies, of course, in an attempt to raise the cost of entry, will try to force e-books into the complexity of heavy-markup, but i believe the revolution into self-publishing will push back with light-markup systems. authors don't want to battle steep learning curves. they just want to write...

-bowerbird
10-01-2007, 07:33 PM | #39 |
Connoisseur
Posts: 66
Karma: 614
Join Date: Jul 2007
Location: New York
Device: Sony PRS-505, iLiad Book Edition
There may actually be some consistency, at least, in PG's inconsistency. In some texts, they seem to distinguish between italics used in the original for emphasis, represented in the PG text by all caps, and italics used for other purposes (setting off foreign words and phrases, titles, etc), represented in the PG text by fore-and-aft underscores.
PG texts also use all caps to represent original small caps and caps-and-small-caps.
10-01-2007, 09:36 PM | #40 |
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
could be. it's hard to know without looking at the scans.
and even if you have the scans, the fact that p.g. has rewrapped the text makes it hard to do the comparison. it ends up it's easier to re-o.c.r., and use the p.g. e-text to do corrections. thank goodness google is scanning...

and it ends up that leaving the all-upper-case words is not all that bad. it accomplishes the emphasis purpose. but there are a raft of problems like this, such as the failure to indicate the lines that shouldn't be wrapped (e.g., in address-blocks, tables, signature-blocks, etc.)

oh well, it's been a puzzle to occupy my mind... :+)

-bowerbird
10-02-2007, 01:09 AM | #41 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Many of the problems are due to the idea that you can exchange data in plain-text format. This is a fallacy for books, particularly novels where dialog is involved. Almost every book I post takes extensive review and modification to fix things that were supposed to be OK already.
Dale
10-02-2007, 04:40 AM | #42 |
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
dale, i'm not sure i understand your point. got any examples?
-bowerbird
10-02-2007, 04:57 AM | #43 |
Groupie
Posts: 189
Karma: 793
Join Date: Oct 2006
Last edited by andym; 10-02-2007 at 04:59 AM.
10-02-2007, 05:41 AM | #44 |
Banned
Posts: 269
Karma: -273
Join Date: Sep 2006
Location: los angeles
andy said:
> Though be grateful for the fact that > the text is out there at all and > you don't have to OCR it yourself! well heck, i'm _extremely_ grateful for project gutenberg. as the forerunner of _all_ the net collaboration projects, including wikipedia, it has _tremendous_ value to me... so that's first and foremost. having said that, however, o.c.r. ain't difficult these days. scanning (and all that it entails, including rounding up a hard-copy to scan) is the hardest part of the equation, and google (and others) are taking care of all that hassle. but yeah, as i said, correcting that o.c.r. is where all the p.g. e-texts will come in handy, in the next cyberlibrary. > Also you can see the issue from the point of view > of the original transcribers as well. For example > I've just been restoring the italics in the PG text of Nostromo, > and very often the transcriber users initial caps for a word > that was originally in italics - probably a more elegant and > reader-friendly solution than using forward slashes for italicized words. well, maybe. the problem is, though, that it's an ambiguous coding, so it becomes impossible to restore things to their original state... a forward-slashes method -- while maybe not "reader-friendly" -- would have at least been unambiguous enough to easily un-do... > I don't understand why you would need a new mark-up, > correctly used, html mark-up [eg h1 for the book title > h2 for the part or section title and h3 for the chapter] > gives you all the semantic information you need. well, the problem with .html is that its obtrusive markup makes it hard to maintain (e.g., correct, edit, compare, update, re-mix, etc.), as well as to read in the underlying "master" format. 
do a view-source on this page:
> http://z-m-l.com/go/test-suite.html
then compare that source-html to this page:
> http://z-m-l.com/go/test-suite.zml

particularly since the .zml file actually _generated_ the .html one, i think it's pretty easy to tell which file would be easier to maintain, especially with a library of thousands of e-texts (let alone millions).

and then of course when you ratchet up the difficulty to the level of the .epub format, where each e-text file needs accompanying files, you're just asking for trouble.

in my view, complex formats like that are simply the old-guard dinosaur publishing-houses attempting to raise the cost-of-entry for us "amateur" newbies, whose new capacity for self-publishing will totally and completely subvert their business. they're attempting to find a way to maintain their status as middlemen, so they can continue to siphon off a good percentage of the revenue...

> Personally I believe that plain vanilla html
> (or its baby siblings markdown, textile etc) is the new ascii.

markdown and textile are both light-markup systems, and thus of the same type as my zen markup language. (except my z.m.l. is even less obtrusive than they are.) but yes, this is the way of the future. authors want to write, not be caught up in unnecessary complexities of file-formats.

-bowerbird
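[Editor's note: the "plain-text master generates the html" idea can be illustrated with a toy converter. The two rules below are invented for illustration; they are not actual z.m.l., markdown, or textile rules.]

```python
# Toy light-markup-to-HTML pass: the plain-text file stays the master,
# and HTML is a generated output, never hand-edited.
import html

def light_to_html(text):
    out = []
    for block in text.strip().split("\n\n"):      # blank line separates blocks
        block = html.escape(block.strip())
        if block.isupper():                        # all-caps block -> heading
            out.append("<h2>%s</h2>" % block.title())
        else:                                      # anything else -> paragraph
            out.append("<p>%s</p>" % block.replace("\n", " "))
    return "\n".join(out)

print(light_to_html("CHAPTER ONE\n\nIt was a dark\nand stormy night."))
```

The design point being debated in the thread: the plain-text source stays readable and diffable by "ordinary humans", while the generated HTML carries the h2/p semantics andym wants.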
10-02-2007, 10:41 AM | #45 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Other dialog problems include accent marks and trying to show dialects in the text. These are tough even with a full font collection, and are made much more difficult using only ASCII characters. Bold, italics and special symbols get lost in translation to ASCII; surely you have noticed this. Many period books use unusual spellings and other specialized constructions with foreign words that can fool spell checkers, requiring intervention that often does not get done in the process.

Dale