Old 03-24-2021, 10:38 PM   #1
jadhvaryu
Junior Member
Posts: 8
Karma: 10
Join Date: Mar 2021
Device: iPad
Gutenberg ebooks

** This is my first time here, so pardon me if this is a repeat question **

Hi All,

For my project, I need to pull Gutenberg ebooks (in HTML and EPUB formats) based on genre, language and author.

However, I checked more than 100 books at random and found that most have missing or incomplete genres and authors.

Is this generally true, or am I making a mistake?

Also, I am using ebooklib to read EPUB files but find it has a lot of limitations.

I have been struggling with this topic for several weeks now and would much appreciate any guidance in the right direction.

Thanks in advance
Old 03-25-2021, 02:13 AM   #2
Sarmat89
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
I wouldn't bother with Project Gutenberg. They still use a 1990s toolset to produce their books, and the result sucks.
Old 03-25-2021, 08:17 AM   #3
Quoth
the rook, bossing Never.
Posts: 11,158
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
It's the best source of public domain books.
Any formatting or metadata issues are easily fixed with Calibre.

I've no idea what "ebooklib" is. E-ink readers such as Kobo and Kindle are best. Maybe the Kobo Libra is the best-value 7" reader without adverts. Then there are very many poor apps on iOS and Android.
KOReader (available from its website) allows some changes to formatting on an ereader or on Android (install the APK). I use it on a Boyue Likebook Mars.
For an old Android phone or tablet, use Aldiko Classic (Play Store), and for a newer one use Lithium. The wiki here has listings for iOS (Apple).
Old 03-25-2021, 02:10 PM   #4
jadhvaryu
Thank you Sarmat89 and Quoth.

Quoth, ebooklib is a Python package that lets a Python program read EPUB books. How does Calibre help? Does it have a repository of ebooks with enriched metadata that I can download or mirror for free, like Project Gutenberg?

Sarmat89, I am curious to know the reasons for your comments on Gutenberg. Is there an alternative you find better?

Thanks!
Old 03-26-2021, 09:53 AM   #5
Sunlite
Addict
Posts: 206
Karma: 547516
Join Date: Mar 2008
Location: Berlin, Germany
Device: KObo Clara, Kobo Aura, PRS-T1, PB602, CyBook Gen3
Calibre is a library tool to handle your ebooks. It can download metadata for your books.

It also contains an ebook viewer to read your books and an editor to correct bad formatting.

Calibre can also convert many ebook formats into one another.

You can find many well-groomed public domain ebooks right here in the MobileRead library:
Patricia Clark Memorial Library
Old 03-26-2021, 10:39 AM   #6
jadhvaryu
Quote:
Originally Posted by Sunlite View Post
Calibre is a library tool to handle your ebooks. It can download metadata for your books.

It also contains an ebook viewer to read your books and an editor to correct bad formatting.

Calibre can also convert many ebook formats into one another.

You can find many well-groomed public domain ebooks right here in the MobileRead library:
Patricia Clark Memorial Library
Thank you Sunlite.
Old 03-26-2021, 12:42 PM   #7
Quoth
Sometimes the metadata found is for a DIFFERENT book, or has errors. So everything needs to be reviewed by a human.

Better advice is possible if you explain not a particular issue but the end result you are after.
Old 03-26-2021, 04:13 PM   #8
BobC
Guru
Posts: 691
Karma: 3026110
Join Date: Dec 2008
Location: Lancashire, U.K.
Device: BeBook 1, BeBook Pure, Kobo Glo, (and HD),Energy Sistem EReader Pro +
Quote:
Originally Posted by jadhvaryu View Post

For my project, I need to pull Gutenberg ebooks (in HTML and EPUB formats) based on genre, language and author.

However, I checked more than 100 books at random and found that most have missing or incomplete genres and authors.

Is this generally true, or am I making a mistake?
I'm surprised you are finding problems with authors' names - Gutenberg will normally show these correctly. Regarding "genres": if you look at the bibliographic record for a Gutenberg book, you will see that it doesn't use "genres" but the LoC (Library of Congress) class. This Wikipedia page - https://en.wikipedia.org/wiki/Librar...Classification - shows the various "head" classifications, which you might be able to map to your "genres".
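
A minimal sketch of that mapping idea in Python. The main-class headings are abbreviated from the Library of Congress Classification outline; the `GENRE_OVERRIDES` table and the function name `genre_for_loc_class` are hypothetical choices for illustration, and a real project would pick its own genre labels.

```python
# Top-level Library of Congress Classification letters (headings abbreviated).
LOC_MAIN_CLASSES = {
    "A": "General Works",
    "B": "Philosophy, Psychology, Religion",
    "C": "Auxiliary Sciences of History",
    "D": "World History",
    "E": "History of the Americas",
    "F": "History of the Americas",
    "G": "Geography, Anthropology, Recreation",
    "H": "Social Sciences",
    "J": "Political Science",
    "K": "Law",
    "L": "Education",
    "M": "Music",
    "N": "Fine Arts",
    "P": "Language and Literature",
    "Q": "Science",
    "R": "Medicine",
    "S": "Agriculture",
    "T": "Technology",
    "U": "Military Science",
    "V": "Naval Science",
    "Z": "Bibliography, Library Science",
}

# Hypothetical, hand-picked genre labels for a few two-letter subclasses.
GENRE_OVERRIDES = {
    "PR": "English Literature",
    "PS": "American Literature",
    "PZ": "Fiction and Juvenile Literature",
}


def genre_for_loc_class(loc_class: str) -> str:
    """Map an LoC class code such as 'PR' or 'QA' to a genre label."""
    code = loc_class.strip().upper()
    # Prefer a specific two-letter override, then fall back to the main class.
    if code[:2] in GENRE_OVERRIDES:
        return GENRE_OVERRIDES[code[:2]]
    return LOC_MAIN_CLASSES.get(code[:1], "Unclassified")
```

The two-level lookup keeps the broad head classes as a safety net while letting you refine only the subclasses you care about.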

However, when you download a Gutenberg EPUB, the "tags" that you get appear to be those listed under "Subject" in the bibliographic record, not the LoC class.

BobC
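
For anyone who wants to check which "tags" an EPUB actually carries, here is a rough sketch that reads the Dublin Core fields from an EPUB's package (OPF) document using only the Python standard library. The function name `read_dc_metadata` and the sample OPF are made up for illustration; a real Gutenberg OPF will contain more fields than shown.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace used for metadata elements in an EPUB package file.
DC_NS = "http://purl.org/dc/elements/1.1/"


def read_dc_metadata(opf_xml: str) -> dict:
    """Collect title, creator, language and subject values from OPF XML."""
    root = ET.fromstring(opf_xml)

    def values(name):
        # Every <dc:NAME> element anywhere in the document, whitespace-trimmed.
        return [
            el.text.strip()
            for el in root.iter(f"{{{DC_NS}}}{name}")
            if el.text and el.text.strip()
        ]

    return {key: values(key) for key in ("title", "creator", "language", "subject")}


# Made-up sample package document, not taken from a real book.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Title</dc:title>
    <dc:creator>Doe, Jane</dc:creator>
    <dc:language>en</dc:language>
    <dc:subject>Example subject -- Fiction</dc:subject>
    <dc:subject>Another example subject</dc:subject>
  </metadata>
</package>"""

metadata = read_dc_metadata(SAMPLE_OPF)
```

Each field comes back as a list because an OPF may repeat an element, which is typical for subjects.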
Old 03-26-2021, 04:15 PM   #9
jadhvaryu
Quote:
Originally Posted by Quoth View Post
Sometimes the metadata found is for a DIFFERENT book, or has errors. So everything needs to be reviewed by a human.

Better advice is possible if you explain not a particular issue but the end result you are after.
Hi Quoth,

I am part of a team working on a purpose-built reader around free Gutenberg books, and we need enriched metadata to help with book search and selection. Essentially, we need title, abstract/summary, author(s), publisher(s), genre(s), keyword(s)/tag(s) and ISBN. For now, we only care about English books.

Also, the book format we prefer is HTML.

As I mentioned earlier, I randomly checked more than 100 books and found the completeness of the metadata to be consistently poor.

Hence we need a way to enrich it.

I played with Calibre a bit, but it seems that:
- search results are limited to 25 books
- downloads are interactive only, one format at a time
- the metadata gathered still seems limited. (I tried the popular book "Complete Works" by William Shakespeare, but the metadata was still not enough.)

Also, I have already downloaded a big set of Gutenberg ebooks (a zip file of the HTML versions). Can I 'import' these books into Calibre?

Lastly, the purpose-built reader will be priced for profit. We will not charge for the Gutenberg books, just the reader. If we end up using Calibre to maintain our books and update metadata, who can we talk to about usage and licensing terms?

Many thanks!
Old 03-26-2021, 04:20 PM   #10
jadhvaryu
Quote:
Originally Posted by BobC View Post
I'm surprised you are finding problems with authors' names - Gutenberg will normally show these correctly. Regarding "genres": if you look at the bibliographic record for a Gutenberg book, you will see that it doesn't use "genres" but the LoC (Library of Congress) class. This Wikipedia page - https://en.wikipedia.org/wiki/Librar...Classification - shows the various "head" classifications, which you might be able to map to your "genres".

However, when you download a Gutenberg EPUB, the "tags" that you get appear to be those listed under "Subject" in the bibliographic record, not the LoC class.

BobC
Hi BobC,

I will look up the LoC classes. You mention Gutenberg bibliographic records. Are they embedded in the ebook itself or kept as separate data? Could you point me to them?

Thank you for your comments. They are helpful and possibly point me in the right direction to meet my need.

Best regards
Old 03-26-2021, 04:55 PM   #11
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by jadhvaryu View Post
Essentially: we need title, abstract/summary, author(s), publisher(s), genre(s), keyword(s)/tag(s) and ISBN#
Title and author(s): you can expect them.

Abstract/summary, genre(s), keyword(s)/tag(s) are all subjective. Someone has to fill them in, and the values person A chooses will be different from those of person B. I don't know if any Gutenberg book has any of those, but I wouldn't trust them any more than the values you could get from any random bookstore...

Publisher(s) and ISBN make no sense for Gutenberg books. There is no publisher, or they're all "Project Gutenberg". Even if the transcription is initially based on a printed book with an actual publisher, that doesn't mean the Gutenberg book has that publisher. ISBNs are specific to particular editions. Every paper edition of a particular work has a different number (assigned, at least partially, by some external authority). If a book is officially published in both paper and electronic formats, it will most likely have a separate ISBN for each. The vast majority of Gutenberg books were published long before ISBNs existed, and even if a book is based on a printed version with an ISBN, that is definitely not the ISBN of the Gutenberg book. Gutenberg books are not facsimiles of printed editions, they're just another version (without an ISBN).
Old 03-26-2021, 05:02 PM   #12
jadhvaryu
Quote:
Originally Posted by Jellby View Post
Title and author(s): you can expect them.

Abstract/summary, genre(s), keyword(s)/tag(s) are all subjective. Someone has to fill them in, and the values person A chooses will be different from those of person B. I don't know if any Gutenberg book has any of those, but I wouldn't trust them any more than the values you could get from any random bookstore...

Publisher(s) and ISBN make no sense for Gutenberg books. There is no publisher, or they're all "Project Gutenberg". Even if the transcription is initially based on a printed book with an actual publisher, that doesn't mean the Gutenberg book has that publisher. ISBNs are specific to particular editions. Every paper edition of a particular work has a different number (assigned, at least partially, by some external authority). If a book is officially published in both paper and electronic formats, it will most likely have a separate ISBN for each. The vast majority of Gutenberg books were published long before ISBNs existed, and even if a book is based on a printed version with an ISBN, that is definitely not the ISBN of the Gutenberg book. Gutenberg books are not facsimiles of printed editions, they're just another version (without an ISBN).
Thank you Jellby. Your comments help me understand this space better.

Do you have any comments on ebook formats? We are planning to stick to the HTML format and not EPUB (which is also zipped HTML). What extra advantage does EPUB have over plain HTML (except maybe rights management)?

Best regards
Old 03-26-2021, 05:03 PM   #13
Quoth
You download the books OUTSIDE of Calibre and import them!

HTML is a poorer format to download from Gutenberg. Better to download epub or mobi (called Kindle now) and then convert it if you want HTML. But HTML on its own is poor for ebooks. That's why Mobi and epub were invented. An epub is really a zipped directory with HTML files for the body (typically one per chapter), CSS (typically two files), a system index (the table of contents), a resource file listing what is in it (the manifest), image files and font files, if used.
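
To see that layout for yourself, here is a small Python sketch (standard library only) that opens an epub as a zip, follows `META-INF/container.xml` to the package file, and lists what the manifest declares. The names `list_manifest` and `build_sample_epub` are hypothetical, and the sample epub is a made-up, incomplete file that exists only so the sketch runs without a download.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def list_manifest(epub_file):
    """Return (path inside the zip, media type) for each manifest item."""
    with zipfile.ZipFile(epub_file) as zf:
        # META-INF/container.xml points at the package (OPF) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF
        return [
            (posixpath.join(base, item.get("href")), item.get("media-type"))
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        ]


def build_sample_epub():
    """Build a tiny, made-up epub skeleton in memory for demonstration."""
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>'
    )
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
        "<manifest>"
        '<item id="ch1" href="chapter1.html" media-type="application/xhtml+xml"/>'
        '<item id="css" href="style.css" media-type="text/css"/>'
        '<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>'
        "</manifest></package>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("OEBPS/content.opf", opf)
    buf.seek(0)
    return buf


items = list_manifest(build_sample_epub())
```

`list_manifest` also accepts a path to a real .epub file, since `zipfile.ZipFile` takes either a filename or a file-like object.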

HTML is not a sensible format for any sort of proper ereader. The HTML is only for people that want to use a web browser, which is historic and madness today.

Don't implement a reader that simply uses Gutenberg on demand. That's "cloud madness philosophy". Have a browser that can download, or a directory to import to and have the reader only use local files.
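
One way to picture the "local files only" approach, as a Python sketch under stated assumptions: the reader asks a small helper for a book, and the helper touches the network only if the file is not already in the local library directory. The function name `ensure_local_copy`, the `pg<ID>.epub` file naming and the injected `fetch` callable are all hypothetical; the actual download code is deliberately left out.

```python
from pathlib import Path
from typing import Callable


def ensure_local_copy(
    book_id: int,
    library_dir: Path,
    fetch: Callable[[int], bytes],
) -> Path:
    """Return the local path of a book, downloading it only on first use.

    `fetch` is whatever function you supply to download the bytes of a
    book; the reader itself only ever opens the returned local file.
    """
    library_dir.mkdir(parents=True, exist_ok=True)
    target = library_dir / f"pg{book_id}.epub"  # hypothetical naming scheme
    if not target.exists():
        target.write_bytes(fetch(book_id))
    return target
```

Passing the downloader in as a parameter keeps the reader testable offline and makes it easy to swap in a manual import directory later.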

Calibre is primarily a program to manage ebooks already copied to the computer. It imports a copy. Then you can manage metadata, searches, conversions and transfers to an ereader, or to storage on a phone/tablet that has an ereader app.
Oddly, there is a full-contents search for epubs as an option in the "Quality Check" tool.

Normally epub is the best format to use, but Mobi is older. Some mobi files may have both old mobi and Kindle KF8/azw3 in the same file! However, I find downloading "Kindle Format" from Gutenberg works best. Then I import that into Calibre and convert to ePub2, using various options to fix quotes, remove paragraph spacing, set a 1.4em first-line indent and embed the Georgia font. I set line height and minimum line height both to zero to allow the user to change them, subset fonts, etc.
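
For batch work, a conversion like this can also be scripted through Calibre's `ebook-convert` command-line tool. The sketch below only builds the command in Python; the mapping from the options described above to these flags is my assumption (for example, "fix quotes" to `--smarten-punctuation`), so check the flag names and units against `ebook-convert --help` for your Calibre version before relying on it. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(src: Path, dst: Path) -> list:
    """Build an ebook-convert command line approximating the settings above.

    Flag names are assumed from Calibre's ebook-convert; verify them locally.
    """
    return [
        "ebook-convert", str(src), str(dst),
        "--epub-version", "2",                             # produce ePub2
        "--smarten-punctuation",                           # assumed: "fix quotes"
        "--remove-paragraph-spacing",
        "--remove-paragraph-spacing-indent-size", "1.4",   # indent in em
        "--embed-font-family", "Georgia",                  # font must be installed
        "--subset-embedded-fonts",
        "--line-height", "0",
        "--minimum-line-height", "0",
    ]


def convert(src: Path, dst: Path) -> None:
    """Run the conversion; requires Calibre to be installed and on PATH."""
    subprocess.run(build_convert_command(src, dst), check=True)


command = build_convert_command(Path("book.mobi"), Path("book.epub"))
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to print and inspect the command before running it over a whole directory.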

Then I check the cover and other metadata in Edit Metadata. There are plug-ins to search very many websites.

I make sure the author name is consistent and correct before a metadata search.

I was using Gutenberg when Kindle format was called Mobi, epub didn't exist, Kindle didn't exist and the first eink (by Sony) was later that year.
Palm PDAs and Symbian Phones were probably the first gadgets other than laptops you could read ebooks on before dedicated ereaders existed. Gutenberg is THAT old, which is why text and HTML are offered.

Last edited by Quoth; 03-26-2021 at 05:13 PM.
Old 03-26-2021, 05:08 PM   #14
Quoth
Quote:
Originally Posted by Jellby View Post
Abstract/summary, genre(s), keyword(s)/tag(s) are all subjective. Someone has to fill them in and the values person A may choose will be different from those of person B. … but I wouldn't trust them any more than the values you could get from any random bookstore.

Publisher(s) and ISBN make no sense for Gutenberg books. There is no publisher, or they're all "Project Gutenberg". Even if the transcription is initially based on a printed book with an actual publisher, that doesn't mean the Gutenberg book has that publisher. ISBNs are specific to particular editions. Every paper edition of a particular work has a different number (assigned, at least partially, by some external authority). If a book is officially published in both paper and electronic formats, it will most likely have a separate ISBN for each. The vast majority of Gutenberg books were published long before ISBNs existed, and even if a book is based on a printed version with an ISBN, that is definitely not the ISBN of the Gutenberg book. Gutenberg books are not facsimiles of printed editions, they're just another version (without an ISBN).
This is EXACTLY what I was writing less clearly earlier. Forget ISBN totally. It's only relevant to selling or publishing.
You need to decide what the genre(s) are. Some books belong to more than one and some don't fit easily, especially the older ones.

Any ISBN you find will be a particular modern reprint and might not even be exactly the same text. It might be revised.
Different SIZE editions of a paper book with the same text have different ISBNs!
Old 03-26-2021, 05:14 PM   #15
jadhvaryu
Quote:
Originally Posted by Quoth View Post
You download the books OUTSIDE of Calibre and import them!

HTML is a poorer format to download from Gutenberg. Better to download epub or mobi and then convert it if you want HTML. But HTML on its own is poor for ebooks. That's why Mobi and epub were invented. An epub is really a zipped directory with HTML files for the body (typically one per chapter), CSS (typically two files), a system index (the table of contents), a resource file listing what is in it (the manifest), image files and font files, if used.

HTML is not a sensible format for any sort of proper ereader. The HTML is only for people that want to use a web browser, which is historic and madness today.

Don't implement a reader that simply uses Gutenberg on demand. That's "cloud madness philosophy". Have a browser that can download, or a directory to import to and have the reader only use local files.

Calibre is primarily a program to manage ebooks already copied to the computer. It imports a copy. Then you can manage metadata, searches, conversions and transfers to an ereader, or to storage on a phone/tablet that has an ereader app.
Oddly, there is a full-contents search for epubs as an option in the "Quality Check" tool.

Normally epub is the best format to use, but Mobi is older. Some mobi files may have both old mobi and Kindle KF8/azw3 in the same file! However, I find downloading "Kindle Format" from Gutenberg works best. Then I import that into Calibre and convert to ePub2, using various options to fix quotes, remove paragraph spacing, set a 1.4em first-line indent and embed the Georgia font. I set line height and minimum line height both to zero to allow the user to change them, subset fonts, etc.

Then I check the cover and other metadata in Edit Metadata. There are plug-ins to search very many websites.

I make sure the author name is consistent and correct before a metadata search.
I truly appreciate your guidance, Quoth. We started by looking at the epub format and found the mapping from toc.ncx to the content HTML a bit complicated, especially as a content file did not necessarily seem to end at the end of a chapter (a typical logical unit).

But will look again.

Thanks!
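
On the toc.ncx mapping: each `navPoint` in the NCX carries a label and a `src` that is a content file plus an optional `#fragment`, so several chapters can live in one HTML file and a chapter may run from its anchor to the next entry's target rather than to the end of the file. A minimal Python sketch of reading those entries follows; the function name `toc_entries` and the sample NCX are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

NCX_URI = "http://www.daisy.org/z3986/2005/ncx/"
NCX_NS = {"ncx": NCX_URI}


def toc_entries(ncx_xml: str) -> list:
    """Return (label, content file, fragment) for every navPoint, in order."""
    root = ET.fromstring(ncx_xml)
    entries = []
    # iter() walks nested navPoints too, in document order.
    for point in root.iter(f"{{{NCX_URI}}}navPoint"):
        label = point.findtext(
            "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
        ).strip()
        src = point.find("ncx:content", NCX_NS).get("src")
        path, fragment = urldefrag(src)  # split "file.html#id" into two parts
        entries.append((label, path, fragment))
    return entries


# Made-up sample: two chapters share part1.html, the third has its own file.
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="np1" playOrder="1">
      <navLabel><text>Chapter I</text></navLabel>
      <content src="part1.html#ch1"/>
    </navPoint>
    <navPoint id="np2" playOrder="2">
      <navLabel><text>Chapter II</text></navLabel>
      <content src="part1.html#ch2"/>
    </navPoint>
    <navPoint id="np3" playOrder="3">
      <navLabel><text>Chapter III</text></navLabel>
      <content src="part2.html"/>
    </navPoint>
  </navMap>
</ncx>"""

entries = toc_entries(SAMPLE_NCX)
```

With the entries in reading order, the end of a chapter can be treated as the start of the next entry (same file, next anchor) or the end of the file when the next entry points elsewhere.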
Tags
ebook, ebooklib, epub, gutenberg, html


All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.