Old 10-11-2024, 04:05 PM   #1
Fitz Frobozz
Fitz Frobozz began at the beginning.
Posts: 28
Karma: 10
Join Date: May 2024
Device: Kindle Scribe
Digital preservation: ePUB for the archiving of text and other media

Greetings all. Forgive me if this has been hashed out to death already in other threads (pointers to them would be most welcome, especially if they are not super old).

TL;DR: Have you used ePUB to preserve or archive anything? How did it go? What did you learn?


I'm interested in seeing other perspectives and stories that relate to where I am on my path as I continue to explore the merits of ePUB, this time focusing on efforts to create a digital archive for collections of personal and family texts and photos.

For anyone reading this who might not know, the US Library of Congress's Recommended Formats Statement lists EPUB 3 as a preferred format for preserving digital text:

The Library of Congress Recommended Formats Statement (RFS) lists EPUB 3 as a Preferred format for Textual Works - Digital.
As an XML-based format using publicly documented schemas that represent the logical structure of a publication, EPUB satisfies most of the desired characteristics for formats for textual works, if the content files are not encrypted, if the file is not subject to technological protection that inhibits long-term preservation and access, and if all content is stored within the EPUB container. Bibliographic metadata records, for example, in the ONIX schema, may optionally be included in the EPUB container or may be available through a link to an external record. The Library of Congress would want to receive or access such metadata records in conjunction with ingestion of an EPUB publication.
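The three conditions in the statement above (no encryption, no blocking protection, all content inside the container) can be partly checked by machine, since an EPUB is a ZIP archive. Below is a minimal sketch using only the Python standard library. The function name `check_epub` is made up for illustration, and the checks are deliberately narrow: it looks for `META-INF/encryption.xml` (which the EPUB container format uses to record encrypted or obfuscated resources, including obfuscated fonts), and it compares the package manifest against the files actually present in the archive. It is not a substitute for a full validator.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def check_epub(source):
    """Return a list of problems found; `source` is a path or file-like object.

    An empty list only means these three checks found nothing.
    It does not prove the file is a valid EPUB.
    """
    problems = []
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())

        # Encrypted or obfuscated resources are declared in this file.
        if "META-INF/encryption.xml" in names:
            problems.append("encryption.xml present (encrypted or obfuscated content)")

        # META-INF/container.xml points at the package document (.opf).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            problems.append("container.xml has no rootfile entry")
            return problems
        opf_path = rootfile.get("full-path")
        opf_dir = posixpath.dirname(opf_path)

        # Every manifest item should resolve to a file inside the archive.
        package = ET.fromstring(zf.read(opf_path))
        for item in package.findall("opf:manifest/opf:item", OPF_NS):
            href = item.get("href", "")
            if "://" in href:
                problems.append(f"remote resource: {href}")
                continue
            target = posixpath.normpath(posixpath.join(opf_dir, unquote(href)))
            if target not in names:
                problems.append(f"listed in manifest but missing: {href}")
    return problems
```

Calling `check_epub("journal.epub")` (a hypothetical file name) returns a list of strings, one per finding, so the result can be logged alongside the archive.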
I would love to hear about projects others have been involved in that draw upon their best practices. What has worked best for you, and what hasn't? How did you solve issues or problems? Where did you end up deviating from their guidelines, and what new guidelines might you put forward if you had the chance?

As for me, at this juncture it's still early days and I'm still messing around, so I don't believe I have much to share yet that would be of significant value to anyone else, but I'll go ahead and outline it anyway.

The archive I'm looking at right now is a combination of my own stuff and stuff that I somehow ended up with that's been handed down over a few generations. The majority of it appears to be text, with photos being the next biggest media type, then audio and video, and finally an uncertain overall amount of digital material.

Needless to say, for the text I'm looking at epub3 for the final storage format, but am considering also using it as an optional and maybe convenient way to view other media types as well. (E.g. physical photos and AV would be stored using appropriate methods, and stored with them would be the unedited digital versions and then maybe some epubs that could make casual browsing a little easier for whoever digs up the vault in the future.)

As I began, I decided that I would start by experimenting on myself using my own personal paper-based journals. (I also have journals in other kinds of media, but I won't go into that here just yet.) First up is a small selection of 100 pages or so. When I have these pages in a form that I'm happy with, my plan is to then use whatever I learn from the process to do things properly as I begin to more seriously tackle other more fragile/important parts of the collection.

Currently the epub3 document I'm working with contains about 50 xhtml pages that are each devoted to a single color 1350x2000px PNG (16.5MB average file size). This image and file size already feels too big to me judging from a few little clues, and I'm wondering if I should cut it down by half. (Even that might be too large, for all I know.)
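For a sense of scale, here is a back-of-the-envelope calculation using only the figures quoted above. It computes the raw (uncompressed) size of a 1350x2000 color image and the total for 50 pages at the stated average. The helper name `raw_image_bytes` is made up for this sketch, and megabytes are taken as decimal (1 MB = 1,000,000 bytes), which is an assumption about how the file sizes were reported.

```python
def raw_image_bytes(width, height, channels=3, bits_per_channel=8):
    """Uncompressed pixel data size in bytes (no headers, no compression)."""
    return width * height * channels * bits_per_channel // 8


MB = 1_000_000  # decimal megabytes; an assumption about the reported sizes

rgb_8bit = raw_image_bytes(1350, 2000)                        # 3 channels, 8 bits each
rgb_16bit = raw_image_bytes(1350, 2000, bits_per_channel=16)  # 3 channels, 16 bits each
total_mb = 50 * 16.5                                          # 50 pages at the stated average

print(f"raw 8-bit RGB:  {rgb_8bit / MB:.1f} MB")   # 8.1 MB
print(f"raw 16-bit RGB: {rgb_16bit / MB:.1f} MB")  # 16.2 MB
print(f"50 pages:       {total_mb:.0f} MB")        # 825 MB
```

Since PNG compression is lossless and normally shrinks the pixel data, a 16.5 MB file at these dimensions is larger than raw 8-bit RGB would be. One possible explanation, and it is only a guess, is that the scanner is saving 16 bits per channel; the scan settings or an image viewer's properties dialog would confirm or rule that out.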

Next up will be experimenting with transcribed text, which I will mix in with the image-based pages. (Image, text, image, text, etc.) I may also mix in a small amount of additional media as required. For example, just this morning, on a page following a journal entry, I added a captioned 1995 photograph that shows something I was writing about in 2004. The photo is for illustration purposes only and it's currently a 2400x3300px (12MB) PNG grid of three images.

I'll leave it there. Fingers crossed that others will feel the urge to share, too.

Last edited by Fitz Frobozz; 10-11-2024 at 04:32 PM.
Old 10-11-2024, 04:20 PM   #2
Fitz Frobozz
Reserving just in case I decide to put project stuff here.
Old 10-13-2024, 02:32 PM   #3
Tex2002ans
Tex2002ans ought to be getting tired of karma fortunes by now.
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
Sounds great. Welcome to the forum.

Originally Posted by Fitz Frobozz View Post
Forgive me if this has been hashed out to death already in other threads (pointers to them would be most welcome, especially if they are not super old).

TL;DR: Have you used ePUB to preserve or archive anything? How did it go? What did you learn? [...]

I'm [...] focusing on efforts to create a digital archive for collections of personal and family texts and photos. [...]

I would love to hear about projects others have been involved in that draw upon their best practices.
I linked to a ton of my best summaries/resources/tutorials back in:

That should cover pretty much any/all best practices + digitization questions.

Originally Posted by Fitz Frobozz View Post
Currently the epub3 document I'm working with contains about 50 xhtml pages that are each devoted to a single color 1350x2000px PNG (16.5MB average file size). This image and file size already feels too big to me judging from a few little clues, and I'm wondering if I should cut it down by half. (Even that might be too large, for all I know.)
What are the images? Are they photographs? Can you post some samples?

Sounds to me like you may accidentally just be plopping in images of "scanned pages" into your EPUBs.

If your images are just scans of pages out of books, you'd need to OCR and change those into actual text.

If the images are photographs—like of people, trees, etc.—you can probably use JPGs instead of PNGs. That will save lots of space too.
Old 10-15-2024, 12:09 AM   #4
Fitz Frobozz
Originally Posted by Tex2002ans View Post
Sounds great. Welcome to the forum.

I linked to a ton of my best summaries/resources/tutorials back in:

That should cover pretty much any/all best practices + digitization questions.

What are the images? Are they photographs? Can you post some samples?

Sounds to me like you may accidentally just be plopping in images of "scanned pages" into your EPUBs.

If your images are just scans of pages out of books, you'd need to OCR and change those into actual text.

If the images are photographs—like of people, trees, etc.—you can probably use JPGs instead of PNGs. That will save lots of space too.

Oh, sorry about that, I could have been clearer above: I'm intentionally scanning pages in "photo mode", adding the resultant PNGs (or JPGs, or whatever I end up going with) to the ePub as they are, and manually transcribing the same pages into text. The goal is to provide both versions for every document in the collection.
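The "both versions for every document" idea can be sketched as one XHTML page per journal page, holding the scan and its transcription together. This is only one possible layout, not something the thread settles on. The function name `build_page`, the file path, and the sample text are all hypothetical, and blank lines in the transcription are assumed to separate paragraphs.

```python
from xml.sax.saxutils import escape, quoteattr

PAGE_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>{title}</title></head>
<body>
<figure>
<img src={src} alt={alt}/>
</figure>
<section>
{paragraphs}
</section>
</body>
</html>
"""


def build_page(title, image_href, transcription):
    """Return an XHTML page pairing a page scan with its transcription."""
    # Blank lines separate paragraphs (an assumption about the input text).
    paragraphs = "\n".join(
        f"<p>{escape(block.strip())}</p>"
        for block in transcription.split("\n\n")
        if block.strip()
    )
    return PAGE_TEMPLATE.format(
        title=escape(title),
        src=quoteattr(image_href),          # quoteattr adds the surrounding quotes
        alt=quoteattr(f"Scan of {title}"),
        paragraphs=paragraphs,
    )


page = build_page(
    "Journal, 3 May 2004",
    "images/p001.png",
    "Rain & wind all day.\n\nStayed in and wrote.",
)
```

Escaping the transcription matters here because handwritten text often contains characters such as `&` or `<` that would otherwise make the XHTML invalid. Each generated page would still need to be listed in the EPUB's manifest and spine.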

RE OCR, I'd be (very pleasantly) surprised if that were a viable option given that the majority of the original documents are handwritten.

Last edited by Fitz Frobozz; 10-15-2024 at 03:21 AM.
Old 10-15-2024, 03:24 AM   #5
Fitz Frobozz
Just a thought. Does Amazon use a standard and/or some library for their Scribe OCR, or something proprietary? I'm assuming it's homegrown/proprietary but thought I'd check. That OCR is surprisingly good. At least, it has been for my Scribe scribblings.
