MobileRead Forums


2scre 06-11-2021 11:18 AM

How do you get rid of all images in an ePub file downloaded from Archive.org?
 
When I download the ePub version of a book from Archive.org, I don't get pure text but text mixed with images of the book pages. Is there a way to get a pure-text version? Or is there a way to delete all images in an ePub file in Sigil?

Turtle91 06-11-2021 11:38 AM

Normally Sigil and/or Calibre questions would be asked in their respective forum.

However, to delete images, simply select the image(s) in the Book Browser pane on the left side of the Sigil window and press the Delete key.

You will probably also want to delete the code that references the images from your HTML file(s). That can be done with a regex find-and-replace:

search: <img.*?/>
replace: nothing/blank
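
Outside Sigil, the same search-and-replace can be sketched in Python. This is only a minimal sketch, assuming the chapter markup has already been read into a string; `strip_img_tags` is a made-up helper name, not part of Sigil or Calibre. The pattern is a little broader than the one above: it also matches `<img ...>` without the closing slash and tags that wrap across lines. It would still trip over an attribute value that contains a literal `>`.

```python
import re

# Matches <img ...> and <img .../>, case-insensitively. [^>] also matches
# newlines, so tags that wrap across several lines are covered.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_img_tags(xhtml: str) -> str:
    """Return the markup with every <img> tag removed (hypothetical helper)."""
    return IMG_TAG.sub("", xhtml)


# Made-up sample markup, only for illustration.
sample = '<p>Page 1</p><div><img src="../Images/p1.jpg" alt=""/></div><p>Text</p>'
print(strip_img_tags(sample))
```

Note that this leaves any now-empty wrapper elements (such as the `<div>` in the sample) in place, just as the regex in Sigil would.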

Quoth 06-11-2021 12:28 PM

Quote:

Originally Posted by 2scre (Post 4129459)
When I download the ePub version of a book from Archive.org, I don't get pure text but text mixed with images of the book pages. Is there a way to get a pure-text version? Or is there a way to delete all images in an ePub file in Sigil?

That's because it's a daft automatic conversion from scanned images.
What is suggested above works in the Calibre editor as well as in Sigil.

DiapDealer 06-11-2021 03:27 PM

I don't think I'd bother, myself. If you delete all those images of text, you'll probably be missing some content. My recommendation would be to delete the epub in question and find an alternative version.

JSWolf 06-11-2021 03:54 PM

Quote:

Originally Posted by DiapDealer (Post 4129537)
I don't think I'd bother, myself. If you delete all those images of text, you'll probably be missing some content. My recommendation would be to delete the epub in question and find an alternative version.

I agree that it's best to buy the eBook if a retail version exists. If not, go with the pBook version, or forget it and read something else.

Quoth 06-11-2021 04:56 PM

Or do your own OCR if it's really, really important public-domain content that isn't available as a cheap ebook.

Tex2002ans 06-14-2021 05:40 AM

Quote:

Originally Posted by 2scre (Post 4129459)
When I download ePub version of a book [...] is there a way to delete all images in an ePub file on Sigil?

Tools > Reports > Image Files

This lists all the images in the EPUB along with small preview thumbnails, so you can tell whether each one is useless or an actually important image.

You can then right-click each image and choose "Delete From Book".
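
For a quick look without opening Sigil: an EPUB is a ZIP archive, so the image files can also be listed with a few lines of Python. This is a rough sketch, not what Sigil's report does internally. It assumes images can be recognised by file extension alone, and `list_epub_images` is a made-up name for illustration.

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: these extensions cover the image types of interest.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"}


def list_epub_images(epub_file):
    """Return (name, size_in_bytes) for each image entry in the EPUB.

    epub_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_file) as epub:
        return [
            (info.filename, info.file_size)
            for info in epub.infolist()
            if PurePosixPath(info.filename).suffix.lower() in IMAGE_EXTENSIONS
        ]
```

Calling `list_epub_images("book.epub")` (with `book.epub` standing in for your own file) gives names and sizes only, with no thumbnails, so it is mainly useful for judging how many page scans a file contains.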

Turtle91 06-14-2021 05:48 PM

Quote:

Originally Posted by Tex2002ans (Post 4130132)
Tools > Reports > Image Files

This lists all the images in the EPUB along with small preview thumbnails, so you can tell whether each one is useless or an actually important image.

You can then right-click each image and choose "Delete From Book".


You can also multi-select with Ctrl+click or Shift+click, then press the Delete key to remove all of them at once.

DNSB 06-14-2021 07:00 PM

I've only gotten books from archive.org a couple of times. In both cases, what was displayed was the scanned image with the text layer hidden. I suspected this was an artifact of making the scanned PDF searchable, since the text files were fine lessons in how not to do OCR.

Tex2002ans 06-14-2021 10:19 PM

Quote:

Originally Posted by Turtle91 (Post 4130294)
You can also multi-select with Ctrl+click or Shift+click, then press the Delete key to remove all of them at once.

:thumbsup:

Quote:

Originally Posted by DNSB (Post 4130316)
I've only gotten books from archive.org a couple of times. In both cases, what was displayed was the scanned image with the text layer hidden. I suspected this was an artifact of making the scanned PDF searchable, since the text files were fine lessons in how not to do OCR.

Just discussed their "EPUBs" in this thread a few weeks ago:

"Archive.org ePub"

All of Archive.org's text formats are auto-generated OCR from the PDFs, with no cleanup whatsoever.

In Post #11, I even uploaded an EPUB straight out of FineReader 12, and you can see how much cleaner (and more readable) it is compared to the auto-generated junk.

This is why I always recommend getting the PDF from Archive.org, then converting it to text on your own if needed.

