Old 09-25-2014, 04:39 PM   #1
ardeur
Member
Posts: 15
Karma: 10
Join Date: Apr 2011
Location: CA
Device: Nook Simple Touch, 4th Gen Kindle Basic
Searching for corrupted epubs?

I have about 5000 epub books and I've begun to notice that a few of them are "corrupted" (all weird characters, partial words) when I put them on my Nook and open them. I'm pretty sure they were bad files that I downloaded from Gutenberg or another site.

Is there a feature in Calibre that can search for messed up files like these so I can delete them?
Old 09-25-2014, 06:02 PM   #2
theducks
Well trained by Cats
Posts: 30,967
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
A truly corrupt EPUB would not open at all: an EPUB is a ZIP archive, so the CRC check would fail.

What you describe sounds like bad OCR or the wrong character encoding.

Neither is normal for a completed Project Gutenberg (PG) book.
Are you sure you did not get a 'Proofing Project' file?
(But aren't those normally given out as partials?)

PG books all have a PG scan notes page up front.
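
For anyone who wants to rule out genuinely corrupt archives first, here is a minimal sketch of the ZIP/CRC check done in bulk outside Calibre, using only Python's standard `zipfile` module. The function names and the library path are made up for illustration, and passing this check says nothing about OCR or encoding problems inside the book.

```python
import zipfile
from pathlib import Path


def is_intact_epub(path):
    """Return True if the file opens as a ZIP and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the first bad name, or None.
            return zf.testzip() is None
    except (zipfile.BadZipFile, OSError):
        return False


def find_broken_epubs(library_dir):
    """Yield paths of .epub files under library_dir that fail the ZIP check."""
    for epub in sorted(Path(library_dir).rglob("*.epub")):
        if not is_intact_epub(epub):
            yield epub
```

Something like `for p in find_broken_epubs("/path/to/Calibre Library"): print(p)` would list the files that fail; an empty list suggests the problem is in the text, not the container.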
Old 09-25-2014, 06:56 PM   #3
ardeur
I downloaded from archive, manybooks, google books, and PG and that was years ago. I have no idea what I might have done to get these weird files. I'm thinking some of them may have come from google ebooks to be honest.
Old 10-01-2014, 06:44 PM   #4
LadyKate
Fanatic
Posts: 515
Karma: 1470724
Join Date: Jul 2013
Location: Quebec CA
Device: android 4 (samsung tablet and asus tablet)
How do these books look when you open them in calibre's reader? It sounds almost like the wrong character encoding was used in the conversion to EPUB.
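
To illustrate what a wrong-encoding conversion can look like, here is a rough sketch in Python. It assumes one specific mix-up (UTF-8 text that was read as Windows-1252), which is only a guess about what happened to these files, and the function name is made up for the example.

```python
def looks_like_mojibake(text):
    """Guess whether text is UTF-8 that was wrongly decoded as Windows-1252."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside cp1252, or the bytes are not
        # valid UTF-8; in both cases this particular mix-up did not happen.
        return False
    return repaired != text


# How the damage arises: UTF-8 bytes read with the wrong codec.
damaged = "It’s a café".encode("utf-8").decode("cp1252")
```

Here `damaged` comes out as `Itâ€™s a cafÃ©`, which is the kind of "weird characters" pattern this mix-up produces. Other encoding mix-ups would need a different pair of codecs.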
Old 02-13-2015, 08:42 AM   #5
Rob557
Zealot
Posts: 108
Karma: 810
Join Date: Jul 2012
Device: Kobo
Bulk Library Search for OCR Warning Indicators

Quote:
Originally Posted by ardeur View Post
Is there a feature in Calibre that can search for messed up files like these so I can delete them?
One way to separate out ebooks with potential problems is the Search ePub option in the Quality Check add-on for Calibre. You can search your library (using Quality Check's "search scope" setting, and specifying that only the text contents be searched) for any ePubs that contain the OCR warning indicator "�". You can also search for any ePubs that contain the OCR warning caret "^", but make sure you enter the search criterion as "\^", or else every book will be matched.
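
The reason for the backslash is presumably that the search pattern is treated as a regular expression, where a bare "^" means "start of text". A small Python illustration (this uses Python's own `re` module, and is only an assumption about how Search ePub interprets the pattern):

```python
import re

text = "No carets here."
# Unescaped, "^" is an anchor for the start of the text, so it matches anything.
assert re.search("^", text) is not None
# Escaped, it only matches a literal caret character.
assert re.search(r"\^", text) is None
assert re.search(r"\^", "tbe^ quick") is not None
```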

Having done that, and having used a temporary column to label the books that contain those OCR warnings, you can count the occurrences of those characters within any one book using the "search - Count All" feature in Sigil or in Calibre's book editor. Does anyone know of a Calibre feature that could do this count in bulk for the selected subset of books, so that the number for each book can be stored in a temporary sort column in Calibre and the most problematic books found more easily?
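
Failing a built-in feature, the count could be sketched outside Calibre with a short Python script that opens each EPUB as a ZIP and tallies the marker characters. Everything here is an assumption for illustration: the function names, the choice of "�" and "^" as markers, the crude tag stripping, and the idea that the book text lives in .xhtml/.html/.htm files. It does not write anything back into a Calibre column.

```python
import re
import zipfile
from pathlib import Path

# Assumed markers: U+FFFD and a literal caret, as discussed above.
MARKERS = re.compile("[\ufffd^]")
TAGS = re.compile(r"<[^>]+>")
TEXT_SUFFIXES = (".xhtml", ".html", ".htm")


def count_markers(epub_path):
    """Count marker characters in the text files of one EPUB."""
    total = 0
    with zipfile.ZipFile(epub_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(TEXT_SUFFIXES):
                # Bytes that are not valid UTF-8 become U+FFFD and get counted too.
                text = zf.read(name).decode("utf-8", errors="replace")
                total += len(MARKERS.findall(TAGS.sub("", text)))
    return total


def rank_library(library_dir):
    """Return (count, path) pairs for books with at least one marker, worst first."""
    results = []
    for epub in Path(library_dir).rglob("*.epub"):
        try:
            n = count_markers(epub)
        except zipfile.BadZipFile:
            continue  # unreadable archives are a separate problem
        if n:
            results.append((n, str(epub)))
    return sorted(results, reverse=True)
```

Printing the first few entries of `rank_library("/path/to/Calibre Library")` would give a rough "most problematic first" list to check by hand.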
Old 02-13-2015, 10:05 AM   #6
theducks
I believe that character is substituted by the rendering engine to stand for any one of many possible characters missing from the current font's character set.
Some OSes use a square; others display a box containing the (UTF?) code digits.
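
For what it's worth, "�" is also a real Unicode character in its own right (U+FFFD, REPLACEMENT CHARACTER), so it can be stored in the file itself and not only drawn by the renderer. One common way it gets there is a conversion step that hits bytes it cannot decode. A minimal Python sketch of that mechanism (the sample word is made up):

```python
raw = "café".encode("latin-1")          # b'caf\xe9', not valid UTF-8
text = raw.decode("utf-8", errors="replace")
assert text == "caf\ufffd"              # the bad byte became U+FFFD
assert "\ufffd" in text
```

If that is what happened during conversion, the original character is gone from the file and a text search for "�" will find the spot.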
Old 02-13-2015, 11:06 AM   #7
Rob557
As a test I inserted some symbols into Word and then tried changing the font for those symbols, which eventually produced the sort of white square that theducks mentioned. The code inside the white square only became visible when I copied it into this posting:  . If a user finds that certain characters (e.g. some quote characters) are being replaced by that sort of coded square, then that coded square could be used in addition to "�" and "^" to bulk-identify potential problem books. It looks, though, as if there may not be one generic white square for problems; instead, each one might contain a code unique to the problem character. There may be other error-indicator symbols (I seem to recall a small black square), and it may just be a matter of tripping across them and adding them to the list.

Theducks, in light of your experience with Calibre, I guess a second take-away from your response is that there doesn't seem to be any current functionality in Calibre or its add-ons that would let a user make a bulk determination, as per my prior post, of the number of occurrences of such character strings in the selected subset of problem books.
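
Since each coded square seems to stand for a different underlying character, one way to build up the list is to tally suspicious code points instead of guessing at symbols. A rough Python sketch follows; the choice of "suspicious" categories (private-use, unassigned, control characters, plus U+FFFD) is my own assumption, and the function name is hypothetical.

```python
import unicodedata
from collections import Counter


def suspicious_characters(text):
    """Tally characters that often signal damaged text, keyed by code point."""
    tally = Counter()
    for ch in text:
        category = unicodedata.category(ch)
        is_control = category == "Cc" and ch not in "\t\n\r"
        # Co = private use, Cn = unassigned; U+FFFD is the replacement character.
        if ch == "\ufffd" or is_control or category in ("Co", "Cn"):
            tally[f"U+{ord(ch):04X}"] += 1
    return tally
```

Running this over text extracted from a few known-bad books would show which code points actually turn up, and those could then be used as search criteria.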
Old 02-13-2015, 01:14 PM   #8
Rob557
Bulk Library Search for OCR Warning Indicators (partially solved)

It turns out that the Search ePub feature under the Quality Check add-on for Calibre includes an option to "show all occurrences". That option at least indirectly provides a bulk search capability to identify the number of times a special OCR-error character like � appears in the various ePubs.

If the "show all occurrences" option is selected (maybe best at first to limit this to a subset of the books identified as having at least one occurrence), then every occurrence within that subset of ePubs is listed on a separate line in the results log and that list can be copied into something like Excel to be analyzed to identify the most problematic books.
Old 02-16-2015, 11:29 AM   #9
Rob557
small black square too

As per the approach described above, here is another OCR error warning symbol that, in addition to "\^" and "�", can be used as a search criterion in Search ePub: "■".

As noted above, "show all occurrences" can be used to bulk-determine which ePubs may have the worst problems.

It's not yet clear which version of the code-embedded white square, e.g. "", might be used as a broad search criterion.
Old 02-16-2015, 11:49 AM   #10
Rob557
small black square - NO

My mistake.

I found that, at least for the ePubs I have encountered, searching for that small black square "■" does not really help much in finding ePubs with extensive OCR errors. When I checked ePubs with a lot of occurrences (whether or not there were also "^" warnings), that black square was almost invariably being used for other display reasons and not as an OCR error indicator.

Last edited by Rob557; 02-16-2015 at 11:51 AM.