#1 |
Connoisseur
Posts: 87
Karma: 1234
Join Date: Sep 2012
Device: Onyx Boox M92
Assess if a pdf file is editable
Dear Sirs,
I wonder whether there is a Calibre plug-in that can quickly assess whether a PDF ebook file is editable (that is, OCR'd) or not. Of course, it is always possible to open a PDF file manually and verify that its characters are ANSI text and not images, but that is not very practical with thousands of files! Thanks in advance for any answer.
#2 |
Deviser
Posts: 2,265
Karma: 2090983
Join Date: Aug 2013
Location: Texas
Device: none
Run the Count Pages plug-in, and then compare the number of pages to the Size (MB) shown in Calibre. Scanned (pure image) .pdf files often have only a few (e.g. 3-10) pages for a huge file size. Use the View Manager plug-in to Sort by pages, ascending, then Size, descending, to quickly compare.
DaltonST
Last edited by DaltonST; 06-06-2018 at 06:27 PM. Reason: clarified scanned .pdf files
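For anyone who prefers to try the comparison outside the GUI, here is a minimal sketch of the sort described above (pages ascending, then size descending) plus a size-per-page ratio. The `Book` record, the sample titles and numbers, and the 1 MB-per-page threshold are all made up for illustration; they are not taken from Calibre or from either plug-in.

```python
from typing import NamedTuple


class Book(NamedTuple):
    # Hypothetical record; in practice the values would come from the
    # page-count column and the Size (MB) column of the library.
    title: str
    pages: int
    size_mb: float


def sort_for_review(books):
    """Sort by pages ascending, then by size descending."""
    return sorted(books, key=lambda b: (b.pages, -b.size_mb))


def mb_per_page(book):
    """Size per page; a zero page count is treated as infinitely suspicious."""
    return book.size_mb / book.pages if book.pages else float("inf")


def flag_suspects(books, threshold_mb_per_page=1.0):
    """Return books whose size per page meets an arbitrary, assumed threshold."""
    return [b for b in sort_for_review(books)
            if mb_per_page(b) >= threshold_mb_per_page]


# Invented sample data, only to show the ordering and the flagging.
books = [
    Book("Novel A", 320, 1.4),
    Book("Scan B", 5, 48.0),
    Book("Manual C", 5, 0.6),
    Book("Scan D", 8, 95.0),
]

for b in flag_suspects(books):
    print(f"{b.title}: {b.pages} pages, {b.size_mb} MB")
```

The threshold is a guess and would need tuning against a handful of files whose status is already known.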
#3 |
Connoisseur
Posts: 87
Karma: 1234
Join Date: Sep 2012
Device: Onyx Boox M92
Dear Dalton,
thanks for the answer, but I am not sure I have understood your method correctly. Let me give you an example of a book that I OCR'd with Acrobat:
Image book: size 31.1 MiB, 344 pages
OCR'd book: size 19.8 MiB, 344 pages
I therefore think this difference in size is too small to be significant for recognising the image books in my library (of course, I do NOT have both records at the same time!).
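To put numbers on the example above, here is a small worked calculation of the size per page for the two versions. Only the sizes and page counts come from the post; the helper name `kib_per_page` is made up for illustration.

```python
def kib_per_page(size_mib, pages):
    """Average size per page in KiB, given a file size in MiB."""
    return size_mib * 1024 / pages


image_book = kib_per_page(31.1, 344)  # image-only version
ocr_book = kib_per_page(19.8, 344)    # OCR'd version

print(f"Image book: {image_book:.1f} KiB per page")
print(f"OCR'd book: {ocr_book:.1f} KiB per page")
print(f"Ratio: {image_book / ocr_book:.2f}")
```

This works out to roughly 93 KiB versus 59 KiB per page, a ratio of about 1.6, which is the small difference the post is pointing at.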
#4 |
Deviser
Posts: 2,265
Karma: 2090983
Join Date: Aug 2013
Location: Texas
Device: none
Scanned books are pure images, so do not have pagination like "real" books, since they have little "real" text. They are images of scanned text.

OCR'd books should have many more pages than their Scanned versions, since the whole purpose of OCR is to "read" the Scanned images and convert the information contained in the images into "real" text.

Your example of the same book in 2 formats (scanned images versus OCR) having the identical page-count is astonishing unless something is peculiar about the composition of the scanned book and then how it was OCR'd. Perhaps what you believe to be OCR'd books are actually scanned books that have had their images compressed, so the file sizes are smaller yet the page-counts are similar. Perhaps the OCR process was not successful, but the OCR program compressed the scanned images anyway during the creation of the output file of the OCR process.

In my libraries, I can easily tell a "scanned" epub from a "real" epub, because the "scanned" version has only a very few pages, but a huge file size. The "real" epubs may have many "pictures" causing them to be large in file size, but they also have a large number of pages.

Also, using the Library Codes plug-in to automatically extract an ISSN from a "scanned" file will fail, just as using the Job Spy plug-in to extract the Translator and Original Title will fail, but will succeed for a textual version (which you refer to as OCR'd).

Perhaps running all of your .pdf files through a (the same) .pdf compression utility will provide some clarity since that particular variable will have been made into a constant.

Well, regardless, the above are my thoughts. Hope they help you figure things out.

DaltonST
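The observation above that text extraction fails on a scanned file but succeeds on a textual one can also be checked directly from a script. What follows is only a rough sketch, not a Calibre plug-in: it assumes the third-party `pypdf` package is installed, and the function names, the 10-page sample, and the 25-characters-per-page threshold are arbitrary choices made for illustration.

```python
def looks_scanned(page_texts, min_chars_per_page=25):
    """Guess whether a PDF is image-only from the text extracted per page.

    Returns True when the average amount of extracted text per sampled
    page falls below the (assumed, arbitrary) threshold.
    """
    if not page_texts:
        return True
    total = sum(len((text or "").strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


def extract_page_texts(path, max_pages=10):
    """Extract text from the first few pages of a PDF.

    Assumption: the third-party pypdf package is available. It is
    imported here so that looks_scanned() can be used without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(path)
    texts = []
    for index, page in enumerate(reader.pages):
        if index >= max_pages:
            break
        texts.append(page.extract_text() or "")
    return texts
```

Usage would be along the lines of `looks_scanned(extract_page_texts("book.pdf"))`, run in a loop over the library folder. A file with only a cover image and a few text pages could be misjudged either way, so the result is a hint to check by hand rather than a verdict.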