
MobileRead Forums > E-Book Software > Calibre > Plugins

06-04-2018, 08:23 AM   #1
RotAnal
Connoisseur
Posts: 68
Karma: 1234
Join Date: Sep 2012
Device: Onyx Boox M92
Assess if a pdf file is editable

Dear Sirs,
I wonder whether there is a Calibre plug-in that can quickly assess whether a PDF e-book is editable (that is, OCR'ed) or not. Of course, it is always possible to open a PDF manually and verify that its characters are actual text and not images, but that is not very practical with thousands of files!
Thanks in advance for any answer.
06-04-2018, 09:20 AM   #2
DaltonST
Deviser
Posts: 1,107
Karma: 100494
Join Date: Aug 2013
Location: Texas
Device: 8" Win10 Tablet w/Calibre64
Run the Count Pages plug-in, and then compare the number of pages to the Size (MB) shown in Calibre. Scanned (pure image) .pdf files often have only a few (e.g. 3-10) pages for a huge file size. Use the View Manager plug-in to Sort by pages, ascending, then Size, descending, to quickly compare.
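
As a rough illustration of that sort order outside Calibre, here is a minimal Python sketch. The `Book` record, the `sort_for_review` helper, and the sample titles and numbers are all hypothetical, made up for the example; they are not part of Calibre, Count Pages, or View Manager.

```python
from typing import List, NamedTuple


class Book(NamedTuple):
    """Hypothetical record holding the two columns being compared."""
    title: str
    pages: int
    size_mib: float


def sort_for_review(books: List[Book]) -> List[Book]:
    """Sort by page count ascending, then by file size descending.

    Books with few pages but a large file size end up at the top,
    which is the pattern suggested for spotting pure-image files.
    """
    return sorted(books, key=lambda b: (b.pages, -b.size_mib))


# Made-up sample data, for illustration only.
books = [
    Book("Novel (text)", 320, 1.2),
    Book("Scan A", 5, 48.0),
    Book("Pamphlet (text)", 5, 0.3),
    Book("Scan B", 8, 95.5),
]

for book in sort_for_review(books):
    print(f"{book.pages:>5} pages  {book.size_mib:>6.1f} MiB  {book.title}")
```

With this ordering, entries that combine a low page count with a large size are listed first, so they can be checked by eye.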



DaltonST

Last edited by DaltonST; 06-06-2018 at 06:27 PM. Reason: clarified scanned .pdf files
06-06-2018, 12:16 PM   #3
RotAnal
Dear Dalton,
thanks for the answer, but I am not sure I have understood your method correctly.
Let me give you an example of a book that I OCR'd with Acrobat:

Image book: size 31.1 MiB, 344 pages
OCR'd book: size 19.8 MiB, 344 pages

I therefore think this difference in size is too small to be significant for recognising the image books in my library (of course, I do NOT have both versions of a book at the same time!).
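
To put numbers on that comparison, the size-per-page figure for the two versions can be worked out directly. This is only a sketch; `mib_per_page` is a hypothetical helper name, and the inputs are the two figures quoted above.

```python
def mib_per_page(size_mib: float, pages: int) -> float:
    """Return the average file size per page, in MiB."""
    if pages <= 0:
        raise ValueError("pages must be a positive integer")
    return size_mib / pages


# Figures taken from the example above (same book, 344 pages each).
image_ratio = mib_per_page(31.1, 344)
ocr_ratio = mib_per_page(19.8, 344)

print(f"Image book: {image_ratio:.4f} MiB per page")
print(f"OCR'd book: {ocr_ratio:.4f} MiB per page")
print(f"Image / OCR'd: {image_ratio / ocr_ratio:.2f}x")
```

Both versions come out at well under 0.1 MiB per page, and the image version is only about 1.6 times larger, which is why a size-versus-pages comparison does not separate the two cleanly in this case.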
06-06-2018, 06:54 PM   #4
DaltonST
Scanned books are pure images, so do not have pagination like "real" books, since they have little "real" text. They are images of scanned text.

OCR'd books should have many more pages than their Scanned versions, since the whole purpose of OCR is to "read" the Scanned images and convert the information contained in the images into "real" text.

Your example of the same book in 2 formats (scanned images versus OCR) having the identical page-count is astonishing unless something is peculiar about the composition of the scanned book and then how it was OCR'd.

Perhaps what you believe to be OCR'd books are actually scanned books that have had their images compressed, so the file sizes are smaller yet the page-counts are similar.

Perhaps the OCR process was not successful, but the OCR program compressed the scanned images anyway during the creation of the output file of the OCR process.

In my libraries, I can easily tell a "scanned" epub from a "real" epub, because the "scanned" version has only a very few pages, but a huge file size. The "real" epubs may have many "pictures" causing them to be large in file size, but they also have a large number of pages.

Also, using the Library Codes plug-in to automatically extract an ISSN from a "scanned" file will fail, just as using the Job Spy plug-in to extract the Translator and Original Title will fail, but will succeed for a textual version (which you refer to as OCR'd).
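
The same idea, that text extraction fails on a pure-image file and succeeds on a textual one, can be tested directly. The sketch below is an assumption-laden illustration, not a Calibre plug-in: it assumes the Poppler `pdftotext` command-line tool is installed, and the function names and the thresholds (`min_chars`, `min_fraction`) are hypothetical choices that would need tuning on a real library.

```python
import subprocess
from typing import List


def pages_from_pdftotext(pdf_path: str) -> List[str]:
    """Return the extracted text of each page.

    Assumes the Poppler `pdftotext` tool is on the PATH. Passing "-" as
    the output file makes it write to stdout, with pages separated by a
    form feed character.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    pages = text.split("\f")
    # Drop the empty string left after the final form feed.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def looks_text_based(pages: List[str],
                     min_chars: int = 50,
                     min_fraction: float = 0.5) -> bool:
    """Guess whether a book has a usable text layer.

    A page counts as "textual" if it yields at least `min_chars`
    letters or digits; the book counts as text-based if at least
    `min_fraction` of its pages are textual. Both thresholds are
    arbitrary assumptions.
    """
    if not pages:
        return False
    textual = sum(
        1 for page in pages
        if sum(ch.isalnum() for ch in page) >= min_chars
    )
    return textual / len(pages) >= min_fraction
```

A call such as `looks_text_based(pages_from_pdftotext("book.pdf"))` (with a hypothetical file name) could then be looped over a folder of files. A scan that carries only a stamped page number or watermark as text would still fall below the threshold, which is the reason for counting characters per page rather than checking for any text at all.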

Perhaps running all of your .pdf files through the same .pdf compression utility will provide some clarity, since that particular variable will then have been turned into a constant.


Well, regardless, the above are my thoughts. Hope they help you figure things out.



DaltonST
All times are GMT -4.

