MobileRead Forums - View Single Post

DaltonST · 06-06-2018, 07:54 PM

Scanned books are pure images, so they do not have pagination like "real" books: they contain little or no "real" text, only images of scanned text.

OCR'd books should have many more pages than their scanned versions, since the whole purpose of OCR is to "read" the scanned images and convert the information they contain into "real" text.

Your example of the same book in 2 formats (scanned images versus OCR) having an identical page count is astonishing, unless something is peculiar about how the scanned book was composed and then how it was OCR'd.

Perhaps what you believe to be OCR'd books are actually scanned books that have had their images compressed, so the file sizes are smaller yet the page-counts are similar.

Perhaps the OCR process was not successful, but the OCR program compressed the scanned images anyway when it created its output file.

In my libraries, I can easily tell a "scanned" epub from a "real" epub, because the "scanned" version has only a few pages but a huge file size. The "real" epubs may contain many "pictures" that make them large in file size, but they also have a large number of pages.
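To make that rule of thumb concrete, here is a rough Python sketch that compares the amount of real text inside an epub against its file size. It is only an illustration, not how calibre or any plug-in counts pages. The function names and the `min_chars_per_mb` threshold are my own assumptions, so tune the threshold against your own library.

```python
import os
import re
import zipfile

# Crude tag stripper; good enough for a rough character count.
TAG_RE = re.compile(r"<[^>]+>")


def text_chars_in_epub(path):
    """Count non-whitespace text characters in the (X)HTML files of an epub."""
    total = 0
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                markup = zf.read(name).decode("utf-8", errors="ignore")
                text = TAG_RE.sub(" ", markup)
                total += len("".join(text.split()))
    return total


def looks_scanned(path, min_chars_per_mb=5000):
    """Guess whether an epub is mostly page images.

    min_chars_per_mb is a hypothetical threshold, not a calibre setting.
    """
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb == 0:
        return False
    return text_chars_in_epub(path) / size_mb < min_chars_per_mb
```

A "real" epub with lots of pictures still carries plenty of text per megabyte, so it should land above the threshold, while a scanned one lands near zero.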

Also, using the Library Codes plug-in to automatically extract an ISSN from a "scanned" file will fail, as will using the Job Spy plug-in to extract the Translator and Original Title. Both will succeed for a textual version (which you refer to as OCR'd).
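The reason is simple: text extraction needs text to search. As a minimal sketch (this is not the Library Codes code, and the function names are hypothetical), here is a generic ISSN search using a regular expression plus the standard ISSN check digit. Feed it the empty string you would get from a pure-image book and it has nothing to find.

```python
import re

# Four digits, a hyphen, three digits, then a check character (digit or X).
ISSN_RE = re.compile(r"\b(\d{4})-(\d{3})([\dXx])\b")


def is_valid_issn(first, second, check):
    """Verify the ISSN check digit (weights 8 down to 2, modulo 11)."""
    digits = first + second
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    expected = (11 - total % 11) % 11
    expected_char = "X" if expected == 10 else str(expected)
    return check.upper() == expected_char


def find_issn(text):
    """Return the first checksum-valid ISSN found in text, or None."""
    for match in ISSN_RE.finditer(text):
        if is_valid_issn(*match.groups()):
            return f"{match.group(1)}-{match.group(2)}{match.group(3).upper()}"
    return None
```

With text from an OCR'd or born-digital file, `find_issn` can return a match; with the empty text of a scanned file it returns `None`, which is the "fail" case described above.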

Perhaps running all of your .pdf files through the same .pdf compression utility will provide some clarity, since that would turn that particular variable into a constant.

Well, regardless, the above are my thoughts. Hope they help you figure things out.

DaltonST
