Can you OCR the images inside of .pdf files? - Page 2

markom · 07-30-2014, 07:59 AM

Quote:

Originally Posted by klmmc13

...
While a lot of the Homeschool crowd are printing PDFs and binding at 5x8; usually it's because they don't have other options. I want to present them with another option -- most, if not all of that year's books ready to load onto a reader.
(When my kids were Homeschooling, they'd much rather have their books on the readers, than a half-size notebook or home-bound edition.)

Again, Thanks All
Kathy MamaDragon

I don't convert PDFs to EPUB/MOBI or print them on paper, because I can read them easily on 6" and 10" e-ink readers or tablets and (even more importantly to me) without any visible OCR errors or additional time-consuming, tedious labour.

5x8 PDFs, i.e. A5-formatted, can be viewed in landscape (margins cropped), two or three screens per page, on a 6" reader, or in portrait on a 10" one.

For A4-formatted PDFs, the best solution is a 10" reader in landscape (margins cropped if necessary), again two or three screens per page.

To be able to use highlighting, search, dictionary etc., the PDF must be OCRed beforehand.

If our e-ink reader's PDF zooming capabilities are not good or quick enough for our liking, we can use k2pdfopt beforehand to adapt the PDF to the reader easily and quickly.

http://www.willus.com/k2pdfopt/
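
A minimal command-line sketch of that k2pdfopt step. The function name `reflow_for_reader` and the default device profile `kpw` are assumptions made for illustration, not something from the post above; `-dev`, `-ui-` and `-x` are k2pdfopt options as I remember them, so please check them against the built-in help of your version before relying on this.

```sh
#!/bin/sh
# Hypothetical helper: run k2pdfopt non-interactively on one PDF.
reflow_for_reader() {
    pdf=$1
    device=${2:-kpw}   # assumed default device profile; pick the one for your reader
    # -ui- : do not open the GUI, -x : exit when the conversion is finished
    k2pdfopt -ui- -x -dev "$device" "$pdf"
}
```

Called as `reflow_for_reader book.pdf`, this should leave the source file alone and write a converted copy next to it.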

A PDF can easily be disassembled back into images if our PDF reader, editor or tool has that function, i.e. can save or export PDF page(s) as images.
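
On the command line, one way to do this disassembly is `pdfimages` from the Poppler utilities. A minimal sketch, assuming Poppler is installed; the function name `pdf_to_images`, the default `images` directory and the `img` file prefix are made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: dump the images embedded in a PDF as PNG files.
pdf_to_images() {
    pdf=$1
    outdir=${2:-images}
    mkdir -p "$outdir" || return 1
    # pdfimages names its output <prefix>-NNN.png, here images/img-000.png etc.
    pdfimages -png "$pdf" "$outdir/img"
}
```

This pulls out the images as they are stored in the file, which is usually what you want for a scanned book. It does not render pages that consist of real text, so for those a page renderer is the better tool.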

harriska2 · 08-01-2014, 03:15 AM

I usually keep everything in PDF with OCR under the image page. They tend to be huge files but you can search and the pages are exact. This is especially true of cookbooks for me.

markom · 08-01-2014, 08:03 AM

Quote:

Originally Posted by harriska2

I usually keep everything in PDF with OCR under the image page. They tend to be huge files but you can search and the pages are exact. This is especially true of cookbooks for me.

If a book doesn't have a lot of pictures, an average 500-page PDF (exact book image) should be under 10 MB, whether it was made with the newer Abbyy 11/12 (OCR under image) or Acrobat 11 (ClearScan).

http://blogs.adobe.com/acrolaw/2009/...rscan_is_smal/

But slower e-readers could then take 3-4 seconds to turn the page, so in that case a bigger PDF (a lighter compression method) is the better idea if we want page turns to be faster, i.e. about one second.

harriska2 · 08-08-2014, 08:07 AM

I do a lot of color or greyscale, so mine are more like 20 or 30 MB. The new iPad Air with a good reader handles them fine. I like greyscale as it is smoother and more like the original text than b/w.

mrmikel · 08-08-2014, 09:34 AM

It is fine to use large images if you only intend they be used on tablets which have a lot more horsepower than ereaders.

If they are just personal, knock yourself out. If they are commercial, you might irritate readers who will focus on the slowness instead of the content.

harriska2 · 08-15-2014, 11:43 AM

Quote:

Originally Posted by mrmikel

It is fine to use large images if you only intend they be used on tablets which have a lot more horsepower than ereaders.

If they are just personal, knock yourself out. If they are commercial, you might irritate readers who will focus on the slowness instead of the content.

Yeah, mine are just personal. I'm with you on irritating readers because of slowness.

shevirsy · 09-12-2014, 06:03 AM

Any experience with libre software?

Toxaris · 09-12-2014, 07:15 AM

It is not required to ask for free software in all threads. We got the message already. In this case it is easy: GIMP, ImageMagick, Poppler and many more can extract images from PDF. OCR is a different beast altogether.

shevirsy · 09-13-2014, 02:08 AM

Quote:

Originally Posted by Toxaris

It is not required to ask for free software in all threads. We got the message already. In this case it is easy: GIMP, ImageMagick, Poppler and many more can extract images from PDF. OCR is a different beast altogether.

GIMP? No OCR.
ImageMagick? No OCR.
Poppler OCR? Not that I know.

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?

Toxaris · 09-13-2014, 02:33 AM

Quote:

Originally Posted by shevirsy

GIMP? No OCR.
ImageMagick? No OCR.
Poppler OCR? Not that I know.

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?

Fine. Google free OCR. Enough hits for you. One is even called FreeOCR. Knock yourself out.

Tex2002ans · 09-13-2014, 02:34 AM

Quote:

Originally Posted by shevirsy

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?

I already linked to a Wikipedia article showing off a comparison of many different OCR programs in Post #13 right in this topic:

https://www.mobileread.com/forums/sho...2&postcount=13

Here is the Wikipedia link again:

https://en.wikipedia.org/wiki/Compar...ition_software

Most likely the only free OCR of note would be Tesseract (and most of the Free OCR programs out there would use (most likely an outdated) version of Tesseract in the backend).

I already explained many of the disadvantages of the free solutions above. Although you are free to read the Tesseract documentation and do much of the training/tweaking needed.

I personally would just err on the side of the paid OCR programs, ESPECIALLY when dealing with non-English works, or works with lots of accented characters. While the proprietary OCR programs are not zero dollars initially, they would save you A TON of time in all of your post-OCR processing steps (which is where you WILL spend most of your time). The more accurate/clean you can get your input, the LESS time you will have to spend cleaning it and getting the document into a readable state.

Besides that, you can use GIMP/Inkscape/Imagemagick in order to manipulate the images fine.

I prefer using all free software over proprietary whenever I can, but sadly, OCR is just one area where the free solutions don't hold much of a candle.

Jellby · 09-13-2014, 02:35 AM

Moderator Notice
Before this thread degrades into name calling and uncivil behaviour, please everybody think twice.

shevirsy · 09-13-2014, 03:48 AM

Quote:

Originally Posted by Tex2002ans

I already linked to a Wikipedia article showing off a comparison of many different OCR programs in Post #13 right in this topic:

Thanks for the links. I know the Wikipedia article. It's depressing. Abbyy FineReader and that's about all. I was hoping for some missed gem.

Quote:

Most likely the only free OCR of note would be Tesseract (and most of the Free OCR programs out there would use (most likely an outdated) version of Tesseract in the backend).

I already explained many of the disadvantages of the free solutions above. Although you are free to read the Tesseract documentation and do much of the training/tweaking needed.

I stay away from these tools made to help front ends. I need a front-end, I am not the front-end. I try OCRfeeder. When it comes to a few pages, it can be better than typing the pages.

Quote:

I personally would just err on the side of the paid OCR programs, ESPECIALLY when dealing with non-English works, or works with lots of accented characters. While the proprietary OCR programs are not zero dollars initially, they would save you A TON of time in all of your post-OCR processing steps (which is where you WILL spend most of your time). The more accurate/clean you can get your input, the LESS time you will have to spend cleaning it and getting the document into a readable state.

You do have a point.

Quote:

Besides that, you can use GIMP/Inkscape/Imagemagick in order to manipulate the images fine.

I prefer using all free software over proprietary whenever I can, but sadly, OCR is just one area where the free solutions don't hold much of a candle.

Well, Scan Tailor might help a lot more. But Jellby is right and I won't call names the users who just groom their post count. I just say "bye bye" and add them on ignore.

Tex2002ans · 09-13-2014, 05:39 AM

Quote:

Originally Posted by shevirsy

Thanks for the links. I know the Wikipedia article. It's depressing. Abbyy FineReader and that's about all. I was hoping for some missed gem.

There are a few proprietary programs that aren't on that list, and it does seem like that Wikipedia comparison COULD use some updating (for example, it says FineReader's latest version is 11, when 12 came out earlier this year).

If you want free, Tesseract is probably the best bet.

Quote:

Originally Posted by shevirsy

Well, Scan Tailor might help a lot more. But Jellby is right and I won't call names the users who just groom their post count. I just say "bye bye" and add them on ignore.

I used Scan Tailor when I first started to get into this, but now I lean in favor of the tools just built directly in Finereader. I find that Scan Tailor manipulated the original source images a little TOO much for my liking. (Another reason to lean towards the proprietary programs instead of free, a lot of the image manipulation tools are built-in, and allow easy tweaks/comparisons with the original source, while with something like Tesseract, you will get ONLY the OCR portion).

Also, keep in mind that Scan Tailor was really only built as a MIDDLEWARE program, to fit into a workflow like this:

Dirty/Speckled/Warped/Crappy scans/photos -> Scan Tailor -> OCR program.

It was made to try to clean up the images, so that OCR can (potentially) be more accurate.

The only thing I have found that Scan Tailor does better than FineReader is handling speckled documents, but given all of the negative baggage that comes with Scan Tailor, I have settled on cleaning speckles directly with ImageMagick.
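
The post doesn't say which ImageMagick options are used for this, so the following is only a guess at what such a cleaning step could look like: ImageMagick's `-despeckle` operator applied to a single page image. The function name `despeckle_page` is made up, and badly speckled scans may need several passes or a different filter.

```sh
#!/bin/sh
# Hypothetical helper: remove speckles from one scanned page image.
despeckle_page() {
    src=$1
    dst=$2
    # Read the image first, then apply the operator, then write the result.
    convert "$src" -despeckle "$dst"
}
```

Writing to a separate output file, e.g. `despeckle_page page.tif page-clean.tif`, keeps the original scan around for comparison.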

kacir · 09-13-2014, 06:32 AM

Quote:

Originally Posted by shevirsy

GIMP? No OCR.
ImageMagick? No OCR.
Poppler OCR? Not that I know.

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?

Step 1 - crop the PDF (margins with page numbers, everything you do not want to OCR) using pdfscissors, a small Java program run directly from the net. Works on Windows, Linux, Mac
Step 2 - use the convert program from ImageMagick to split the PDF into bitmaps. Works on Windows, Linux, Mac
Step 3 - use Tesseract, an open source OCR. Works on Windows, Linux, Mac
Step 4 - use an advanced text editor (in my case Vim) to reformat the text, which comes out with broken lines by default. The paragraphs are separated by empty lines, so it is very easy to join all the lines that are not separated by an empty line. Works on Windows, Linux, Mac
(Commands for Vim in normal mode [press Escape twice to get there]:
:set tw=10000
gggqG
[gg means go to the first line of file
gq means "rewrap the text to text width set with previous command :set tw=10000"
G means "to the end of the file"]
)
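
The same joining can be done without an interactive editor. A minimal sketch using awk's paragraph mode; this is an alternative to the Vim commands above rather than part of the original recipe, and the function name `join_paragraphs` is made up for illustration.

```sh
#!/bin/sh
# Join the broken lines of each paragraph into a single line.
# Paragraphs are blocks of text separated by one or more empty lines.
join_paragraphs() {
    # RS = "" switches awk to paragraph mode: each record is one paragraph.
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         { gsub(/[ \t]*\n[ \t]*/, " "); print }'
}
```

Used as `join_paragraphs < book.txt > book-joined.txt`. Like the Vim version, it does not repair words that were hyphenated across a line break; those still have to be fixed by hand or with a separate search and replace.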

Tesseract is now being developed by Google - it used to be a heavy-duty commercial OCR engine
http://en.wikipedia.org/wiki/Tesseract_%28software%29

I have used this to OCR files on Linux.
For steps 2 and 3 I have used the following script:

Code:

#!/bin/sh
STARTPAGE=13    # set to page number of the first page of PDF you wish to convert
ENDPAGE=253     # set to page number of the last page of PDF you wish to convert
SOURCE=book.pdf # set to the file name of the PDF
OUTPUT=book.txt # set to the final output file
RESOLUTION=600  # set to the resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    # ImageMagick counts PDF pages from 0, hence the "- 1"
    echo "processing page $i (PDF index $((i - 1)))"
    # -density has to come before the PDF is read, -monochrome after it
    convert -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" -monochrome page.tif
    tesseract page.tif tempoutput    # writes tempoutput.txt
    cat tempoutput.txt >> "$OUTPUT"
done

There are various graphical front-ends for Tesseract, so you do not HAVE to use the command line. But this is what worked for me.

Please note that the output from Tesseract is a txt file that doesn't contain formatting info, such as bold or italics, that other [commercial] programs can produce. An example of a really good commercial program is Abbyy FineReader.
At work I use a very old version of Recognita that doesn't process PDF, so I have to convert with ImageMagick first. But it was bundled with a scanner that we purchased a very long time ago.
At work I also use Readiris, which does process PDF; it was bundled with an HP scanner/printer/copier/fax combo some 5 years ago.
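
Tesseract can also write something other than plain text. A minimal sketch, assuming Tesseract 3.03 or newer (where the `pdf` output config is available) and an installed language pack; the function name `ocr_to_pdf` and its default language are made up for illustration. The result keeps the page image and puts an invisible text layer underneath, which is the "OCR under the image" kind of PDF discussed earlier in this thread.

```sh
#!/bin/sh
# Hypothetical helper: OCR one page image into a searchable PDF.
ocr_to_pdf() {
    image=$1
    outbase=$2         # Tesseract appends the .pdf extension itself
    lang=${3:-eng}     # language pack to use, e.g. fra or deu for non-English books
    tesseract "$image" "$outbase" -l "$lang" pdf
}
```

For example, `ocr_to_pdf page.tif page fra` would OCR a French page into page.pdf. Bold and italics are still not detected this way; only the text and its position on the page are kept.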
