|  07-30-2014, 07:59 AM | #16 | |
| Banned            Posts: 488 Karma: 1080260 Join Date: Sep 2012 Device: sony prs t1 kindle dx ipad | Quote: 
 5x8 PDFs, i.e. A5-formatted, can be viewed in landscape (margins cropped), two or three screens per page, on a 6" reader, or in portrait on a 10" one. For A4-formatted PDFs, the best solution is 10" landscape (margins cropped if necessary), again two or three screens per page. To be able to use highlighting, search, dictionary etc., the PDF must be OCR-ed beforehand. If our e-ink reader's PDF zooming capabilities are not good or quick enough for our liking, we can use k2pdfopt to adjust the PDF easily and quickly for our reader beforehand. http://www.willus.com/k2pdfopt/ A PDF can easily be disassembled back into images if our PDF reader, editor or tool has that function, i.e. can save or export PDF page(s) as images. Last edited by markom; 07-30-2014 at 09:04 AM. | |
|   |   | 
|  08-01-2014, 03:15 AM | #17 | 
| Addict            Posts: 272 Karma: 8000000 Join Date: Oct 2010 Location: Corvallis, OR Device: Kindle PW2, iPad Pro | 
			
			I usually keep everything in PDF with OCR under the image page.  They tend to be huge files but you can search and the pages are exact.  This is especially true of cookbooks for me.
		 | 
|   |   | 
|  08-01-2014, 08:03 AM | #18 | |
| Banned            Posts: 488 Karma: 1080260 Join Date: Sep 2012 Device: sony prs t1 kindle dx ipad | Quote: 
 http://blogs.adobe.com/acrolaw/2009/...rscan_is_smal/ But it could then take 3-4 seconds for slower ereaders to turn the next page, so in that case a bigger PDF (a lower compression method) is a better idea if we want the next page to turn faster, i.e. in about one second. Last edited by markom; 08-01-2014 at 11:50 AM. | |
|   |   | 
|  08-08-2014, 08:07 AM | #19 | 
| Addict            Posts: 272 Karma: 8000000 Join Date: Oct 2010 Location: Corvallis, OR Device: Kindle PW2, iPad Pro | 
			
			I do a lot of color or greyscale, so they are more like 20 or 30 MB. The new iPad Air with a good reader handles them fine. I like greyscale as it is smoother and more like the original text than b/w.
		 | 
|   |   | 
|  08-08-2014, 09:34 AM | #20 | 
| Color me gone            Posts: 2,089 Karma: 1445295 Join Date: Apr 2008 Location: Central Oregon Coast Device: PRS-300 | 
			
			It is fine to use large images if you only intend they be used on tablets which have a lot more horsepower than ereaders. If they are just personal, knock yourself out. If they are commercial, you might irritate readers who will focus on the slowness instead of the content. | 
|   |   | 
|  08-15-2014, 11:43 AM | #21 | |
| Addict            Posts: 272 Karma: 8000000 Join Date: Oct 2010 Location: Corvallis, OR Device: Kindle PW2, iPad Pro | Quote: 
 | |
|   |   | 
|  09-12-2014, 06:03 AM | #22 | 
| Banned            Posts: 28 Karma: 31454 Join Date: Sep 2014 Location: France Device: Kindle 3 | 
			
			Any experience with libre software?
		 | 
|   |   | 
|  09-12-2014, 07:15 AM | #23 | 
| Wizard            Posts: 4,520 Karma: 121692313 Join Date: Oct 2009 Location: Heemskerk, NL Device: PRS-T1, Kobo Touch, Kobo Aura | 
			
			It is not required to ask for free software in all threads. We got the message already. In this case it is easy: GIMP, ImageMagick, Poppler and many more can extract images from PDF. OCR is a different beast altogether.
		 | 
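As a rough illustration of the image-extraction route mentioned above, here is a minimal sketch using Poppler's `pdfimages` command-line tool. The function name `extract_pdf_images`, the default output directory `pages` and the `img` file prefix are made up for this example; only `pdfimages -png` itself comes from Poppler, and it has to be installed separately.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): dump the images embedded in a
# PDF into a directory, using Poppler's pdfimages.
# Usage: extract_pdf_images FILE.pdf [OUTDIR]
extract_pdf_images() {
    pdf=$1
    outdir=${2:-pages}          # assumed default directory name

    # Fail early with a clear message instead of a cryptic tool error.
    [ -f "$pdf" ] || { echo "no such file: $pdf" >&2; return 1; }
    command -v pdfimages >/dev/null 2>&1 ||
        { echo "pdfimages (Poppler) is not installed" >&2; return 127; }

    mkdir -p "$outdir" &&
        pdfimages -png "$pdf" "$outdir/img"   # writes img-000.png, img-001.png, ...
}
```

Calling `extract_pdf_images book.pdf scans` would leave one PNG per embedded image in `scans/`. For a scanned book that is usually one image per page, but a page can also contain several images or none.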
|   |   | 
|  09-13-2014, 02:08 AM | #24 | |
| Banned            Posts: 28 Karma: 31454 Join Date: Sep 2014 Location: France Device: Kindle 3 | Quote: 
 ImageMagick? No OCR. Poppler OCR? Not that I know. But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked? | |
|   |   | 
|  09-13-2014, 02:33 AM | #25 | 
| Wizard            Posts: 4,520 Karma: 121692313 Join Date: Oct 2009 Location: Heemskerk, NL Device: PRS-T1, Kobo Touch, Kobo Aura | 
			
			Fine. Google free OCR. Enough hits for you. One is even called FreeOCR. Knock yourself out.
		 | 
|   |   | 
|  09-13-2014, 02:34 AM | #26 | |
| Wizard            Posts: 2,306 Karma: 13057279 Join Date: Jul 2012 Device: Kobo Forma, Nook | Quote: 
 https://www.mobileread.com/forums/sho...2&postcount=13 Here is the Wikipedia link again: https://en.wikipedia.org/wiki/Compar...ition_software Most likely the only free OCR of note would be Tesseract (and most of the free OCR programs out there would use a, most likely outdated, version of Tesseract in the backend). I already explained many of the disadvantages of the free solutions above. Although you are free to read the Tesseract documentation and do much of the training/tweaking needed. I personally would just err on the side of the paid OCR programs, ESPECIALLY when dealing with non-English works, or works with lots of accented characters. While the proprietary OCR programs are not zero dollars initially, they would save you A TON of time in all of your post-OCR processing steps (which is where you WILL spend most of your time). The more accurate/clean you can get your input, the less time you will have to spend cleaning it up and getting the document into a readable state. Besides that, you can use GIMP/Inkscape/ImageMagick to manipulate the images just fine.  I prefer using all free software over proprietary whenever I can, but sadly, OCR is just one area where the free solutions don't hold a candle to the paid ones. Last edited by Tex2002ans; 09-13-2014 at 02:37 AM. | |
|   |   | 
|  09-13-2014, 02:35 AM | #27 | 
| frumious Bandersnatch            Posts: 7,570 Karma: 20150435 Join Date: Jan 2008 Location: Spaniard in Sweden Device: Cybook Orizon, Kobo Aura | Moderator Notice Before this thread degrades into name calling and uncivil behaviour, please everybody think twice. | 
|   |   | 
|  09-13-2014, 03:48 AM | #28 | ||||
| Banned            Posts: 28 Karma: 31454 Join Date: Sep 2014 Location: France Device: Kindle 3 | Quote: 
 Quote: 
 Quote: 
 Quote: 
 | ||||
|   |   | 
|  09-13-2014, 05:39 AM | #29 | ||
| Wizard            Posts: 2,306 Karma: 13057279 Join Date: Jul 2012 Device: Kobo Forma, Nook | Quote: 
 If you want free, Tesseract is probably the best bet. Quote: 
 Also, keep in mind that Scan Tailor was really only built as a MIDDLEWARE program, to fit into a workflow like this: Dirty/Speckled/Warped/Crappy scans/photos -> Scan Tailor -> OCR program. It was made to try to clean up the images, so that OCR can (potentially) be more accurate. The only thing I have found that Scan Tailor does better than FineReader is handling speckled documents, although with all of the negative baggage that comes with Scan Tailor, I have settled on cleaning speckles directly using ImageMagick. Last edited by Tex2002ans; 09-13-2014 at 05:45 AM. | ||
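The "clean up first, then OCR" workflow described above might be sketched like this with free tools only. This is an assumption-laden illustration, not the poster's actual recipe: the post does not say which ImageMagick operator is used for speckles, so `-despeckle` is a guess at a reasonable starting point, and the function name `clean_and_ocr` and the `.clean.tif` suffix are invented for the example. Tesseract's `-l` option selects the language and requires the matching language data to be installed.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): despeckle one scanned page with
# ImageMagick, then run Tesseract on the cleaned image.
# Usage: clean_and_ocr PAGE.tif OUTBASE [LANG]
clean_and_ocr() {
    image=$1
    outbase=$2                  # Tesseract writes OUTBASE.txt
    lang=${3:-eng}              # e.g. "fra" for French, if that data is installed

    [ -f "$image" ] || { echo "no such file: $image" >&2; return 1; }

    cleaned="${outbase}.clean.tif"   # assumed name for the intermediate image
    # -despeckle is one built-in noise filter; heavier speckling may need
    # other operators or repeated passes.
    convert "$image" -despeckle "$cleaned" &&
        tesseract "$cleaned" "$outbase" -l "$lang"
}
```

For a French page, `clean_and_ocr page.tif page fra` would produce `page.clean.tif` and `page.txt`. Whether the cleaning step helps accuracy depends heavily on the scan, so it is worth comparing the output with and without it on a few pages.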
|   |   | 
|  09-13-2014, 06:32 AM | #30 | |
| Wizard            Posts: 3,465 Karma: 10684861 Join Date: May 2006 Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20 | Quote: 
 Step 2 - use the convert program from ImageMagick to split the PDF into bitmaps. Works on Windows, Linux, Mac
 Step 3 - use Tesseract, an open source OCR engine. Works on Windows, Linux, Mac
 Step 4 - use an advanced text editor (in my case Vim) to reformat the text, whose lines are broken by default. The paragraphs are separated by empty lines, so it is very easy to join all the lines that are not separated by an empty line. Works on Windows, Linux, Mac
 Commands for Vim in normal mode (press Escape twice to get there):
 :set tw=10000
 gggqG
 [gg means "go to the first line of the file", gq means "rewrap the text to the text width set with the previous command :set tw=10000", G means "to the end of the file"]
 Tesseract is now being developed by Google; it used to be a heavy-duty commercial OCR engine http://en.wikipedia.org/wiki/Tesseract_%28software%29
 I have used this to OCR files on Linux. For steps 2 and 3 I have used the following script:
 Code:
#!/bin/sh
STARTPAGE=13      # number of the first PDF page you wish to convert
ENDPAGE=253       # number of the last PDF page you wish to convert
SOURCE=book.pdf   # file name of the PDF
OUTPUT=book.txt   # final output file
RESOLUTION=600    # resolution the scanner used (the higher, the better)
touch "$OUTPUT"   # create the output file if missing; pages are appended to it
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    # ImageMagick counts PDF pages from 0, hence the "- 1".
    # -density must come before the input file so it applies while rendering.
    convert -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" -monochrome page.tif
    echo "processing page [$((i - 1))]"
    tesseract page.tif tempoutput          # writes tempoutput.txt
    cat tempoutput.txt >> "$OUTPUT"
done
 Please note that the output from Tesseract is a plain txt file that doesn't contain formatting info, such as bold or italics, which other [commercial] programs can produce. An example of a really good commercial program is Abbyy FineReader. At work I use a very old version of Recognita that doesn't process PDF, so I have to convert with ImageMagick first. But it was bundled with a scanner that we purchased a very long time ago. At work I also use Readiris, which does process PDF; it was bundled with an HP scanner/printer/copier/fax combo some 5 years ago. Last edited by kacir; 09-13-2014 at 06:59 AM. | |
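For readers who would rather not use Vim for Step 4, the same paragraph-joining idea can be sketched with awk. This is a swapped-in alternative, not the method from the post: the function name `join_paragraphs` is made up here, and it assumes, as the post does, that paragraphs in the OCR output are separated by empty lines.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read text on stdin and print
# each blank-line-separated paragraph as a single line.
join_paragraphs() {
    awk '
        /^[[:space:]]*$/ {                 # blank line = paragraph boundary
            if (para != "") { print para; print ""; para = "" }
            next
        }
        { para = (para == "" ? $0 : para " " $0) }   # glue line onto paragraph
        END { if (para != "") print para }           # flush the last paragraph
    '
}
```

Running `join_paragraphs < book.txt > book-joined.txt` keeps one empty line between paragraphs. Words hyphenated across a line break are not rejoined by this sketch, so they still need a manual or scripted pass afterwards.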
|   |   | 