#16
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad

Quote:
5x8 PDFs, i.e. A5-formatted, can be viewed in landscape (margins cropped) at two or three screens per page on a 6" reader, or in portrait on a 10" reader. For A4-formatted PDFs, the best solution is 10" landscape (margins cropped if necessary), again at two or three screens per page.

To be able to use highlighting, search, dictionary etc., the PDF must be OCR-ed beforehand.

If our e-ink reader's PDF zooming capabilities are not good or quick enough for our liking, we can use k2pdfopt to adapt the PDF for our reader quickly and easily beforehand. http://www.willus.com/k2pdfopt/

A PDF can easily be disassembled back into images if our PDF reader, editor or tool has that function, i.e. can save or export PDF page(s) as images.

Last edited by markom; 07-30-2014 at 09:04 AM.

#17
Addict
Posts: 272
Karma: 8000000
Join Date: Oct 2010
Location: Corvallis, OR
Device: Kindle PW2, iPad Pro

I usually keep everything in PDF with OCR under the image page. They tend to be huge files but you can search and the pages are exact. This is especially true of cookbooks for me.

#18
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad

Quote:
http://blogs.adobe.com/acrolaw/2009/...rscan_is_smal/

But a slower e-reader could then take 3-4 seconds to turn to the next page, so in that case a bigger PDF (a less aggressive compression method) is the better idea if we want faster page turns, i.e. about one second.

Last edited by markom; 08-01-2014 at 11:50 AM.

#19
Addict
Posts: 272
Karma: 8000000
Join Date: Oct 2010
Location: Corvallis, OR
Device: Kindle PW2, iPad Pro

I do a lot of color or greyscale, so my files are more like 20 or 30 MB. The new iPad Air with a good reader app handles them fine. I like greyscale as it is smoother and more like the original text than b/w.

#20
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300

It is fine to use large images if you only intend them to be used on tablets, which have a lot more horsepower than e-readers.
If they are just personal, knock yourself out. If they are commercial, you might irritate readers who will focus on the slowness instead of the content.

#21
Addict
Posts: 272
Karma: 8000000
Join Date: Oct 2010
Location: Corvallis, OR
Device: Kindle PW2, iPad Pro

Quote:

#22
Banned
Posts: 28
Karma: 31454
Join Date: Sep 2014
Location: France
Device: Kindle 3

Any experience with libre software?

#23
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura

It is not required to ask for free software in all threads. We got the message already. In this case it is easy: GIMP, ImageMagick, Poppler and many more can extract images from a PDF. OCR is a different beast altogether.
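
As an illustration of the image-extraction part, Poppler ships a `pdfimages` command-line tool. The following is only a minimal sketch, assuming Poppler is installed; the function name `extract_pdf_images` and the default output directory `pages` are made up for this example and do not come from the thread.

```shell
#!/bin/sh
# Hypothetical helper (name and defaults are assumptions, not from the thread):
# dump the images embedded in a PDF using Poppler's pdfimages tool.
# Usage: extract_pdf_images book.pdf [outdir]
extract_pdf_images() {
    pdf=$1
    outdir=${2:-pages}

    # Refuse to run on a missing input file.
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 2
    fi
    # pdfimages is part of Poppler; bail out if it is not on the PATH.
    if ! command -v pdfimages >/dev/null 2>&1; then
        echo "pdfimages (Poppler) is not installed" >&2
        return 127
    fi

    mkdir -p "$outdir" || return 1
    # -png writes PNG files instead of the default PBM/PPM output.
    pdfimages -png "$pdf" "$outdir/page"
}
```

Note that this pulls out the images stored inside the PDF; it does not render pages, so a page built from several image layers will produce several files.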

#24
Banned
Posts: 28
Karma: 31454
Join Date: Sep 2014
Location: France
Device: Kindle 3

Quote:
ImageMagick? No OCR. Poppler OCR? Not that I know of. But if you say "we got the message already", would you PLEASE answer the questions I asked before you answer the ones I haven't asked?

#25
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura

Fine. Google free OCR. Enough hits for you. One is even called FreeOCR. Knock yourself out.

#26
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook

Quote:
https://www.mobileread.com/forums/sho...2&postcount=13

Here is the Wikipedia link again: https://en.wikipedia.org/wiki/Compar...ition_software

Most likely the only free OCR of note would be Tesseract (and most of the free OCR programs out there would use a (most likely outdated) version of Tesseract in the backend).

I already explained many of the disadvantages of the free solutions above, although you are free to read the Tesseract documentation and do much of the training/tweaking needed.

I personally would just err on the side of the paid OCR programs, ESPECIALLY when dealing with non-English works, or works with lots of accented characters.

While the proprietary OCR programs are not zero dollars initially, they would save you A TON of time in all of your post-OCR processing steps (which is where you WILL spend most of your time). The more accurate/clean you can get your input, the less time you will have to spend cleaning it up and getting the document into a readable state.

Besides that, you can use GIMP/Inkscape/ImageMagick to manipulate the images just fine.

Last edited by Tex2002ans; 09-13-2014 at 02:37 AM.
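
For anyone who does try Tesseract on non-English text, the language has to be selected with `-l`. This is a rough sketch only, assuming the `tesseract` command-line program and the matching language data (for example `fra` for French) are installed; the function name `ocr_page` is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the thread):
# OCR a single page image with an explicit language setting.
# Usage: ocr_page page.tif outbase [langs]
#   langs is a Tesseract language code such as "fra", or several joined
#   with "+", such as "eng+fra". Defaults to "eng".
ocr_page() {
    image=$1
    base=$2
    langs=${3:-eng}

    if [ ! -f "$image" ]; then
        echo "no such image: $image" >&2
        return 2
    fi
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not installed" >&2
        return 127
    fi

    # Tesseract appends .txt itself, so this writes "$base.txt".
    tesseract "$image" "$base" -l "$langs"
}
```

`tesseract --list-langs` shows which language packs are actually available on the machine. Picking the right language mainly helps with accented characters; it does not remove the need for proofreading.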

#27
frumious Bandersnatch
Posts: 7,544
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura

Moderator Notice
Before this thread degrades into name calling and uncivil behaviour, please everybody think twice. |

#28
Banned
Posts: 28
Karma: 31454
Join Date: Sep 2014
Location: France
Device: Kindle 3

Quote:
Quote:
Quote:
Quote:

#29
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook

Quote:
If you want free, Tesseract is probably the best bet.

Quote:

Also, keep in mind that Scan Tailor was really only built as a MIDDLEWARE program, to fit into a workflow like this:

Dirty/Speckled/Warped/Crappy scans/photos -> Scan Tailor -> OCR program.

It was made to try to clean up the images, so that OCR can (potentially) be more accurate.

The only thing I have found that Scan Tailor does better than FineReader is handling speckled documents, although with all of the negative baggage that comes with Scan Tailor, I have settled on cleaning speckles directly using ImageMagick.

Last edited by Tex2002ans; 09-13-2014 at 05:45 AM.
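
The post does not say which ImageMagick options were used for the speckle cleanup. One possibility, shown here purely as a sketch, is ImageMagick's `-despeckle` operator; it assumes the `convert` program is installed, and the function name `despeckle_scan` is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption; the thread does not give the
# actual options used): reduce speckle noise in a scanned page.
# Usage: despeckle_scan dirty.tif clean.tif
despeckle_scan() {
    src=$1
    dst=$2

    if [ ! -f "$src" ]; then
        echo "no such image: $src" >&2
        return 2
    fi
    if ! command -v convert >/dev/null 2>&1; then
        echo "ImageMagick convert is not installed" >&2
        return 127
    fi

    # -despeckle is ImageMagick's built-in speckle-noise reduction operator.
    convert "$src" -despeckle "$dst"
}
```

It is worth comparing the result against the original at full zoom before OCR, since any noise filter can also eat small punctuation marks.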

#30
Wizard
Posts: 3,463
Karma: 10684861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20

Quote:
Step 2 - use the convert program from ImageMagick to split the PDF into bitmaps. Works on Windows, Linux, Mac.

Step 3 - use Tesseract, an open source OCR engine. Works on Windows, Linux, Mac.

Step 4 - use an advanced text editor (in my case Vim) to reformat the text, which has broken lines by default. The paragraphs are separated by empty lines, so it is very easy to join all the lines that are not separated by an empty line. Works on Windows, Linux, Mac.

Commands for Vim in normal mode (press Escape twice to get there):

    :set tw=10000
    gggqG

(gg means "go to the first line of the file", gq means "rewrap the text to the text width set with the previous command :set tw=10000", and G means "to the end of the file".)

Tesseract is now being developed by Google; it used to be a heavy-duty commercial OCR engine. http://en.wikipedia.org/wiki/Tesseract_%28software%29

I have used this to OCR files on Linux. For steps 2 and 3 I used the following script:

Code:

    #!/bin/sh
    STARTPAGE=13      # page number of the first PDF page you wish to convert
    ENDPAGE=253       # page number of the last PDF page you wish to convert
    SOURCE=book.pdf   # file name of the PDF
    OUTPUT=book.txt   # final output file
    RESOLUTION=600    # resolution the scanner used (the higher, the better)

    touch "$OUTPUT"
    for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
        # ImageMagick numbers PDF pages from 0, hence the "- 1"
        convert -density "$RESOLUTION" "${SOURCE}[$((i - 1))]" -monochrome page.tif
        echo "processing page [$((i - 1))]"
        # tesseract appends .txt to the output name itself
        tesseract page.tif tempoutput
        cat tempoutput.txt >> "$OUTPUT"
    done

Please note that the output from Tesseract is a plain txt file that doesn't contain formatting info such as bold or italics, which other [commercial] programs can produce.

An example of a really good commercial program is ABBYY FineReader. At work I use a very old version of Recognita that doesn't process PDF, so I have to convert with ImageMagick; it was bundled with a scanner that we purchased a very long time ago. At work I also use Readiris, which does process PDF; it was bundled with an HP scanner/printer/copier/fax combo some 5 years ago.

Last edited by kacir; 09-13-2014 at 06:59 AM.
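
Step 4 can also be done from the shell instead of an editor. The following is a minimal sketch with awk, assuming (as described above) that paragraphs in the OCR output are separated by empty lines; the function name `join_paragraphs` is invented for the example, and it does not try to repair words hyphenated across line breaks.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the thread):
# join the broken lines of each paragraph into one long line.
# Reads text on stdin, writes to stdout; paragraphs stay separated
# by a single empty line.
# Usage: join_paragraphs < book.txt > book-joined.txt
join_paragraphs() {
    awk '
        # An empty (or whitespace-only) line ends the current paragraph.
        NF == 0 {
            if (para != "") { print para; print ""; para = "" }
            next
        }
        # Any other line is trimmed and appended to the paragraph.
        {
            gsub(/^[ \t]+|[ \t]+$/, "")
            para = (para == "" ? $0 : para " " $0)
        }
        # Flush the last paragraph if the file does not end with an empty line.
        END { if (para != "") print para }
    '
}
```

Runs of several empty lines are collapsed into one, which is usually what you want before importing the text into a converter.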