MobileRead Forums > E-Book Formats > Workshop
Old 07-30-2014, 07:59 AM   #16
markom
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
Quote:
Originally Posted by klmmc13 View Post
...
While a lot of the Homeschool crowd are printing PDFs and binding at 5x8, usually it's because they don't have other options. I want to present them with another option: most, if not all, of that year's books ready to load onto a reader.
(When my kids were Homeschooling, they'd much rather have their books on the readers, than a half-size notebook or home-bound edition.)

Again, Thanks All
Kathy MamaDragon
I don't convert PDFs to EPUB/MOBI or print them on paper, because I can read them easily on 6" and 10" e-ink readers or tablets and, even more importantly to me, without any visible OCR errors or time-consuming, tedious extra labour.

5x8 PDFs, i.e. roughly A5 format, can be viewed in landscape (margins cropped) at two or three screens per page on a 6" reader, or in portrait on a 10" reader.

For A4-format PDFs, the best solution is a 10" reader in landscape (margins cropped if necessary), again at two or three screens per page.

To be able to use highlighting, search, dictionary lookup etc., the PDF must be OCR-ed beforehand.

If our e-ink reader's PDF zooming is not good or quick enough for our liking, we can use k2pdfopt beforehand to adapt the PDF to the reader easily and quickly.

http://www.willus.com/k2pdfopt/

A PDF can easily be disassembled back into images if our PDF reader, editor or tool has that function, i.e. can save or export PDF pages as images.

Last edited by markom; 07-30-2014 at 09:04 AM.
Old 08-01-2014, 03:15 AM   #17
harriska2
Addict
Posts: 272
Karma: 8000000
Join Date: Oct 2010
Location: Corvallis, OR
Device: Kindle PW2, iPad Pro
I usually keep everything in PDF with OCR under the image page. They tend to be huge files but you can search and the pages are exact. This is especially true of cookbooks for me.
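For anyone wanting the same kind of file (page image on top, OCR text underneath) from free tools: recent Tesseract versions can write such a PDF directly through the pdf output config. A minimal sketch, assuming Tesseract is installed with PDF output support. The helper names and the -ocr suffix are made up for illustration, and this is not necessarily how the files described above were produced.

```shell
#!/bin/sh
# Sketch only: assumes a Tesseract build with PDF output support.

# output_base FILE -> output name without extension
# (hypothetical naming; Tesseract appends ".pdf" itself)
output_base() {
    echo "${1%.*}-ocr"
}

# make_searchable IMAGE [LANG]: page image on top, recognized text underneath
make_searchable() {
    tesseract "$1" "$(output_base "$1")" -l "${2:-eng}" pdf
}

# Run only when an image is given on the command line.
if [ "$#" -gt 0 ]; then
    make_searchable "$@"
fi
```

Called as "sh searchable.sh page.tif fra" (the script name is arbitrary), this would write page-ocr.pdf using the French language data, provided that data is installed.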
Old 08-01-2014, 08:03 AM   #18
markom
Banned
Posts: 488
Karma: 1080260
Join Date: Sep 2012
Device: sony prs t1 kindle dx ipad
Quote:
Originally Posted by harriska2 View Post
I usually keep everything in PDF with OCR under the image page. They tend to be huge files but you can search and the pages are exact. This is especially true of cookbooks for me.
If there are not a lot of pictures in a book, an average 500-page PDF (exact book image) should be under 10 MB, whether made with the newer ABBYY 11/12 (OCR under image) or Acrobat 11 (ClearScan).

http://blogs.adobe.com/acrolaw/2009/...rscan_is_smal/

But slower e-readers could then take 3-4 seconds to turn the page, so in that case a bigger PDF (a lighter compression method) is the better idea if we want faster page turns, i.e. about one second.
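As a rough sanity check on those numbers: 10 MB spread over 500 pages is about 20 KB per page. A minimal sketch of that arithmetic in shell, where the function name kb_per_page is made up for illustration, 1 MB is taken as 1024 KB, and the result is rounded down:

```shell
#!/bin/sh
# kb_per_page SIZE_MB PAGES
# Hypothetical helper: average kilobytes per page, rounded down.
kb_per_page() {
    echo $(( $1 * 1024 / $2 ))
}

kb_per_page 10 500   # 10 MB over 500 pages, prints 20
```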

Last edited by markom; 08-01-2014 at 11:50 AM.
Old 08-08-2014, 08:07 AM   #19
harriska2
Addict
Posts: 272
Karma: 8000000
Join Date: Oct 2010
Location: Corvallis, OR
Device: Kindle PW2, iPad Pro
I do a lot of color or greyscale, so mine are more like 20 or 30 MB. The new iPad Air with good reader handles them fine. I like greyscale because it is smoother and closer to the original text than b/w.
Old 08-08-2014, 09:34 AM   #20
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
It is fine to use large images if you only intend they be used on tablets which have a lot more horsepower than ereaders.

If they are just personal, knock yourself out. If they are commercial, you might irritate readers who will focus on the slowness instead of the content.
Old 08-15-2014, 11:43 AM   #21
harriska2
Addict
Posts: 272
Karma: 8000000
Join Date: Oct 2010
Location: Corvallis, OR
Device: Kindle PW2, iPad Pro
Quote:
Originally Posted by mrmikel View Post
It is fine to use large images if you only intend they be used on tablets which have a lot more horsepower than ereaders.

If they are just personal, knock yourself out. If they are commercial, you might irritate readers who will focus on the slowness instead of the content.
Yeah, mine are just personal. I'm with you on irritating readers because of slowness.
Old 09-12-2014, 06:03 AM   #22
shevirsy
Banned
Posts: 28
Karma: 31454
Join Date: Sep 2014
Location: France
Device: Kindle 3
Any experience with libre software?
Old 09-12-2014, 07:15 AM   #23
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
It is not required to ask for free software in all threads. We got the message already. In this case it is easy: GIMP, ImageMagick, Poppler and many more can extract images from PDF. OCR is a different beast altogether.
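As one concrete route for the image extraction, Poppler ships a pdfimages tool that writes out the images embedded in a PDF. A minimal sketch, assuming pdfimages is installed and on the PATH. The helper names and the "-img" prefix scheme are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes Poppler's pdfimages is installed and on PATH.

# image_prefix FILE.pdf -> output prefix derived from the PDF name
# (hypothetical naming scheme: "book.pdf" -> "book-img")
image_prefix() {
    base=$(basename "$1" .pdf)
    echo "${base}-img"
}

# extract_images FILE.pdf: write every embedded image as a PNG file
extract_images() {
    pdfimages -png "$1" "$(image_prefix "$1")"
}

# Run only when a PDF is given on the command line.
if [ "$#" -gt 0 ]; then
    extract_images "$1"
fi
```

Note that pdfimages pulls out the images as they are stored in the file. It does not render whole pages, so a scanned PDF that stores each page in several layers may come out in pieces.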
Old 09-13-2014, 02:08 AM   #24
shevirsy
Banned
Posts: 28
Karma: 31454
Join Date: Sep 2014
Location: France
Device: Kindle 3
Quote:
Originally Posted by Toxaris View Post
It is not required to ask for free software in all threads. We got the message already. In this case it is easy: GIMP, ImageMagick, Poppler and many more can extract images from PDF. OCR is a different beast altogether.
GIMP? No OCR.
ImageMagick? No OCR.
Poppler OCR? Not that I know.

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?
Old 09-13-2014, 02:33 AM   #25
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
Quote:
Originally Posted by shevirsy View Post
GIMP? No OCR.
ImageMagick? No OCR.
Poppler OCR? Not that I know.

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?
Fine. Google free OCR. Enough hits for you. One is even called FreeOCR. Knock yourself out.
Old 09-13-2014, 02:34 AM   #26
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by shevirsy View Post
But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?
I already linked to a Wikipedia article showing off a comparison of many different OCR programs in Post #13 right in this topic:

https://www.mobileread.com/forums/sho...2&postcount=13

Here is the Wikipedia link again:

https://en.wikipedia.org/wiki/Compar...ition_software

Most likely the only free OCR of note would be Tesseract, and most of the free OCR programs out there would use a (most likely outdated) version of Tesseract in the backend.

I already explained many of the disadvantages of the free solutions above. Although you are free to read the Tesseract documentation and do much of the training/tweaking needed.

I personally would just err on the side of the paid OCR programs, ESPECIALLY when dealing with non-English works, or works with lots of accented characters. While the proprietary OCR programs are not zero dollars initially, they would save you A TON of time in all of your post-OCR processing steps (which is where you WILL spend most of your time). The more accurate/clean you can get your input, the less time you will have to spend cleaning it and getting the document into a readable state.

Besides that, you can use GIMP/Inkscape/ImageMagick to manipulate the images just fine. I prefer using free software over proprietary whenever I can, but sadly, OCR is just one area where the free solutions can't hold a candle to the proprietary ones.

Last edited by Tex2002ans; 09-13-2014 at 02:37 AM.
Old 09-13-2014, 02:35 AM   #27
Jellby
frumious Bandersnatch
Posts: 7,544
Karma: 19001583
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Moderator Notice
Before this thread degrades into name calling and uncivil behaviour, please everybody think twice.
Old 09-13-2014, 03:48 AM   #28
shevirsy
Banned
Posts: 28
Karma: 31454
Join Date: Sep 2014
Location: France
Device: Kindle 3
Quote:
Originally Posted by Tex2002ans View Post
I already linked to a Wikipedia article showing off a comparison of many different OCR programs in Post #13 right in this topic:
Thanks for the links. I know the Wikipedia article. It's depressing. ABBYY FineReader and that's about all. I was hoping for some missed gem.

Quote:
Most likely the only free OCR of note would be Tesseract, and most of the free OCR programs out there would use a (most likely outdated) version of Tesseract in the backend.

I already explained many of the disadvantages of the free solutions above. Although you are free to read the Tesseract documentation and do much of the training/tweaking needed.
I stay away from these tools made to serve front-ends. I need a front-end; I am not the front-end. I am trying OCRFeeder. When it comes to a few pages, it can be better than typing the pages.

Quote:
I personally would just err on the side of the paid OCR programs, ESPECIALLY when dealing with non-English works, or works with lots of accented characters. While the proprietary OCR programs are not zero dollars initially, they would save you A TON of time in all of your post-OCR processing steps (which is where you WILL spend most of your time). The more accurate/clean you can get your input, the less time you will have to spend cleaning it and getting the document into a readable state.
You do have a point.

Quote:
Besides that, you can use GIMP/Inkscape/ImageMagick to manipulate the images just fine. I prefer using free software over proprietary whenever I can, but sadly, OCR is just one area where the free solutions can't hold a candle to the proprietary ones.
Well, Scan Tailor might help a lot more. But Jellby is right, and I won't call names at the users who just groom their post count. I just say "bye bye" and put them on ignore.
Old 09-13-2014, 05:39 AM   #29
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by shevirsy View Post
Thanks for the links. I know the Wikipedia article. It's depressing. ABBYY FineReader and that's about all. I was hoping for some missed gem.
There are a few proprietary programs that aren't on that list, and it does seem like that Wikipedia comparison COULD use some updating (for example, it says FineReader's latest version is 11, when 12 came out earlier this year).

If you want free, Tesseract is probably the best bet.

Quote:
Originally Posted by shevirsy View Post
Well, Scan Tailor might help a lot more. But Jellby is right, and I won't call names at the users who just groom their post count. I just say "bye bye" and put them on ignore.
I used Scan Tailor when I first started to get into this, but now I lean towards the tools built directly into FineReader. I find that Scan Tailor manipulates the original source images a little TOO much for my liking. (Another reason to lean towards the proprietary programs instead of free: a lot of the image manipulation tools are built in and allow easy tweaks/comparisons with the original source, while with something like Tesseract you get ONLY the OCR portion.)

Also, keep in mind that Scan Tailor was really only built as a MIDDLEWARE program, to fit into a workflow like this:

Dirty/Speckled/Warped/Crappy scans/photos -> Scan Tailor -> OCR program.

It was made to try to clean up the images, so that OCR can (potentially) be more accurate.

The only thing I have found that Scan Tailor does better than FineReader is handling speckled documents, but given all the negative baggage that comes with Scan Tailor, I have settled on cleaning speckles directly with ImageMagick.
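For readers who want to try speckle cleaning with ImageMagick: it has a built-in -despeckle operator, which is one possible starting point. The post does not say which operator or settings were actually used, so treat this as a sketch under that assumption. The helper names and the ".clean" naming scheme are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes ImageMagick's convert is installed.

# clean_name FILE -> name for the cleaned copy
# (hypothetical scheme: "page-001.tif" -> "page-001.clean.tif")
clean_name() {
    ext=${1##*.}
    stem=${1%.*}
    echo "${stem}.clean.${ext}"
}

# despeckle_page FILE: write a despeckled copy next to the original
despeckle_page() {
    convert "$1" -despeckle "$(clean_name "$1")"
}

# Run only when page images are given on the command line.
if [ "$#" -gt 0 ]; then
    for page in "$@"; do
        despeckle_page "$page"
    done
fi
```

Writing to a separate file keeps the original scan untouched, so the cleaned page can be compared against it before it goes on to OCR.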

Last edited by Tex2002ans; 09-13-2014 at 05:45 AM.
Old 09-13-2014, 06:32 AM   #30
kacir
Wizard
Posts: 3,463
Karma: 10684861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by shevirsy View Post
GIMP? No OCR.
ImageMagick? No OCR.
Poppler OCR? Not that I know.

But if you say "we got the message already" would you PLEASE answer the ones I asked, before you answer the ones I haven't asked?
Step 1 - Crop the PDF: cut away the margins with page numbers and anything else you do not want to OCR, using pdfscissors, a small Java program that runs directly from the net. Works on Windows, Linux, Mac
Step 2 - Use the convert program from ImageMagick to split the PDF into bitmaps. Works on Windows, Linux, Mac
Step 3 - Use Tesseract, an open-source OCR engine. Works on Windows, Linux, Mac
Step 4 - Use an advanced text editor (in my case Vim) to reformat the text, which by default has a line break at the end of every printed line. The paragraphs are separated by empty lines, so it is very easy to join all the lines that are not separated by an empty line. Works on Windows, Linux, Mac
(Commands for Vim in normal mode [press Escape twice to get there]:
:set tw=10000
gggqG
[gg means "go to the first line of the file"
gq means "rewrap the text to the text width set with the previous command :set tw=10000"
G means "to the end of the file"]
)
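For those who do not use Vim, the line joining in step 4 can also be sketched with awk in paragraph mode. This is an alternative to the Vim commands, not what was used above. It assumes paragraphs are separated by empty lines, and the function name join_paragraphs is made up for illustration.

```shell
#!/bin/sh
# join_paragraphs: read text on stdin, join the lines of each paragraph
# into one line, and keep one empty line between paragraphs.
join_paragraphs() {
    # RS = "" makes awk treat each block of non-empty lines as one record
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         { gsub(/\n/, " "); print }'
}

# Example use: join_paragraphs < book.txt > book-joined.txt
```

Words hyphenated across line ends are not rejoined by this; they still need a separate pass.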

Tesseract is now being developed by Google; it used to be a heavy-duty commercial OCR engine.
http://en.wikipedia.org/wiki/Tesseract_%28software%29

I have used this to OCR files on Linux.
For step 2 and 3 I have used following script:

Code:
#!/bin/sh
# Convert a page range of a PDF to text: ImageMagick rasterizes, Tesseract does the OCR.
STARTPAGE=13      # number of the first PDF page you wish to convert
ENDPAGE=253       # number of the last PDF page you wish to convert
SOURCE=book.pdf   # file name of the PDF
OUTPUT=book.txt   # final output file (new text is appended to it)
RESOLUTION=600    # resolution the scanner used (the higher, the better)

touch "$OUTPUT"
for i in $(seq "$STARTPAGE" "$ENDPAGE"); do
    index=$((i - 1))   # ImageMagick counts PDF pages from 0
    echo "processing page $i (index $index)"
    # -density goes before the input so that it applies when the PDF is rasterized
    convert -density "$RESOLUTION" "${SOURCE}[$index]" -monochrome page.tif
    tesseract page.tif tempoutput   # writes tempoutput.txt
    cat tempoutput.txt >> "$OUTPUT"
done
There are various graphical front-ends for Tesseract, so you do not HAVE to use commandline. But this is what worked for me.
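A side note on why the script subtracts 1 from the page number: ImageMagick selects a single page of a multi-page file with a zero-based index in square brackets, so page 13 of the PDF is book.pdf[12]. A tiny sketch of that mapping, where the function name page_selector is made up for illustration:

```shell
#!/bin/sh
# page_selector FILE PAGE
# Hypothetical helper: turn a 1-based page number into the
# zero-based "file[index]" form that ImageMagick expects.
page_selector() {
    echo "$1[$(( $2 - 1 ))]"
}

page_selector book.pdf 13    # prints book.pdf[12]
```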

Please note that the output from Tesseract is a txt file that doesn't contain formatting info, such as bold or italics, that other [commercial] programs can produce. An example of a really good commercial program is ABBYY FineReader.
At work I use a very old version of Recognita that doesn't process PDF, so I have to convert with ImageMagick first. But it was bundled with a scanner that we purchased a very long time ago.
At work I also use Readiris, which does process PDF; it was bundled with an HP scanner/printer/copier/fax combo some 5 years ago.

Last edited by kacir; 09-13-2014 at 06:59 AM.
All times are GMT -4.