MobileRead Forums > E-Book Formats > PDF

Old 08-03-2013, 07:16 AM   #496
Pagliuz
Junior Member
Posts: 3
Karma: 6000
Join Date: Aug 2013
Device: Sony PRS-T2
Problems with figures!

Hello willus,
First of all, I want to thank you a lot for the great work you have put in; this program is just awesome.

Now, the problem: I have a Sony PRS-T2 reader (nearly the same specs as the Kindle 2) and I want to convert some PC PDFs to reader PDFs.

As you can see in the attachment, when it comes to figures I get that horrible result: figures that should be much smaller and part of a page instead occupy an entire page and are split across several consecutive pages. The formulas and the text of the book look very good; the problem is only with some figures!

Can you help me?

Thanks in advance!

P.S. If you want, I can send you the original DJVU that I need to convert.

P.P.S. Should I use any particular options because I have a Sony rather than a Kindle?
Attached Thumbnails
errors.png (190.2 KB)
Old 08-03-2013, 04:14 PM   #497
willus
Fuzzball, the purple cat
Posts: 1,272
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
@Pagliuz -- Yes, please post the original DJVU file as an attachment. The (default) Kindle 2 options should work well for a Sony if it has the same size and screen resolution.
Old 08-04-2013, 04:20 AM   #498
Pagliuz
Junior Member
Thank you for answering. I have sent you a PM with a link to the original DJVU.

Yes, the PRS-T2 is nearly the same as the Kindle 2 (6-inch diagonal, 600x800, or a little less on the Sony, about 580x790), so I think those options should be OK!

Just for my own information: what is the problem with the PDF output? Why do those figures come out that way?

If you get a good result, could you tell me the options you used for the conversion, so that I can use them as a starting point for future conversions?

Last edited by Pagliuz; 08-04-2013 at 05:07 AM.
Old 08-04-2013, 05:30 PM   #499
willus
Fuzzball, the purple cat
@Pagliuz -- I think you have two reasonable options for this file:

Option 1. Because the text is only 4.7 inches wide if you strip away the margins, it reads pretty well without any re-flow if you use the standard "fit width" mode (with -n- to turn off native PDF output since your source is a DJVU file):

k2pdfopt -mode fw -n- myfile.djvu

This is sure not to mess up figures or alignment, so you get the best looking output, but it may be that the text is too small for you to read. If that is the case, then:

Option 2. Try to make sure k2pdfopt only re-flows the text and not the figures or equations:

k2pdfopt -col 1 -whmax 0.2 myfile.djvu

The "-col 1" will disable detection of multiple columns, and the -whmax 0.2 is an undocumented option which tells k2pdfopt not to re-flow any image taller than 0.2 inches (i.e. anything that's taller than a standard line of text). This will do a better job of keeping some of the figures from being interpreted as lines of text which get re-flowed. It's not perfect, but it seems to be better than the default conversion you got.

Another option is to add -mt 0.75, which will chop off the headers on each page so that the text is more continuous in the converted file, but this makes it harder to reference the original page numbers in the converted file.
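
For Option 1, a rough estimate of how much the text shrinks can be worked out from the numbers already given in this thread (a 6-inch, 600x800 screen and a 4.7-inch-wide text block). The sketch below is only an approximation: it assumes square pixels and that the text block is scaled to fill the full screen width with no margins, which may not match what k2pdfopt actually does. The function names are made up for illustration.

```python
import math

def screen_size_inches(diagonal_in, px_wide, px_high):
    """Return (width, height) in inches, assuming square pixels."""
    diagonal_px = math.hypot(px_wide, px_high)
    return (diagonal_in * px_wide / diagonal_px,
            diagonal_in * px_high / diagonal_px)

def fit_width_scale(text_width_in, screen_width_in):
    """Scale factor if the text block is shrunk to fill the screen width.

    Assumption: no margins are left on either side of the text.
    """
    return screen_width_in / text_width_in

# Figures from this thread: 6-inch, 600x800 screen; 4.7-inch-wide text block.
width_in, height_in = screen_size_inches(6.0, 600, 800)
scale = fit_width_scale(4.7, width_in)
print(f"screen width: {width_in:.2f} in, scale: {scale:.2f}")
# prints "screen width: 3.60 in, scale: 0.77"
```

Under those assumptions the page is displayed at a little over three quarters of its original size, so 10-point type would look like roughly 7.7-point type. That is why fit-width mode may or may not be readable, depending on the source font size.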
Old 08-04-2013, 06:20 PM   #500
Pagliuz
Junior Member
Thank you, willus. Both solutions work well; I think the first one is the best!
You saved my day!

Keep up the fantastic work, man!
Old 08-07-2013, 01:27 PM   #501
MaxStirner
Connoisseur
Posts: 71
Karma: 18500
Join Date: Apr 2013
Device: Kindle Touch, Paperwhite
Sorry to bother you again, willus, but maybe you remember my question about multi-language support. Yesterday I was browsing the Tesseract Google group for no specific reason and stumbled across this post:
https://groups.google.com/forum/#!ms...I/QMMHDV_GWRIJ
I don't know if this is of any help to you, but just in case.
Old 08-07-2013, 10:52 PM   #502
willus
Fuzzball, the purple cat
Max--thank you very much! I'll have to see how that works in their code base and make sure I can replicate it from k2pdfopt. I'll try to make sure that feature works in the next release.
Old 08-17-2013, 08:25 PM   #503
willus
Fuzzball, the purple cat
Dual language OCR example with k2pdfopt and Tesseract

Tesseract's dual language OCR actually seems to work in k2pdfopt v1.66, though not very well at all in my test case, where I mixed English and Chinese. I used this command:

k2pdfopt -ocr dual_english_chinese.pdf -mode copy -ocrlang language

where I substituted different values for language: eng, chi_tra, chi_tra+eng, and eng+chi_tra. See the attached files. The best results, by far, came from using chi_tra alone, which sort of defeats the purpose of dual language OCR(!). Each result was different, though, so I am assuming that the mechanism of passing lang1+lang2 to Tesseract is working and that this was just a particularly poor case for Tesseract. Maybe mixed European languages will work better?
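
To repeat this comparison without retyping the command, the four runs could be scripted roughly as follows. This is only a sketch: it reuses the flags and argument order from the command above, the helper names are hypothetical, and it prints the commands by default instead of running them. It also does not deal with output naming, so each real run would likely overwrite the previous converted file unless the output is renamed or moved in between.

```python
import subprocess

# Language values tried in the test above.
LANGUAGES = ["eng", "chi_tra", "chi_tra+eng", "eng+chi_tra"]

def ocr_command(pdf_path, language):
    """Build the argument list for one run, in the same order as the post."""
    return ["k2pdfopt", "-ocr", pdf_path, "-mode", "copy", "-ocrlang", language]

def run_all(pdf_path, languages=LANGUAGES, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = [ocr_command(pdf_path, lang) for lang in languages]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Requires k2pdfopt to be on the PATH.
            subprocess.run(cmd, check=True)
    return commands

if __name__ == "__main__":
    run_all("dual_english_chinese.pdf")
```

Passing the arguments as a list avoids any shell quoting issues with the "+" in values such as chi_tra+eng.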
Attached Thumbnails
dualocr_english_chinese_results.png (143.5 KB)
Attached Files
dual_english_chinese.pdf (45.5 KB)
Old 08-18-2013, 01:24 AM   #504
willus
Fuzzball, the purple cat
Quote:
Originally Posted by CheriePie View Post
I get an error when trying to use Tesseract OCR engine on the 64-bit windows platform (v1.65). After selecting Tesseract for the OCR choice, I've left all other choices in that selection at their default. The only other change I'm making is the Device settings (d) for Kindle Paperwhite.

So this is the command line I've built:

Selected options:
"C:\Users\Cherie\Documents\My eBooks\Calibre Library\Jesse
Petersen\Club Monstrosity (124)\Club Monstrosity - Jesse Petersen.pdf"
-dev kpw -ocr t -ocrhmax 1.5 -ocrvis s



After hitting enter to begin the conversion, I get the following errors:

Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not find Tesseract data (env var TESSDATA_PREFIX = (not assigned)).
Using GOCR v0.49.

Reading 233 pages from C:\Users\Cherie\Documents\My eBooks\Calibre Library\Jesse
Petersen\Club Monstrosity (124)\Club Monstrosity - Jesse Petersen.pdf ...

Detecting document orientation ... No rotation necessary.

SOURCE PAGE 1 of 233 (7.5 x 9.4 in) ... 0 new pages saved.


And then it stops working completely at page 2, throwing up the standard k2pdfopt.exe has stopped working error dialog from Windows.

I don't get these errors using the Gocr engine, but I guess Tesseract is more accurate so I'd like to try to use that one if possible.
@CheriePie--Are you still around? If so, please send me a private message.
Old 08-21-2013, 01:19 AM   #505
willus
Fuzzball, the purple cat
Quote:
Originally Posted by kundor View Post
...By the way, it takes about 12 hours to OCR this document, which seems kind of silly when there is already a hidden text layer. Since it includes the location data, it seems like it might be possible to keep track of which words go with each chunk while you're slicing up the pages. Have you considered doing that?
This feature (using the native text in a PDF file in place of OCR) will be available in the next k2pdfopt release. I've implemented it (a la the -ttt option in the mudraw utility that comes with MuPDF) and tested it.
Old 08-25-2013, 03:13 PM   #506
state
Junior Member
Posts: 1
Karma: 6000
Join Date: Aug 2013
Device: kindle touch
Hi there,

I am very new to e-readers in general, and also very new to k2pdfopt. To make matters worse, I am not very savvy with computers. I did attempt to set up Tesseract and the environment variable, but I still get the error shown in the screenshot. Any ideas? Do I have to set another environment variable for k2pdfopt itself?

Also, is there a k2pdfopt guide for dummies? I appreciate the help sections on the site, but they still move a bit too fast for me. I will be using the programme exclusively to convert linguistics PDFs (typically two-column, with diagrams and charts; classic science articles). Thank you!
Attached Thumbnails
kdpdfopt ocr.jpg (194.8 KB)
Old 08-26-2013, 12:37 AM   #507
willus
Fuzzball, the purple cat
If you go to your C-drive, then the tesseract-ocr folder, there should be a "tessdata" folder, and inside that folder should be the English training files, which need to be extracted from the tar.gz file that you download from the Tesseract web site. It's a bit involved. Have you considered using Wallauer's Windows GUI from my third-party contributions page? I believe it will install the Tesseract files for you.

Are your linguistics PDFs mostly scanned or not? If they aren't scanned (if they are generated directly from a source file with the original text), you should be able to use "-mode 2col" and skip OCR altogether, e.g.

k2pdfopt -mode 2col myfile.pdf

Otherwise, OCR is probably the way to go. Sorry, there's no "for dummies" guide at the moment. All I've got is my help pages, but again, the Windows GUI may make things easier for you. You might also want to watch the video on the Native PDF page.

Edit: I've attached a screenshot of my Tesseract data folder (on my D drive). To OCR English text, you need the files shown, which have to be extracted from the downloaded training file (ends in .tar.gz).
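
A quick way to check the setup described above is to look for the English data file from a small script. The sketch below makes some assumptions that are not stated in this thread: that the file is called eng.traineddata (the usual Tesseract naming for language data), and that TESSDATA_PREFIX may point either at the tessdata folder or at its parent, so both are checked. The function name is hypothetical.

```python
import os
from pathlib import Path

def find_tessdata(language="eng", env=None):
    """Return the path of <language>.traineddata if it can be found, else None."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        # Corresponds to the "(not assigned)" case in the error message.
        return None
    base = Path(prefix)
    # Assumption: the variable may point at the tessdata folder itself
    # or at its parent folder, so both locations are tried.
    for folder in (base / "tessdata", base):
        candidate = folder / f"{language}.traineddata"
        if candidate.is_file():
            return candidate
    return None

if __name__ == "__main__":
    found = find_tessdata()
    print(found if found else "English Tesseract data not found; check TESSDATA_PREFIX")
```

If this prints a path, the environment variable and the extracted files are at least in a consistent place; if not, either the variable is unset or the training files were not extracted into the tessdata folder.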
Attached Thumbnails
tessfiles_english.png (71.0 KB)

Last edited by willus; 08-28-2013 at 08:31 AM. Reason: Typo corrected
Old 08-27-2013, 11:57 PM   #508
mike2003
Member
Posts: 11
Karma: 18680
Join Date: Aug 2013
Device: none
Question

Why do I sometimes get one or more very small lines of text?


Last edited by WT Sharpe; 08-28-2013 at 12:53 AM. Reason: Hyperlink removed.
Old 08-28-2013, 08:27 AM   #509
willus
Fuzzball, the purple cat
One small text line between two large ones is a bit unusual (at least I haven't seen it much), and it looks like you have plenty of space between the words. I would need to see your source document and any specific options you used for the conversion. It looks like you had a hyperlink removed--doesn't say why. Maybe it is copyrighted? Can you please PM (private message) it to me (or just the troublesome page)?

One option you might try is to reduce the required gap between words that enables breaking lines. This is specified by -ws, which defaults to 0.375. Maybe try -ws 0.3 or -ws 0.25.
Old 08-28-2013, 09:32 AM   #510
mike2003
Member
Quote:
Originally Posted by willus View Post
One small text line between two large ones is a bit unusual
[Image exceeds guidelines - MODERATOR]

Sometimes I get a block of text in a small size.
Quote:
Originally Posted by willus View Post
This is specified by -ws, which defaults to 0.375. Maybe try -ws 0.3 or -ws 0.25.
Nice, it looks like it worked with -ws 0.15.

Last edited by Dr. Drib; 01-15-2014 at 11:25 AM.