Old 10-06-2014, 12:20 PM   #1
ittiandro
Connoisseur
Posts: 64
Karma: 24500
Join Date: Nov 2013
Device: JuliusvonJD
Problems converting K2PDF Opt files to EPUB

I have converted a number of scanned PDF books with the k2PDF OPT application in order to read them on my Galaxy tablet, which was not possible before the conversion. The k2PDF OPT output works very well with EZPDF Reader. However, I wanted to convert the books to EPUB because EPUB readers such as Cool Reader and FBReader have a much better interface and allow more control over the page layout (background colours, font properties, etc.).
I have tried several editing/conversion apps (Nuance, ABBYY, Wondershare, Power PDF, etc.), with and without the OCR option, but I hit a brick wall: the EPUB conversion takes place, but the font in my tablet's EPUB reader is very small, almost unreadable, and its size cannot be changed. In addition, the converted pages sit in a smaller window of their own on a white background, and nothing in them can be changed. Only the app's background colour outside the white area can be changed, which serves no purpose.
I get the impression that the EPUB conversion cannot get rid of the original scanned PDF features and that what I am getting is still an image. Is there any way to convert to a true, fully controllable EPUB, like all the other original EPUBs I have?

Thanks

Ittiandro
Old 10-07-2014, 03:31 PM   #2
eschwartz
Ex-Helpdesk Junkie
Posts: 19,422
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
You will need to OCR it.

If you do this sort of stuff a lot, it might be worth it to invest in ABBYY Finereader, which is supposed to be the best OCR software available.

If it is a one-time thing, you may want to just settle for Tesseract, the best open-source OCR.

Our resident OCR expert here on MobileRead, @Tex2002ans, would heartily recommend the investment in purchasing Finereader.
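
For anyone trying the Tesseract route, here is a minimal Python sketch of how one page could be sent through its command-line tool. Tesseract works on page images rather than on PDFs, so this assumes the scanned pages have already been exported as images; the file names `page-001.png` and `page-001` are made up, and the helper only builds the command instead of running it.

```python
import shlex


def tesseract_command(image_path, output_base, lang="eng", output_format="hocr"):
    """Build the argument list for one Tesseract run (nothing is executed here).

    output_format names a Tesseract config such as "txt", "hocr" or "pdf".
    """
    return ["tesseract", image_path, output_base, "-l", lang, output_format]


# Hypothetical file names, for illustration only.
cmd = tesseract_command("page-001.png", "page-001")
print(shlex.join(cmd))
```

With Tesseract installed, passing `cmd` to `subprocess.run(cmd, check=True)` would run the OCR for that page. The hOCR output keeps word positions, which helps if the text has to be cleaned up later.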
Old 10-07-2014, 09:32 PM   #3
ittiandro
Connoisseur
Posts: 64
Karma: 24500
Join Date: Nov 2013
Device: JuliusvonJD
Quote:
Originally Posted by eschwartz View Post
You will need to OCR it.

If you do this sort of stuff a lot, it might be worth it to invest in ABBYY Finereader, which is supposed to be the best OCR software available.
Thank you.
I tried ABBYY Fine Reader 11 Corporate Ed on my friend's computer. In the New Task tab I chose the E-Book, File PDF (Image) to EPUB option,
which seemed to be exactly what I wanted to do, but the EPUB conversion does not render charts and diagrams, only some scribble. In addition, I have not seen any OCR option, unless it kicks in automatically.
The source PDF (Image) file was already a k2PDF OPT conversion of the original scanned PDF file. I don't know whether I should have used the original file instead. It is getting complicated, but there must be a way out!

Thanks

Ittiandro
Old 10-08-2014, 02:59 AM   #4
Toxaris
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
You would be better off OCRing it to a Word or HTML file than to an ePUB file. Then you would be able to clean up the mess more easily. Charts and diagrams should be converted into images: after analysis of the PDF, manually set those areas to images.
Don't run the whole OCR process in one go (unless it is a very simple book). First run Analyze to check that all the areas are correct, and only then run the Read phase.

It is always better to start from the original of course.
Old 10-08-2014, 07:48 PM   #5
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by eschwartz View Post
Our resident OCR expert here on MobileRead, @Tex2002ans, would heartily recommend the investment in purchasing Finereader.
I wrote a lot about OCR in this thread:

https://www.mobileread.com/forums/sho...d.php?t=243327

And many of the pitfalls of the free solutions compared to the paid, and areas where OCR is lacking, and areas where you will have to do a lot of manual fixing.

For a simple novel, the free stuff would probably work just fine... but once you start getting into more complicated books/layouts, things start getting hairy with the free solutions. I have a more detailed list in that post, but things like footnotes, figures/tables, dropcaps, superscript/subscript, etc. etc.

Also, if you follow the pyramid of links to more of my explanation posts, they explain every single thing with OCR, and how to go from PDF -> EPUB.

Quote:
Originally Posted by ittiandro View Post
I tried ABBYY Fine Reader 11 Corporate Ed on my friend's computer. In the New Task tab I chose the E-Book, File PDF (Image) to EPUB option, which seemed to be exactly what I wanted to do, but the EPUB conversion does not render charts and diagrams, only some scribble. In addition, I have not seen any OCR option, unless it kicks in automatically.
Once you open up Finereader, you need to push File - Open PDF File/Image, and find where your PDF is and open it. After you open the PDF, Finereader should look like something along these lines. What you want to do then is press Read:

[Image: Finereader1.png]

Finereader will then take a while trying to figure out the layout of the book (Text/Images/Tables), and OCR the entire book.

Text will get a Green rectangle around it, Images get a Red rectangle, Tables get a Blue rectangle.

Then you will have to manually go through and fix any mistakes Finereader made in the layout. For example, here you can see that the dropcap 'T' was accidentally recognized as an image (see the red box):

[Image: Finereader2.png]

What you want to do is use the Text/Picture/Table buttons, or readjust the boxes by dragging the edges:

[Image: Finereader3.png]

You can see that an unrecognized box is a slightly lighter color (Light Green/Blue/Red). You want to right click on the page, and press "Read Selected Pages":

[Image: Finereader4.png]

Then you have to go through the entire book, making sure that all your charts are in Image (Red) boxes, all the text is in Text (Green) boxes, and all the tables are in Table (Blue) boxes.

Quote:
Originally Posted by ittiandro View Post
The source PDF (Image) file was already a k2PDF OPT conversion of the original scanned PDF file. I don't know whether I should have used the original file instead.
Always work from as close to the original source as possible. In this case, you have the original PDF, so use it.

Quote:
Originally Posted by Toxaris View Post
You would be better off OCRing it to a Word or HTML file than to an ePUB file.
The EPUB export is definitely buggy with footnotes in particular (makes me want to pull my hair out). It tries to automatically create links at the end of the chapters that jump back/forth (like in your typical ebook), but many times entire footnotes just disappear into thin air, or it never "links" them (and just keeps the footnotes in the regular flow of text). Besides that, I haven't run into many other problems with EPUB output.

Depending on which tools you are more comfortable with, you might work much better in Word. If you do export to DOC(X), I would highly recommend Toxaris's ePUB Tools (see the bottom of his signature).

If you are more comfortable working directly in HTML, you might prefer the EPUB output.

Either way, you would still have to do a lot of A/B checking and fixing.

Last edited by Tex2002ans; 10-08-2014 at 07:52 PM.
Old 10-09-2014, 12:07 AM   #6
ittiandro
Connoisseur
Posts: 64
Karma: 24500
Join Date: Nov 2013
Device: JuliusvonJD
Thank you Toxaris

I have redone the EPUB conversion following your suggestion to first convert diagrams and charts into images. One step forward is that the document now opens as an EPUB in my Android e-reader (Cool Reader) and I have gained full control of the font and the page layout. The text is fine; however, charts, drawings and diagrams are still not rendered at all, which means they have not really been converted into images. So I am back to square one.
Here is what I did to convert charts and diagrams to images: on the IMAGE side of each open page, there is a “change area type” function which changes the way different areas of the page are read (as text, table, picture, etc.). When analyzing the pages, I kept the TEXT option for the text areas (obviously!), but for the chart and diagram areas I manually selected the “picture” option, which I took to be the same as the term IMAGE you used in your reply. It still doesn’t work.
I also converted to HTML, but I don’t know what to do after the conversion. I suppose the HTML conversion is an intermediate step towards a fully functional EPUB that properly shows charts and diagrams…

Perhaps it is because I am still not too clear about the sequence of steps in ABBYY Fine Reader: first there is an OPEN button on the menu bar, which opens the file, then there is a READ button, and then there is supposed to be an OCR function.
Well, there is no OCR button or function in the menu, and I also don’t understand the difference between OPEN, READ and ANALYSE. In fact, as soon as I open the document and the pages start scrolling, it says that the pages are being RECOGNIZED, so they must have been read somehow! Why then a separate READ button?
If anybody wants to lend me a helping hand, I’d be more than happy to hear from you or other experts, possibly familiar with the ABBYY Fine Reader software.
Thank you

Ittiandro
Old 10-09-2014, 01:34 AM   #7
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by ittiandro View Post
Well, there is no OCR button or function in the menu and also I don’t understand the difference between OPEN, READ and ANALYSE.
Open - This is how you get a PDF/whatever to open up in Finereader. Normally, you would only feed a single PDF into it, BUT, you are also free to open up multiple PDFs/images at once (for example, I recently fed in an entire journal, where each article was split into separate PDFs).

Analyze - This does only the step of recognizing which areas are Text/Images/Tables. This goes through a page and puts the Light Green/Red/Blue recognition boxes. Also, if you mark any areas with a "Recognition Area" box (Gray), it will look at that specific section, and determine if what is in the Gray Box is Text/Images/Tables.

Read - This performs the actual OCR. If you haven't ANALYZED the page yet, it will run that step as well. This turns all of those Light Green/Red/Blue boxes into Dark Green/Red/Blue boxes.
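
The relationship between Analyze and Read can be pictured with a toy Python model. This is purely illustrative and has nothing to do with Finereader's real internals or any API it exposes; the `Area`, `Page`, `analyze` and `read` names are invented, and `detected_kinds` stands in for whatever the layout detection would find.

```python
from dataclasses import dataclass, field


@dataclass
class Area:
    kind: str           # "text", "picture" or "table"
    read: bool = False  # False = light box (analyzed only), True = dark box (OCRed)


@dataclass
class Page:
    areas: list = field(default_factory=list)
    analyzed: bool = False


def analyze(page, detected_kinds):
    """Layout step only: create unread areas, one per detected region."""
    page.areas = [Area(kind) for kind in detected_kinds]
    page.analyzed = True


def read(page, detected_kinds):
    """OCR step: runs the analysis first only if the page has none yet."""
    if not page.analyzed:
        analyze(page, detected_kinds)
    for area in page.areas:
        area.read = True


page = Page()
analyze(page, ["text", "text"])   # suppose a figure was misdetected as text
page.areas[1].kind = "picture"    # manual fix before the OCR pass
read(page, ["text", "text"])      # analysis is skipped, so the fix survives
```

In this model, running Analyze first and Read afterwards gives a chance to correct the area types before any OCR time is spent.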

Quote:
Originally Posted by ittiandro View Post
In fact as soon as I open the document and the pages start scrolling, it says that the pages are being RECOGNIZED, so they must have been read somehow! Why then a separate READ button?
Depends on how you have your Finereader set up. For example, I have mine set to do NOTHING when I open up a document.

I run the Analysis and Reading separately as manual button presses... when you are working on very large documents, it would be a huge waste of CPU power to Analyze + Read (OCR) the entire document, only to go through it and do manual fixes, only to run those steps all over again.

If this bothers you, you are free to change this setting under Tools - Options - Scan/Open. You can then set:
  • "Automatically read acquired page images"
  • "Automatically analyze acquired page images"
  • "Do not read and analyze acquired page images"

I personally prefer option 2 or 3.

Quote:
Originally Posted by ittiandro View Post
If anybody wants to lend me a helping hand, I’d be more than happy to hear from you or other experts, possibly familiar with the ABBYY Fine Reader software.
Feel free to look at my post above as well. I made nice images for you, that was a lot of hard work.

Last edited by Tex2002ans; 10-09-2014 at 01:43 AM.
Old 10-10-2014, 09:41 AM   #8
ittiandro
Connoisseur
Posts: 64
Karma: 24500
Join Date: Nov 2013
Device: JuliusvonJD
Thanks Tex2002ans. You did terrific work.
I'll look thoroughly at your suggestions/explanations and let you know if I succeed or if I need further help.
One thing, though: you mention the HTML conversion as a possibility. If it is meant to be a final, self-contained solution, without further conversions, it won't be good for me, because my goal is an EPUB that matches the original PDF, especially in the non-text parts (diagrams, etc.). If, on the other hand, the HTML conversion is an intermediate step that facilitates the final EPUB conversion, I might be interested in trying it, but I don't know what to do after the HTML (or Word) conversion.
Let me try your suggestions first, and then I might get back to the Forum.

Thanks anyway

Ittiandro
Old 10-10-2014, 12:15 PM   #9
ittiandro
Connoisseur
Posts: 64
Karma: 24500
Join Date: Nov 2013
Device: JuliusvonJD
Rendering non-text areas for EPUB conversion with ABBYY Fine Reader

I have attempted a new EPUB conversion following your instructions on selecting the proper area types for the different parts of each page, but the non-text areas are still rendered improperly, or not at all, when I open the EPUB file on my tablet.
I enclose a few sample pages. If somebody would be kind enough to have a look at them and maybe tell me what I am not doing or doing wrong, I'd appreciate it.
Being a scientific book (physics), it has quite a few math symbols and special characters which may be beyond proper recognition by the ABBYY software (and by me, since I am not into maths), but this does not worry me too much. All I am striving for is a proper rendition of the basic figures, diagrams and tables, in order to get a basic understanding of some of the issues.

Thanks

Ittiandro
Attached Files
File Type: pdf Schumacher_SamplesSource.pdf (146.5 KB, 183 views)
Old 10-10-2014, 09:48 PM   #10
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by ittiandro View Post
I have attempted a new EPUB conversion following your instructions in regard to selecting the proper area types for different parts of the page(s), but these non-text areas are still not rendered properly or not rendered at all in the conversion when I open the EPUB file in my tablet.
.... I can't gather anything from this PDF. Can you maybe take screenshots of what your Finereader page looks like? Do you have red boxes around the Figures?

And you haven't shown what the EPUB output is either.

If you push View - Image and Text Window, you should be able to see what Finereader will be outputting.

[Image: FinereaderSideBySide.png]

Look at my screenshots of Finereader above, you see the left half of the screen shows the original PDF + Green/Red boxes? And the right half of my screen where it shows the OCR text (with blue highlights around unsure characters)? Does it look similar on your end?

The stuff that appears in the "View" Window, is what will appear when you export the file. Can you see the figures in the View window?

Quote:
Originally Posted by ittiandro View Post
I enclose a few sample pages. If somebody would be kind enough to have a look at them and maybe tell me what I am not doing or doing wrong, I'd appreciate it.
Is this the original source? Perhaps you accidentally sent a few pages out of Finereader?

If that is the case, you might also want to go into Options - Save - PDF, and set "Image Settings" to "Best quality (source image resolution)". This will make sure the PDF output matches the original, and doesn't get super compressed into death.

You might also want to set Save Mode to "Text Under the Page Image". (This makes sure that the original scan is still showing, and it just hides the OCRed text behind it).

Quote:
Originally Posted by ittiandro View Post
Being a scientific book ( physics) there are quite a few math symbols and special characters which may be beyond proper recognition by the ABBYY software ( and by me, since I am not into maths), but this does not worry me too much. All I am striving for is to get a proper rendition of the basic figures, diagrams and tables, in order to get a basic understanding of some of the issues.
Ouch... I would highly recommend against trying to make an EPUB of a physics book, ESPECIALLY for your first time. There are WAY too many figures, complex equations, sub/superscripts, inline equations, greek/mathematical/weird symbols (that Finereader won't get correct).

It would absolutely take forever, even for someone who knows what they are doing (let me tell you... I wouldn't touch digitizing a physics book with a ten foot pole). :P

Last edited by Tex2002ans; 10-10-2014 at 09:58 PM.
Old 10-11-2014, 07:51 AM   #11
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
You would end up making images of most of the pages in the book.

Tables are handled reasonably well by Finereader, but you need to go through the book page by page and see how it has handled them. There will be a number of pages where an obvious table is handled as text. Right-click on the page display and delete all the areas, then click the table button and select the table area; repeat for all pages. Do the same for pictures with the pictures button. Then read (recognize) the book and it should do better.

Finereader will save to EPUB, but the result may not be to your liking. Saving to HTML is better; then import it into Sigil or the Calibre editor.
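
To show why HTML is a workable intermediate step: an EPUB is essentially a ZIP archive of XHTML pages, image files and a few small index files, so a figure only appears if it exists as an image file that the page references. Below is a minimal Python sketch of that packaging using only the standard library. It assumes a single chapter and PNG figures; the identifier, the modification date and the `build_epub` name are placeholders, and the result is not guaranteed to pass epubcheck. Sigil or Calibre do all of this for you.

```python
import zipfile
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

NAV_XHTML = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Contents</title></head>
<body><nav epub:type="toc"><ol>
<li><a href="chapter1.xhtml">Chapter 1</a></li>
</ol></nav></body>
</html>
"""


def build_epub(out_path, title, chapter_xhtml, images):
    """Pack one XHTML chapter plus its figures into an EPUB file.

    images maps a file name (e.g. "fig1.png") to the PNG bytes; the chapter
    is expected to reference each figure as <img src="images/NAME"/>.
    """
    image_items = "\n".join(
        f'    <item id="img{i}" href="images/{name}" media-type="image/png"/>'
        for i, name in enumerate(sorted(images))
    )
    # Placeholder identifier and date; a real book needs its own values.
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>{escape(title)}</dc:title>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2014-10-12T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ch1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
{image_items}
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>
"""
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/nav.xhtml", NAV_XHTML)
        zf.writestr("OEBPS/chapter1.xhtml", chapter_xhtml)
        for name, data in images.items():
            zf.writestr(f"OEBPS/images/{name}", data)
```

A call such as `build_epub("book.epub", "My Book", chapter_text, {"fig1.png": png_bytes})` would write the file, where `chapter_text` is the cleaned-up XHTML and `png_bytes` is the content of an exported figure.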

Digitizing textbooks is a constant request, but most readers simply do not support many scientific symbols. Nor, by design, is there support for the fixed layout you mentioned: text reflows in EPUBs, which makes fixed pages difficult to impossible.

The solution for these books as it stands is large tablets that can display the PDFs at full size. But you aren't going to stick them in your coat pocket!
Old 10-12-2014, 12:58 PM   #12
ittiandro
Connoisseur
Posts: 64
Karma: 24500
Join Date: Nov 2013
Device: JuliusvonJD
Quote:
Originally Posted by Tex2002ans View Post
.... I can't gather anything from this PDF. Can you maybe take screenshots of what your Finereader page looks like? Do you have red boxes around the Figures?

And you haven't shown what the EPUB output is either.


Is this the original source? Perhaps you accidentally sent a few pages out of Finereader?


Ouch... I would highly recommend against trying to make an EPUB of a physics book, ESPECIALLY for your first time. There are WAY too many figures, complex equations, sub/superscripts, inline equations, greek/mathematical/weird symbols (that Finereader won't get correct).

It would absolutely take forever, even for someone who knows what they are doing (let me tell you... I wouldn't touch digitizing a physics book with a ten foot pole). :P

The sample pages I sent you were not from the original scanned PDF book (I must have lost it) but from a k2PDF OCR conversion.
As you can see from the attached, the non-text items (drawings, diagrams, etc.) are (almost) O.K., with the exception of page 2, where the drawing at the top of the original page is missing entirely. All in all, the conversion is severely flawed because all of the special math characters and symbols of physics are misread and converted into other characters. In addition, the EPUB conversion does not follow the original pagination, and spaces and breaks between paragraphs and/or between the drawing explanations and the main text are ignored. Perhaps, as you said, physics texts are beyond recognition by ABBYY or any similar software, however sophisticated it may be.
In view of this, I think there is no point in pursuing this matter.
Since I believe, however, that you, like me, are one of those for whom the trip is just as much fun as the final destination, even though it may be unreachable, I attach for your information the screenshots of a few sample pages as read by ABBYY, as well as their EPUB conversion.

I am now embarking on another conversion job which I thought would be easier, but I am having second thoughts.
I have some scanned PDF books containing Greek texts with the English translation side by side.
Even though it is classical Greek, I thought that ABBYY could read it because the ancient Greek characters are exactly the same as those of modern Greek (with the exception of a number of accents and diacritic signs which have been dropped in modern Greek) and the language option of ABBYY lists Greek as one of the languages it can read.
Unfortunately, that is not the case: Greek characters (or something like them!) appear in the conversion, but many of them are missing or distorted beyond recognition, words are jumbled together, etc. All in all, the text is readable only with difficulty or plainly unreadable. Perhaps Greek readers have something to say.
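
A side note on the accents and diacritics mentioned above: in Unicode, most accented polytonic letters are separate code points in the Greek Extended block (U+1F00 to U+1FFF), so they are more than decoration on top of the modern alphabet. The Python sketch below, using only the standard `unicodedata` module, counts such letters and shows what a word looks like with its diacritics stripped. It says nothing about how ABBYY works internally, and the helper names are made up for illustration.

```python
import unicodedata


def count_polytonic(text):
    """Count characters that fall in the Greek Extended block (U+1F00-U+1FFF)."""
    composed = unicodedata.normalize("NFC", text)
    return sum(1 for ch in composed if 0x1F00 <= ord(ch) <= 0x1FFF)


def strip_diacritics(text):
    """Decompose each letter, drop the combining marks, and recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)


# The same word in polytonic (ancient) and monotonic (modern) spelling.
ancient = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"
modern = "\u03ac\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"

print(count_polytonic(ancient), count_polytonic(modern))
print(strip_diacritics(ancient) == strip_diacritics(modern))
```

The two spellings share the same base letters once the marks are removed, but the first letter of the ancient spelling is a different character from the modern one.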

Thanks again

Ittiandro
Attached Files
File Type: pdf ShuAbbyy1_Combined.pdf (829.4 KB, 253 views)
File Type: pdf ShuABBYYEPUB5_Combined.pdf (1.04 MB, 237 views)
Old 10-12-2014, 03:06 PM   #13
mrmikel
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
What are your system language settings while doing the recognition?
Old 10-12-2014, 09:22 PM   #14
Tex2002ans
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by ittiandro View Post
As you can see from the attached, the non-text items ( drawings, diagrams, etc) are (almost) O.K. with the exception of page 2 where the drawing at the top of the original page is squarely missing.
Ok, so on the left half, do you see how the "Image Box" (red) is not covering the entire image?

For example, on Page 3, the future/past figure is partially recognized as "Text", while the rest of the diagram is not recognized at all. You have to manually adjust this so that the entire figure is in ONE Image Box (red). You do this by hitting the picture button up top, and dragging a red square around the entire thing:

[Image: FinereaderPage3Before.png]
[Image: FinereaderPage3After.png]

Then you have to manually go through the entire book and do a similar thing for the rest of the pages. Any images you find that are NOT in a box, you have to put the correct Text/Image/Table box around it:

[Image: FinereaderPage1.png]
[Image: FinereaderPage2After.png]

Quote:
Originally Posted by ittiandro View Post
All in all, the conversion is severely flawed because all of the special math characters and symbols of physics are misread and converted into other characters.
Yep yep, and things like hats, vectors, dots, overline, integrals, inline equations, Greek symbols... those are all going to give you an extremely hard time.

Here is a large discussion we had when talking about digitizing math texts to ebooks. You would probably run into all the same exact problems:

https://www.mobileread.com/forums/sho...d.php?t=228413

There is a reason why many of these non-fiction books are not in EPUB yet. If you don't have the original source files, it would just take way too much manpower to digitize the entire thing. It is just not worth it for most books (extremely high cost to digitize, and very low sales).

Quote:
Originally Posted by ittiandro View Post
I am embarking now on another conversion job which I thought would be easier, but I am having second thoughts..
I have some PDF ( scanned) books containing Greek texts with the English translation side by side.
Sounds like another one that is on the "very hard" side of things. Multi-column texts are a pain in the butt. (I am currently digitizing a monthly newsletter, 2 and 3 column text. It is QUITE annoying and painstakingly slow.)

This is especially true in the case of two separate columns of text. Finereader is designed to tackle multi-column text such as journals, where the text flows from the bottom of the left column to the top of the right column. Finereader will auto-merge those paragraphs/sentences for you because it assumes it is a continuation of the same text.

In your case, you would want left column -> left column on next page, right column -> right column on next page.

I would not recommend tackling this conversion either, unless you are MUCH more familiar with the tools.
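To make the two reading orders concrete, here is a rough Python sketch. It has nothing to do with Finereader's internals; the `Page` tuple and both function names are made up for illustration, and it assumes you already have each page's left and right column as separate strings.

```python
from typing import List, Tuple

# Hypothetical input: each page is (left_column_text, right_column_text).
Page = Tuple[str, str]


def journal_order(pages: List[Page]) -> List[str]:
    """Journal-style flow: left column, then right column, page by page."""
    out: List[str] = []
    for left, right in pages:
        out.extend([left, right])
    return out


def parallel_order(pages: List[Page]) -> Tuple[List[str], List[str]]:
    """Parallel-text flow: every left column forms one stream,
    every right column forms a second, independent stream."""
    lefts = [left for left, _ in pages]
    rights = [right for _, right in pages]
    return lefts, rights


pages = [("Greek p1", "English p1"), ("Greek p2", "English p2")]
print(journal_order(pages))   # one merged stream (what a journal layout wants)
print(parallel_order(pages))  # two separate streams (what a bilingual text wants)
```

The first function is roughly what an OCR tool assumes for a journal; the second is the order a side-by-side translation needs.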

Quote:
Originally Posted by ittiandro View Post
Even though it is classic Greek, I thought that ABBYY could read it because the ancient Greek characters are exactly the same as those of modern Greek (with the exception of a number of accents and diacritic signs which have been dropped in modern Greek) and the language option of ABBYY lists Greek as one of the languages it can read.
Ouch again... Hopefully your scan is much higher quality as well, those accent signs are brutal. It takes me forever just to transcribe a sentence of Greek (heck, even single words take a while in some cases).

Doitsu pointed me to this resource, which might make it easier to do words with Greek Symbols:

http://www.lexilogos.com/keyboard/greek_ancient.htm

I also enjoy the organization of this Wikipedia article in order to visualize some of those harder accented characters:

https://en.wikipedia.org/wiki/Greek_diacritics
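If you want to see why those accented characters are so easy to get wrong, a small Python sketch using the standard `unicodedata` module shows how one polytonic letter is really a base letter plus stacked marks. The helper names `describe` and `strip_marks` are just made up for this example.

```python
import unicodedata


def describe(ch: str) -> list:
    """Names of the code points in the decomposed (NFD) form of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


def strip_marks(text: str) -> str:
    """Drop combining marks (category Mn), keeping only the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


alpha = "\u1f04"  # alpha with smooth breathing (psili) and acute accent (oxia)
print(describe(alpha))
print(strip_marks("\u1f04νθρωπος"))  # compare OCR output against unaccented forms
```

Stripping the marks like this can be handy for a rough comparison of OCR output against a word list, but it throws away information, so it is only a checking aid.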

Quote:
Originally Posted by ittiandro View Post
Unfortunately, it is not the case: Greek characters ( or something like them!) appear in the conversion, but many of them are missing or distorted beyond recognition, words are jumbled together, etc. All in all, the text is readable with difficulty or plainly unreadable. Perhaps Greek readers have something to say.
Again, this is probably going to be an even MORE painstaking undertaking, but you might have to Train Finereader to make this case slightly more accurate. You do this by going into Tools - Options - Read, and under "Training", you will want to select "Use built-in and user patterns" or "Use only user pattern".

[Attached image: FinereaderPattern.png]
  • Use built-in and user patterns
    • Finereader will do its best to OCR, but it will ask you whenever it runs across something it is "unsure" about.
  • Use only user pattern
    • You will have to build the OCR from scratch, character by character.
    • This is more useful if you have a font/language that is just absolutely abysmal in Finereader, or the scan is quite poor (but still human readable).

Then you want to open the Pattern Editor, and create a new Pattern (probably called "Ancient Greek"). Now, you will also want to make sure that "Read with training" has a checkmark in it. If you press Read on your book now, the "Pattern Training" window will pop up:

[Attached image: FinereaderPatternTraining.png]

Now you will have to go through character by character, and tell it exactly what Greek + diacritic character that is. Be warned, this is PAINFULLY slow, absolutely brutal, and most likely will only work in THAT SPECIFIC FONT (I doubt you will be working on books with that exact font again).

Side Note: In my opinion, huge waste of time, better to spend your limited manpower elsewhere.

Side Note: While looking up information on this Greek stuff, I stumbled upon: http://wiki.digitalclassicist.org/OCR_for_ancient_Greek

Which led to this: http://ancientgreekocr.org/

Perhaps that might work better than Finereader's default Greek recognition.
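I have not tested this myself, so take it as a sketch under assumptions: that the project above builds on the Tesseract OCR engine, that the `tesseract` command-line tool is installed, and that Ancient Greek language data is available under the code `grc`. The file names and both function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path: str, out_base: str, lang: str = "grc") -> list:
    """Build the command line: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, out_base, "-l", lang]


def run_ocr(image_path: str, out_base: str) -> bool:
    """Run OCR only if tesseract is on PATH and the image exists."""
    if shutil.which("tesseract") is None or not Path(image_path).exists():
        return False
    result = subprocess.run(build_tesseract_cmd(image_path, out_base), check=False)
    return result.returncode == 0


# Hypothetical file names; the text output would land in page_001.txt.
cmd = build_tesseract_cmd("page_001.png", "page_001")
print(" ".join(cmd))
```

If that works on a sample page, it would be worth comparing the result against Finereader's Greek output before committing to either tool.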

Last edited by Tex2002ans; 10-12-2014 at 09:25 PM.
Old 10-14-2014, 03:00 PM   #15
ittiandro
Connoisseur
Quote:
Originally Posted by Tex2002ans View Post
Ok,
Especially in the case of two separate columns of text, Finereader is designed to tackle multi-column text such as journals. Where the text flows from the left bottom -> right top. Finereader will auto-merge those paragraphs/sentences for you because it assumes it is a continuation of the same text.


Perhaps that might work better than Finereader's default Greek recognition.
Thanks
I realized my error in not properly defining the reading areas, and in fact the non-text items came out much better after implementing your hints. The text rendering still remains problematic, though.
I could probably improve the pagination and correct the misprints manually, but, as you said, it would be painstakingly slow. Going into fine-tuning such as TRAINING and PATTERN-CREATING sounds a bit complicated, and in the end I don't know if the result will be worth the time consumed. I really think that, even discounting my lack of proficiency on this subject, we are reaching the limits of what readers like ABBYY can do, however sophisticated they may be.
Concerning the rendition of Greek texts, you mention that ABBYY may have problems with TWO-COLUMN material and suggest some ways to circumvent the snag. In my case, though, my texts technically do not have TWO COLUMNS per page: the Greek text is entirely on one page and the English translation on the following page, synchronized and aligned with the Greek text paragraph by paragraph. I don't know if ABBYY reads this layout as TWO-COLUMN pagination, thus adding to the problems.

Thank you again

Ittiandro