Old 04-28-2018, 10:17 AM   #1
foice
Enthusiast
Posts: 26
Karma: 1524
Join Date: Apr 2018
Device: Android reader
PDF (with OCR) to ePub, is it possible to make a real ePub?

This may be close to the questions about "replicas" from a few days back, but I would like to check that my attempt is at least on the right track.

I have a PDF, which I have run through OCR using Adobe Acrobat, and I would like to convert it into an ePub that fits well on my Android phone.

I could convert it to ePub with calibre, but the result was simply images of the pages wrapped in an ePub, which is not what I wanted. I am looking for an ePub file where each word is an entity and the whole page is composed according to the font size and the screen size; that is what eBooks are for.

Is this achievable when converting a PDF (with OCR) into ePub?

If I understand its purpose and the output it gave me, I think k2pdfopt is not what I want, as it still makes PDFs, just ones more suitable for small screens.

Thanks for your help in putting me on track.
Roberto
Old 04-28-2018, 11:47 AM   #2
dwig
Wizard
Posts: 1,613
Karma: 6718541
Join Date: Dec 2004
Location: Paradise (Key West, FL)
Device: Current:Surface Go & Kindle 3 - Retired: DellV8p, Clie UX50, ...
Yes, it is possible, but no OCR available today is good enough for any conversion to automagically create an error-free ePub. Generally, a significant amount of manual editing will be necessary.

Read the stickies at the top of the Conversion sub-forum, particularly the one specific to PDF: https://www.mobileread.com/forums/sh...d.php?t=118605
Old 04-28-2018, 01:52 PM   #3
foice
Enthusiast
Thanks for this input. The post you mentioned, however, seems to concentrate on minor corrections to the output; improvements, I'd say.

What I am getting is qualitatively different from what I expected. I wanted the PDF to be transformed into text, text that I thought would be nicely arranged in an ePub. None of that happened; I just got images as output. Is this something I can do anything about in calibre, or does the note in that post apply: "Note if your PDF looks like complete garbage after conversion (i.e. nothing at all like the original text) there is nothing to be done about it using Calibre."

Thanks
Roberto
Old 04-28-2018, 06:40 PM   #4
BetterRed
null operator (he/him)
Posts: 21,731
Karma: 29711016
Join Date: Mar 2012
Location: Sydney Australia
Device: none
@foice - you need to process the PDF through an OCR engine. ABBYY FineReader is highly regarded, but it costs money. Some of those tools will produce a file that can be read with MS Word; from there you could use Toxaris's eBook Tools MS Word add-in, which has a number of features specifically targeted at transforming OCR'd text to ePub.

BR
Old 04-28-2018, 07:04 PM   #5
foice
Enthusiast
Hello, my PDF has OCR already. I did it with Adobe Acrobat; is this sufficient? Should I export it to some non-PDF format? Is that what you are really suggesting?
Old 04-28-2018, 09:24 PM   #6
BetterRed
null operator (he/him)
@foice - you need to get the text into an editable format (such as txt, html, docx etc). I haven't used Acrobat in decades, so I'm not sure what it can do.

I just remembered this: if you have a recent edition of MS Word (2013 or 2016, I think) you can open a PDF document and it'll convert it to DOCX. I've used it a few times to good effect. The attachment shows the PDF (in PDFXchange) and the DOCX (in Word 2016) side by side.

BR
Attached Thumbnails: 2.JPG
Old 04-28-2018, 09:26 PM   #7
kovidgoyal
creator of calibre
Posts: 45,359
Karma: 27182818
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
OCRing your PDF has presumably created a PDF with page images and a text layer underneath to facilitate searching. Converting such a PDF will not give you a text-based ebook. For that, you should get your OCR software to generate an actual text-based document: either a PDF with no page images, or a txt file or similar.
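
Once the OCR software has exported real text, the conversion itself can be scripted. The following is only a minimal sketch, assuming calibre's `ebook-convert` command-line tool is installed and on the PATH; the file names `book.txt` and `book.epub`, the title, and the helper function names are hypothetical and not from this thread. `--enable-heuristics` is optional, and whether it helps depends on the OCR output.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target, title=None, heuristics=True):
    """Return the argument list for calibre's ebook-convert CLI."""
    command = ["ebook-convert", str(source), str(target)]
    if title:
        command += ["--title", title]
    if heuristics:
        # Heuristic processing tries to repair line breaks and paragraphs;
        # treat it as optional and compare the result with and without it.
        command.append("--enable-heuristics")
    return command


def convert(source, target, title=None, heuristics=True):
    """Run the conversion; return False if ebook-convert is missing or fails."""
    if shutil.which("ebook-convert") is None:
        return False  # calibre's command-line tools are not installed
    result = subprocess.run(build_convert_command(source, target, title, heuristics))
    return result.returncode == 0


if __name__ == "__main__":
    # "book.txt" is a hypothetical plain-text export from the OCR software.
    print(build_convert_command(Path("book.txt"), Path("book.epub"), "My Book"))
```

Keeping the command construction separate from the call makes it easy to inspect exactly what will be run before pointing it at a real file.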
Old 04-29-2018, 10:54 AM   #8
foice
Enthusiast
Thanks a lot for explaining this subtlety. I have made a Word file, which looks pretty reasonable in some parts but is still full of spelling mistakes and misrecognitions (ligatures and beyond).

I read about a tool by Toxaris for Word that cleans up the text a bit, but it is for Windows only. Is there anything for Mac, by chance?
Old 04-29-2018, 09:27 PM   #9
BetterRed
null operator (he/him)
Quote:
Originally Posted by foice
Thanks a lot for explaining this subtlety. I have made a Word file, which looks pretty reasonable in some parts but is still full of spelling mistakes and misrecognitions (ligatures and beyond).

I read about a tool by Toxaris for Word that cleans up the text a bit, but it is for Windows only. Is there anything for Mac, by chance?
Not that I know of, but my knowledge of Macs and 3rd party Mac software is limited.

If you have a DOCX, then after doing what you can in Word (assuming you know how to use it), you could import it into the calibre editor and use its Spellchecker and Search and Replace features to fix errors. If you need help doing that, create a new thread in the Editor subforum.

Or you could use the Sigil ePub editor; it has a DOCX import plugin, and its ePubTidy plugin can be used to fix common OCR problems. I've used the DOCXImport plugin to good effect, but I've not used ePubTidy. Sigil has a Spellchecker and Search and Replace that are very similar to those in the calibre editor.
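
For a platform-independent option, some of the mechanical fixes (ligature characters, words split by a hyphen at a line end) can be scripted on plain text before importing it into an editor. This is a rough sketch, not what ePubTidy or the Toxaris add-in actually do; the function name `clean_ocr_text` and the choice of rules are assumptions made for illustration, and the output still needs a spellcheck pass.

```python
import re

# Unicode ligature characters that OCR sometimes emits, mapped to plain letters.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
})


def clean_ocr_text(text):
    """Apply a few mechanical fixes to OCR output; returns the cleaned text."""
    # Replace single-character ligatures with their letter sequences.
    text = text.translate(LIGATURES)
    # Rejoin words split by a hyphen at a line end. This also joins genuine
    # hyphenated compounds that happened to break there, so review the result.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text
```

The same patterns can be typed into the regex mode of Search and Replace in either editor if running a script is not convenient.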

BR

Last edited by BetterRed; 05-01-2018 at 07:19 PM. Reason: grammar
Old 05-01-2018, 06:34 AM   #10
foice
Enthusiast
Thanks for these tips! I will try!