11-24-2018, 07:33 AM   #5
gg4u
Junior Member
gg4u understands when you whisper 'The dog barks at midnight.'
 
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
Thank you, j.p.s, your reference will be useful.

I am still missing a step.

I am *creating* a document that I want to convert to an epub.

I want to *convert* a pdf to an epub, and have as a final result:

- text that is rendered sharp (no OCR layer) and can be selected, highlighted and zoomed in the e-reader
- images such as photos, graphs and tables

I am doing the following:
1. I process a pdf of scanned images with ghostscript and convert it to tiff

gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit

2. Apply tesseract to obtain txt
tesseract mybook.tif mybook -l eng

Or

Apply tesseract to obtain searchable pdf
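
For reference, this is roughly how I run the searchable pdf variant. As far as I understand, the trailing `pdf` is the name of a tesseract config that switches the output from txt to pdf. The function name `ocr_to_pdf` is only something I made up for illustration.

```shell
# Hypothetical helper (the name is made up): OCR a TIFF into a searchable PDF.
# $1 = input image, $2 = output base name (tesseract appends .pdf itself)
ocr_to_pdf() {
    tesseract "$1" "$2" -l eng pdf
}

# Example (requires tesseract to be installed):
# ocr_to_pdf mybook.tif mybook
```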

Pros and Cons
With a txt I get the desired result for the text and I can use asciidoctor for markup,
but the extracted images are missing.

With a searchable pdf, I see it contains the desired images, but the text is only an OCR layer on top of background images that contain the text. I can't edit it to apply markup, the text won't support the e-reader features, and it is not sharp (it looks like a rendered image).

I looked at the documentation of asciidoctor and have not found a way to process a searchable pdf *to* an epub, while it is clear that I can process a txt to an epub.

Manually creating and referencing images in a txt, and then applying Asciidoctor, is less than ideal - I am missing a step.

Can I use the tools you suggest to *extract* the text AND an image catalog from the original file (tiff or searchable pdf)?

I would need the images (photos, graphs, and tables) to be referenced in the output text, as per:

https://asciidoctor.org/docs/asciido...ng-with-images

If asciidoc or asciidoctor won't extract images, which steps would you suggest to obtain the final result - a txt file with references to extracted images, which I could then finalise with asciidoctor?
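
To make clear what I am after: assuming the images had already been extracted into a folder as png files (that is the step I am missing), something like the sketch below would write the references using the AsciiDoc block image macro. The function name and the folder layout are hypothetical, and it only lists the images one after another; it does not place each one at the right position in the text, which is what I would actually need.

```shell
# Hypothetical sketch: print one AsciiDoc block image macro per extracted image.
# $1 = directory holding the extracted .png files (an assumed layout)
# Output goes to stdout so it can be appended to the OCR text.
emit_image_refs() {
    for img in "$1"/*.png; do
        [ -e "$img" ] || continue            # skip if the glob matched nothing
        name=$(basename "$img" .png)         # use the file name as alt text
        printf 'image::%s[%s]\n\n' "$img" "$name"
    done
}

# Example: emit_image_refs images >> mybook.txt
```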