Thank you, j.p.s, your reference will be useful.
I am still missing a step.
I am *creating* a document I want to convert to an epub.
I want to *convert* a pdf to an epub, and have as a final result:
- text that is rendered sharp (no OCR layer) and can be selected, highlighted and zoomed in the e-reader
- images such as photos, graphs and tables
I am doing the following:
1. I process a pdf of scanned images with ghostscript and convert it to tiff
gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit
2. Apply tesseract to obtain txt
tesseract mybook.tif mybook -l eng
Or
Apply tesseract to obtain searchable pdf
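To make the two steps easier to reproduce, here is a minimal sketch of them as shell functions. The function names and the DRY_RUN switch are only my own illustration; the sketch assumes gs and tesseract are on the PATH, that tesseract takes its arguments in the order input, output base, options, and that a trailing `pdf` selects searchable pdf output.

```shell
#!/bin/sh
# Sketch of steps 1 and 2. Set DRY_RUN=1 to print the commands
# instead of running them (hypothetical helper, not part of either tool).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: rasterise the scanned pdf ($1) to a 600 dpi G4 tiff ($2)
pdf_to_tiff() {
    run gs -q -r600x600 -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile="$2" "$1"
}

# Step 2, variant a: OCR the tiff ($1) to plain text, written to $2.txt
ocr_to_txt() {
    run tesseract "$1" "$2" -l eng
}

# Step 2, variant b: OCR the tiff ($1) to a searchable pdf, written to $2.pdf
ocr_to_pdf() {
    run tesseract "$1" "$2" -l eng pdf
}
```

Typical use would be `pdf_to_tiff sourcePDF.pdf mybook.tif` followed by either `ocr_to_txt mybook.tif mybook` or `ocr_to_pdf mybook.tif mybook`.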
Pros and Cons
With a txt I get the desired result for the text and can use asciidoctor for markup,
but the extracted images are missing.
With a searchable pdf, I see it contains the desired images, but the text is only an OCR layer, with the scanned page images containing the text in the background - I can't edit it to apply markup, the text won't support the e-reader features, and it is not sharp (it looks like a rendered image).
I looked at the documentation of asciidoctor and have not found a way to process a searchable pdf *to* an epub - while it is clear I can process a txt to an epub.
It is less than desirable to manually create and reference images in a txt and then apply Asciidoctor - this is the step I am missing.
Can I use the tools you suggest to *extract* both the text AND an image catalog from the original file (tiff or searchable pdf)?
I would need the images (photos, graphs and tables) to be referenced in the output text, as per:
https://asciidoctor.org/docs/asciido...ng-with-images
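To show what I mean by a reference, here is a small sketch that prints an AsciiDoc block image macro for an image file. The function name and the file name are hypothetical, and it assumes the images have already been extracted to files somehow, which is exactly the part I am missing.

```shell
#!/bin/sh
# Print an AsciiDoc block image macro for one image file.
# $1 = path to the image, $2 = alt text (both supplied by the caller).
image_ref() {
    printf 'image::%s[%s]\n' "$1" "$2"
}
```

For example, `image_ref images/fig1.png "Figure 1"` prints `image::images/fig1.png[Figure 1]`, which is the kind of line I would like to end up with in the txt, at the place where the image appeared in the original.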
If asciidoc or asciidoctor won't extract images, which steps would you suggest to obtain the final result - a txt file with references to the extracted images, which I could then finalise with asciidoctor?