MobileRead Forums - View Single Post - Pandoc and Tesseract to keep images and TOC

gg4u · 11-24-2018, 08:33 AM

Thank you, j.p.s, your reference will be useful.

I still miss a step.

I am *creating* a document I want to convert to an epub.

I want to *convert* a pdf to an epub, and have as a final result:

- text that is rendered sharp (no OCR layer) and can be selected, highlighted and zoomed in the e-reader
- images such as photos, graphs and tables

I am doing the following:
1. I process a pdf of scanned images with ghostscript and convert it to tiff

gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit

2. Apply tesseract to obtain txt
tesseract mybook.tif mybook -l eng

Or

Apply tesseract to obtain searchable pdf
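
A minimal sketch of that second variant, assuming the same file names as in the commands above. The trailing `pdf` selects tesseract's PDF output config, and tesseract appends `.pdf` to the output base name. The `ocr_to_pdf` wrapper is only a hypothetical helper for illustration; it prints the command instead of running it when tesseract or the input file is missing.

```shell
# Sketch only: ocr_to_pdf is a hypothetical helper, not part of tesseract.
ocr_to_pdf() {
    # $1 = input image, $2 = output base name (tesseract appends .pdf)
    if command -v tesseract >/dev/null 2>&1 && [ -f "$1" ]; then
        tesseract "$1" "$2" -l eng pdf
    else
        # Dry run: show the command when tesseract or the input is missing.
        echo "tesseract $1 $2 -l eng pdf"
    fi
}

ocr_to_pdf mybook.tif mybook
```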

Pros and Cons
With a txt I get the desired result for the text and can use asciidoctor for markup,
but I am missing the extracted images.

With a searchable pdf, I see it contains the desired images, but the text is only an OCR layer on top of the page images, which still show the scanned text in the background - I can't edit it to apply markup, the text won't support those e-reader features, and it is not sharp (it looks like a rendered image).

I looked at the documentation of asciidoctor and have not found a way to process a searchable pdf *to* an epub - while it is clear I can process a txt to an epub.

Manually creating and referencing images in a txt and then applying Asciidoctor is less than ideal - I miss a step.

Can I use the tools you suggest to *extract* both the text AND an image catalog from the original file (tiff or searchable pdf)?

I would need the images (photos, graphs, and tables) to be referenced in the output text, as per:

https://asciidoctor.org/docs/asciido...ng-with-images
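
To make the target concrete, here is a rough sketch of the kind of references meant, written as AsciiDoc block image macros (`image::path[]`). The `image_refs` helper, the `images/` directory and the file names are all hypothetical; how the images get extracted in the first place is exactly the open question.

```shell
# Sketch only: image_refs, images/ and the file names are hypothetical.
# Print one AsciiDoc block image macro per image path given as an argument.
image_refs() {
    for img in "$@"; do
        printf 'image::%s[]\n\n' "$img"
    done
}

image_refs images/page012-fig1.png images/page030-table2.png
```

Each printed line would be pasted (or appended) into the txt at the place where the image belongs, with a blank line around it so asciidoctor treats it as a block.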

If asciidoc or asciidoctor won't extract images, which steps would you suggest to obtain the final result - a txt file with references to the extracted images, which I could then finalise with asciidoctor?
