MobileRead Forums - View Single Post - Pandoc and Tesseract to keep images and TOC

gg4u · 11-18-2018, 09:59 AM

I managed to convert a PDF book to text in two ways:

1.
- use Ghostscript to convert the PDF to TIFF
- use Tesseract to OCR the TIFF into plain text
- use pandoc to convert the text to EPUB

gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit

tesseract mybook.tif mybook -l eng
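
The pandoc step of approach 1 is listed but not written out as a command. A minimal sketch of what it might look like follows. The function name `txt_to_epub`, the `PANDOC` variable, the file names and the book title are all placeholders made up for this example. Only `-o`, `--toc` and `--metadata` are actual pandoc options.

```shell
# Sketch only: OCR text -> EPUB with pandoc.
# PANDOC and txt_to_epub are hypothetical names for this example.
PANDOC=${PANDOC:-pandoc}

# Usage: txt_to_epub INPUT.txt OUTPUT.epub "Book title"
txt_to_epub() {
    # pandoc reads a .txt file as Markdown by default.
    # --toc adds a table of contents built from the headings it finds,
    # so plain OCR text without headings gives an empty one.
    "$PANDOC" "$1" -o "$2" --toc --metadata "title=$3"
}

# Example call (file names are placeholders):
# txt_to_epub mybook.txt mybook.epub "My Book"
```

Because the table of contents comes from headings, the OCR text would probably need its chapter titles marked up before this step is useful.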

or:

2.
- use k2pdfopt to convert the PDF into a PDF formatted for e-readers

options:

-ocr t -ocrhmax 1.5 -ocrvis st

Approach 1 recognized the text better, but lost the images.
Approach 2 kept the images to some extent, but the characters are rendered a bit noisy.

With neither approach can I get a TOC, i.e. a hyperlinked index of the book.
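
Since pandoc builds its EPUB table of contents from headings, one possible way to get a TOC out of approach 1 is to turn chapter-title lines in the OCR text into Markdown headings before running pandoc. The sketch below assumes chapter titles start with the word "Chapter" followed by a number, which may not hold for every book. The function name `mark_chapter_headings` is made up for this example.

```shell
# Sketch only: mark lines such as "Chapter 3 Some Title" as Markdown headings.
# Assumption: chapter titles start with "Chapter <number>".
mark_chapter_headings() {
    awk '
        /^[ \t\f]*Chapter [0-9]+/ {
            # Drop leading blanks or a page-break character left by OCR.
            sub(/^[ \t\f]+/, "")
            # Blank lines around the heading so Markdown treats it as one.
            print ""
            print "# " $0
            print ""
            next
        }
        { print }
    '
}

# Example call (file names are placeholders):
# mark_chapter_headings < mybook.txt > mybook.md
```

The pattern would have to be adapted by hand for books whose chapter titles look different, and the result checked in the text file before conversion.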

If possible, I would also like to remove the running headers, such as the word "Introduction" printed at the top of every page of the chapter "Introduction" in the paper book.
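
A rough sketch of how such a header cleanup could be done with a regex on the OCR text: drop every line that consists only of the header word, optionally with a page number before or after it. The function name `strip_running_header` is made up for this example, and it assumes the header word contains no regex special characters.

```shell
# Sketch only: remove running-header lines from OCR text.
# Usage: strip_running_header WORD < input.txt > output.txt
# Assumption: WORD has no regex metacharacters.
strip_running_header() {
    header=$1
    # -v drops matching lines, -i ignores case, -x matches whole lines only,
    # so the word inside a normal sentence is left alone.
    # "|| true" avoids a failure status when every line was removed.
    grep -v -i -x -E \
        "[[:space:]]*([0-9]+[[:space:]]+)?${header}([[:space:]]+[0-9]+)?[[:space:]]*" \
        || true
}

# Example call (file names are placeholders):
# strip_running_header introduction < mybook.txt > mybook_clean.txt
```

Note that this would also delete the real chapter title if it sits alone on a line, so that one line would need to be restored manually.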

I would like to reflow the PDF to EPUB and:
- KEEP IMAGES
- be able to edit the TOC to create an index
- possibly remove the words in the page headers (with a regex, or manually if need be)
- reflow the text to EPUB
- finally use calibre to convert the EPUB for Kindle / e-readers

Could you advise what I am missing?

How could I complete or change the two approaches to get the desired result?
