Pandoc and Tesseract to keep images and TOC
I managed to convert a PDF book to text in two ways:
1.
- use Ghostscript to convert the PDF to TIFF
- use Tesseract to OCR the TIFF to TXT
- use pandoc to convert the TXT to EPUB
gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit
tesseract mybook.tif mybook -l eng
or:
2.
- use k2pdfopt to convert the PDF into a PDF formatted for e-readers
options:
-ocr t -ocrhmax 1.5 -ocrvis st
Approach 1 recognized the text better, but lost the images.
Approach 2 kept the images to some extent, but the characters are rendered a bit noisily.
With neither approach can I get a TOC, that is, a hyperlinked index of the book.
If possible, I would also like to remove the running page headers, such as the word "Introduction" printed at the top of every page of the "Introduction" chapter in the paper book.
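For the running headers, here is a minimal sketch of the regex idea. It assumes the OCR output puts each header on its own line (optionally with a page number next to it) and that the real chapter title has already been marked, here with a leading `# `, so it is not deleted. The function name `strip_running_header` is hypothetical, and the header word is assumed to contain no regex special characters.

```shell
#!/bin/sh
# strip_running_header WORD FILE
# Prints FILE without the lines that consist only of WORD (case-insensitive),
# optionally preceded or followed by a page number.
# Lines where WORD appears inside a sentence, or after a "# " heading marker,
# are left untouched.
strip_running_header() {
    word=$1
    file=$2
    grep -v -i -E "^[[:space:]]*([0-9]+[[:space:]]+)?${word}([[:space:]]+[0-9]+)?[[:space:]]*\$" "$file"
}
```

It could be used as `strip_running_header introduction mybook.txt > mybook.clean.txt`, once per chapter title, checking the result by eye since OCR noise in a header will make the pattern miss it.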
I would like to reflow the PDF to EPUB so that I can:
- KEEP IMAGES
- edit the TOC to create an index
- possibly remove the words in the page headers (with a regex, or manually if needed)
- reflow the text to EPUB
- finally use Calibre to convert the EPUB for Kindle / e-readers
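The TOC, reflow and Calibre steps from this list might be sketched as follows. Pandoc builds the EPUB navigation from headings, so the sketch assumes chapter titles sit on their own line in the form `Chapter N ...`. That pattern, the function name `mark_chapters`, the title "My Book" and the file names are assumptions for illustration. It does not address keeping the images. Pandoc and Calibre's `ebook-convert` are only called if they are installed and `mybook.txt` exists.

```shell
#!/bin/sh
# mark_chapters FILE
# Prefixes lines such as "Chapter 3 Results" with "# " so that pandoc
# treats them as level-1 headings when reading the text as Markdown.
mark_chapters() {
    sed -E 's/^[[:space:]]*(Chapter[[:space:]]+[0-9]+.*)$/# \1/' "$1"
}

# Build the EPUB only when the OCR output and pandoc are available.
if [ -f mybook.txt ] && command -v pandoc >/dev/null 2>&1; then
    mark_chapters mybook.txt > mybook.md
    # --toc adds a table of contents generated from the headings.
    pandoc mybook.md -f markdown -o mybook.epub --toc --metadata title="My Book"
    # Optional: convert the EPUB to a Kindle format with Calibre.
    if command -v ebook-convert >/dev/null 2>&1; then
        ebook-convert mybook.epub mybook.azw3
    fi
fi
```

Since the OCR text is read as Markdown, stray characters such as `*` or `_` may be interpreted as markup, so the intermediate `mybook.md` is worth reviewing by hand before conversion.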
Could you advise what I am missing?
How could I complete or modify the two approaches to get the desired result?