11-18-2018, 09:59 AM   #1
gg4u
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
Pandoc and Tesseract to keep images and TOC

I managed to convert a PDF book to text in either of two ways:

1.
- use Ghostscript to convert the PDF to TIFF
- use Tesseract to OCR the TIFF to plain text
- use pandoc to convert the text to EPUB


gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit

tesseract mybook.tif mybook -l eng


or:

2.
- use k2pdfopt to convert the PDF into a PDF formatted for e-readers

options:

-ocr t -ocrhmax 1.5 -ocrvis st



Approach 1 recognized the text better, but lost the images.
Approach 2 kept the images, more or less, but the characters are rendered somewhat noisily.

With neither approach do I get a TOC, i.e. a hyperlinked index of the book.
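One possible direction for the TOC, sketched under assumptions: pandoc builds the EPUB navigation from the headings it finds in its input, so marking chapter titles as Markdown headings in the OCR text before conversion should give a linked TOC. The sample text, the file names and the "Chapter N" pattern below are hypothetical; the pattern would have to be adapted to how the chapter titles actually come out of the OCR.

```shell
# Hypothetical sample standing in for the OCR output (mybook.txt).
workdir=$(mktemp -d)
printf '%s\n' 'Chapter 1' 'Text of chapter one.' \
              'Chapter 2' 'Text of chapter two.' > "$workdir/mybook.txt"

# Turn every line that is exactly "Chapter N" into a level-1 Markdown heading.
# Blank lines are printed around it because pandoc's Markdown reader expects
# a blank line before a heading.
awk '/^Chapter [0-9]+$/ { print ""; print "# " $0; print ""; next }
     { print }' "$workdir/mybook.txt" > "$workdir/mybook.md"

# Convert to EPUB only if pandoc is installed; --toc additionally puts a
# table of contents page into the book body. The title is a placeholder.
if command -v pandoc >/dev/null 2>&1; then
  pandoc "$workdir/mybook.md" -o "$workdir/mybook.epub" \
    --toc --metadata title="My Book"
fi
```

The resulting EPUB can then be opened in calibre to check and hand-edit the TOC if the pattern missed some titles.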

If possible, I would also like to remove the running page headers, such as the word "introduction" printed at the top of every page of the chapter "introduction" in the paper book.


I would like to reflow the PDF to EPUB and:
- KEEP IMAGES
- be able to edit the TOC to create an index
- possibly remove the words in the page headers (could be done with a regex if need be, or manually)
- reflow the text to EPUB
- finally use calibre to convert the EPUB for Kindle / other e-readers
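For the header removal, a minimal sketch of the regex idea, assuming the running header comes out of the OCR as a line of its own containing only the word "introduction". The sample text and file names are hypothetical. Lines that merely mention the word inside a sentence are left alone.

```shell
# Hypothetical sample standing in for the OCR output (mybook.txt).
workdir=$(mktemp -d)
printf '%s\n' 'INTRODUCTION' \
              'First page text.' \
              'Introduction  ' \
              'Second page text mentions the introduction.' > "$workdir/mybook.txt"

# Drop every line that consists only of the header word:
#   -v  keep the lines that do NOT match
#   -i  ignore case
#   -E  extended regular expressions
# Leading and trailing whitespace around the word is tolerated.
grep -viE '^[[:space:]]*introduction[[:space:]]*$' \
  "$workdir/mybook.txt" > "$workdir/mybook_clean.txt"
```

Note that this also deletes a chapter title that is spelled exactly like the header, so the title may need to be put back by hand afterwards.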

Could you advise what I am missing?

How could I complete or modify the two approaches to get the desired result?