MobileRead Forums - View Single Post - Pandoc and Tesseract to keep images and TOC

j.p.s · 11-23-2018, 06:00 PM

Quote:

Originally Posted by gg4u

Hi Jps,

Whose option is asciidoc? tesseract? pandoc?

Sorry for not being clear and not having time to elaborate until now.

asciidoc is a standalone Python script that converts a very lightly marked-up plain text file straight to HTML, EPUB, or PDF, with a single command for each.

Basically, you put an "=" character at the front of the line with the title, "==" in front of each chapter heading, "===" in front of section titles, etc. Links, references, index, embedding and linking to images are all easy. Table of Contents, if desired, is automatically generated.
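To make the heading markup concrete, here is a minimal Python sketch that writes out the kind of file described above. The function name to_asciidoc and the sample titles are made up for illustration, and the ":toc:" header attribute is my assumption about how to ask for the automatic Table of Contents; check the quick reference for your tool.

```python
def to_asciidoc(title, chapters):
    """Build AsciiDoc text from a title and a list of chapters.

    chapters is a list of (chapter_heading, sections) pairs, and
    sections is a list of (section_title, body_text) pairs.
    """
    # "=" marks the document title; ":toc:" (an assumption here)
    # requests an automatically generated table of contents.
    lines = [f"= {title}", ":toc:", ""]
    for heading, sections in chapters:
        # "==" marks a chapter heading.
        lines += [f"== {heading}", ""]
        for section_title, body in sections:
            # "===" marks a section title, followed by its text.
            lines += [f"=== {section_title}", "", body, ""]
    return "\n".join(lines)


# Hypothetical sample content, just to show the resulting markup.
sample = to_asciidoc(
    "My Book",
    [("Chapter One", [("First Section", "Some plain text from the OCR step.")])],
)
print(sample)
```

The printed text can be saved to a file and handed to asciidoc or asciidoctor as is.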

The rationale for asciidoc is at: https://asciidoctor.org/docs/what-is-asciidoc/

A reference for asciidoc markup is at: https://asciidoctor.org/docs/asciido...ick-reference/

I think the above is also suitable as a tutorial, but I have also just found http://www.vogella.com/tutorials/AsciiDoc/article.html which I think is relatively new; I had not seen it before.

asciidoc writer's guide: https://asciidoctor.org/docs/asciidoc-writers-guide/

(asciidoctor is a Ruby utility that converts asciidoc markup. I use whichever I prefer at the moment and sometimes switch back and forth. asciidoctor has pretty much taken over stewardship of the asciidoc syntax.)

If you have a PDF with a text layer, extract that without using OCR. If there is no text layer, then you just need OCR to get plain text. Formatting would just get in the way.
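As a rough sketch of the "text layer first" check, something like the following could work. It assumes the pdftotext command from Poppler is installed, which is my choice of tool and not something named above; the function names are hypothetical too.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path):
    """Return the argument list for a plain pdftotext run (no layout options)."""
    return ["pdftotext", pdf_path, txt_path]


def extract_text_layer(pdf_path, txt_path):
    """Extract the PDF's text layer to txt_path.

    Returns the extracted text, or None if the output is empty,
    which suggests there is no text layer and OCR is needed.
    """
    # Assumes the pdftotext executable is on the PATH.
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    return text if text.strip() else None
```

If extract_text_layer returns None, that is the case where you fall back to OCR for the plain text.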