Pandoc and Tesseract to keep images and TOC

gg4u · 11-18-2018, 09:59 AM

I have managed to convert a PDF book to text by:

1.
- using ghostscript to convert the PDF to TIFF
- using tesseract to OCR the TIFF to txt
- using pandoc to convert the txt to EPUB

gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit

tesseract mybook.tif mybook -l eng

or:

2.
- use k2pdfopt to transform pdf to pdf formatted for e-reader

options:

-ocr t -ocrhmax 1.5 -ocrvis st

Version 1 recognised the text better, but lost the images.
Version 2 kept the images, more or less, but the characters are rendered a bit noisy.

With neither approach do I get a TOC, i.e. a hyperlinked index of the book.

If possible, I would also like to remove the running headers, such as the word "introduction" printed at the top of every page of the chapter "Introduction" in the paper book.

I would like to reflow the PDF to EPUB and:
- KEEP IMAGES
- be able to edit the TOC to create an index
- possibly remove the words in the page headers (with a regex, or manually if need be)
- reflow the text to EPUB
- finally use calibre to convert the EPUB for Kindle / e-readers

Could you advise what I am missing?

How could I complete or change the two approaches to get the desired result?
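The header clean-up mentioned in the list above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the function name and the pattern are made up, and it assumes the OCR text file separates pages with a form feed character (which, as far as I know, is tesseract's default for plain-text output).

```python
import re


def strip_running_headers(ocr_text, header_pattern, max_lines=3):
    """Drop lines matching header_pattern from the top of each page.

    Assumption: pages are separated by form feed characters.
    Only the first `max_lines` non-empty lines of each page are checked,
    so the same word appearing later in the body text is left alone.
    """
    header_re = re.compile(header_pattern, re.IGNORECASE)
    cleaned_pages = []
    for page in ocr_text.split("\f"):
        kept = []
        seen_nonblank = 0
        for line in page.split("\n"):
            if line.strip():
                seen_nonblank += 1
                if seen_nonblank <= max_lines and header_re.fullmatch(line.strip()):
                    continue  # looks like a running header: skip it
            kept.append(line)
        cleaned_pages.append("\n".join(kept))
    return "\f".join(cleaned_pages)


# Hypothetical sample: two OCR'd pages with a running header each
sample = (
    "INTRODUCTION\nFirst page body.\n\f"
    "Introduction 12\nSecond page body with introduction inside.\n"
)
cleaned = strip_running_headers(sample, r"introduction(\s+\d+)?")
```

Note that the real chapter title on the first page of the chapter would match too, so it has to be put back by hand (or marked up before running this).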

j.p.s · 11-18-2018, 03:30 PM

Have you tried using asciidoc (or the compatible asciidoctor) to convert the text into epub?

You have to apply some very light markup to the text to designate chapters. In return you automatically get a hyperlinked table of contents.

gg4u · 11-19-2018, 09:40 AM

Hi Jps,

Quote:

Have you tried using asciidoc (or the compatible asciidoctor) to convert the text into epub?

Which tool is asciidoc an option of? tesseract? pandoc?

Quote:

You have to apply some very light markup to the text to designate chapters. In return you automatically get a hyperlinked table of contents.

So I should do it manually on the .txt file after tesseract, right? Or can tesseract or pandoc guess the right markup from the white space between paragraphs?

Could you tell me what the right markup is?

I would like to KEEP the IMAGES from ghostscript, like k2pdfopt attempts to do.

Does tesseract allow keeping images (or have an option to detect images and skip them during OCR)?

j.p.s · 11-23-2018, 06:00 PM

Quote:

Originally Posted by gg4u

Hi Jps,

Which tool is asciidoc an option of? tesseract? pandoc?

Sorry for not being clear and not having time to elaborate until now.

asciidoc is a standalone python script that converts a very lightly marked up plain text file straight to either HTML, EPUB, or PDF with a single command each.

Basically, you put an "=" character at the front of the line with the title, "==" in front of each chapter heading, "===" in front of section titles, etc. Links, references, index, embedding and linking to images are all easy. Table of Contents, if desired, is automatically generated.
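As a rough illustration of the markup described above, here is a small sketch that turns plain OCR text into AsciiDoc source by adding a "=" title line, the `:toc:` attribute and "==" chapter headings. The function name and the idea of passing in a known list of chapter titles are assumptions for the example; in practice the headings would probably be marked by hand or with a smarter heuristic.

```python
def mark_headings(text, book_title, chapter_titles):
    """Return AsciiDoc source with a title, a TOC attribute and '==' chapters.

    Assumption: each chapter title sits alone on its own line in the text.
    """
    wanted = {title.strip().lower() for title in chapter_titles}
    out = [f"= {book_title}", ":toc:", ""]
    for line in text.splitlines():
        if line.strip().lower() in wanted:
            # A section title needs a blank line before it
            if out[-1] != "":
                out.append("")
            out.append(f"== {line.strip()}")
            out.append("")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical sample text and chapter list
plain = "Introduction\nSome text.\nChapter One\nMore text."
adoc = mark_headings(plain, "My Book", ["Introduction", "Chapter One"])
```

The resulting text can be saved as a file and given to asciidoc or asciidoctor, which build the table of contents from the marked headings.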

The rationale for asciidoc is at: https://asciidoctor.org/docs/what-is-asciidoc/

A reference for asciidoc markup is at: https://asciidoctor.org/docs/asciido...ick-reference/

I think the above is also suitable as a tutorial, but I have also just found http://www.vogella.com/tutorials/AsciiDoc/article.html which I think is relatively new; I had not seen it before.

asciidoc writer's guide: https://asciidoctor.org/docs/asciidoc-writers-guide/

(asciidoctor is a Ruby utility that converts asciidoc markup. I use whichever I prefer at the moment and sometimes switch back and forth. asciidoctor has pretty much taken over stewardship of asciidoc syntax.)

If you have a PDF with a text layer, extract that without using OCR. If there is no text layer, then you just need OCR to get plain text. Formatting would just get in the way.
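One way to decide between the two cases above is to run a text extractor first and look at how much text comes back. The sketch below is only a heuristic with made-up names and an arbitrary threshold; it assumes the extracted text uses form feeds as page separators, which is what pdftotext does by default as far as I know.

```python
def needs_ocr(extracted_text, min_chars_per_page=50):
    """Guess whether a PDF lacks a usable text layer.

    `extracted_text` is whatever a text extractor returned for the PDF.
    Assumption: pages are separated by form feed characters.
    """
    pages = extracted_text.split("\f")
    # A trailing form feed leaves an empty last element; drop it
    if pages and pages[-1].strip() == "":
        pages = pages[:-1]
    if not pages:
        return True
    # Count non-whitespace characters only
    total = sum(len("".join(page.split())) for page in pages)
    return total / len(pages) < min_chars_per_page


scanned_like = "\f\f\f"                     # three pages, no text at all
text_like = ("word " * 100 + "\f") * 2      # two pages full of text
```

A scanned book typically comes back empty or nearly so, in which case OCR is the only route to plain text.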

gg4u · 11-24-2018, 07:33 AM

Thank you, j.p.s, your reference will be useful.

I am still missing a step.

I am not *creating* a document that I want to turn into an EPUB.

I want to *convert* a PDF to an EPUB, and have as the final result:

- text that is rendered sharp (no OCR layer) and that can be selected, highlighted and zoomed in the e-reader
- images such as photos, graphs and tables

I am doing the following:
1. I process a pdf of scanned images with ghostscript and convert it to tiff

gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit

2. Apply tesseract to obtain txt
tesseract mybook.tif mybook -l eng

Or

Apply tesseract to obtain searchable pdf

Pros and cons
With a txt file I get the desired result for the text, and I can use asciidoctor for the markup,
but I am missing the extracted images.

With a searchable PDF, I see that it contains the desired images, but the text is only an OCR layer over page images that themselves contain the text. I can't edit it to apply markup, the text won't support those features in the e-reader, and it is not sharp (it looks like a rendered image).

I looked at the asciidoctor documentation and did not find a way to process a searchable PDF *to* an EPUB, while it is clear that I can process a txt file to an EPUB.

Manually creating and referencing images in a txt file and then applying asciidoctor is less than ideal, so I am missing a step.

Can I use the tools you suggest to *extract* both the text AND an image catalog from the original file (TIFF or searchable PDF)?

I would need the images (photos, graphs and tables) to be referenced in the output text, as per:

https://asciidoctor.org/docs/asciido...ng-with-images

If asciidoc or asciidoctor won't extract images, which steps would you suggest to obtain the final result: a txt file with references to the extracted images, which I could then finalise with asciidoctor?
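The "txt file with references to extracted images" could be assembled along these lines once the images exist as separate files. This is a hedged sketch: the function name, the page-to-image mapping and the file names are hypothetical, and it assumes the OCR text separates pages with form feeds. It does not extract anything itself. A tool such as pdfimages can dump embedded images, but in a scanned book each page is usually one full-page image, so figures would still have to be cropped out separately.

```python
def add_image_refs(ocr_text, images_by_page):
    """Append AsciiDoc block image macros after each page's text.

    images_by_page maps 1-based page numbers to lists of image file names.
    Assumption: pages in ocr_text are separated by form feed characters.
    """
    parts = []
    for number, page in enumerate(ocr_text.split("\f"), start=1):
        parts.append(page.rstrip("\n"))
        for name in images_by_page.get(number, []):
            parts.append("")                    # blank line before the macro
            parts.append(f"image::{name}[]")    # AsciiDoc block image
        parts.append("")
    return "\n".join(parts).rstrip("\n") + "\n"


# Hypothetical two-page text with one figure cropped from page 2
pages_text = "Page one text.\n\fPage two text.\n"
with_images = add_image_refs(pages_text, {2: ["fig-002-000.png"]})
```

The references land at the end of the page they came from, so they still need to be moved next to the paragraph that discusses them.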

j.p.s · 11-24-2018, 09:56 AM

Hi gg4u,

In addition to marking the title and chapter heading with "=" characters, it is necessary to insert references to images yourself. asciidoc is best for quickly making a nice finished document in multiple formats starting from nothing. It was my thought that it could help with a part of your process.

Discussions of all kinds of subjects on mobileread can be very contentious, but across all the various forums on mobileread there is widespread agreement that conversion from PDF to any other format has all kinds of problems and that there is no good way to automate it.

Hitch · 11-24-2018, 11:34 AM

Quote:

Originally Posted by j.p.s

Hi gg4u,

In addition to marking the title and chapter heading with "=" characters, it is necessary to insert references to images yourself. asciidoc is best for quickly making a nice finished document in multiple formats starting from nothing. It was my thought that it could help with a part of your process.

Discussions of all kinds of subjects on mobileread can be very contentious, but across all the various forums on mobileread there is widespread agreement that conversion from PDF to any other format has all kinds of problems and that there is no good way to automate it.

(Bold emphasis added)

And that's the bottom line. There is, quite simply, NO GOOD WAY to automate conversion from PDF.

We do this professionally--let me tell you what we do, after hundreds of experiments and thousands of books:

We scan the PDF using ABBYY FineReader;
We run OCR;
We clean the resulting Word file generated by Abbyy, using the red warning indicators as a guide.
We export a PDF from the cleaned Word file, and,
We run a compare against the original PDF.
We fix any differences between the two, in the Word file.
We then do a 2nd export to PDF, and lather-rinse-repeat with the PDF compare.
When we have a "perfect" pair of PDFs, then we stop with the OCR/Scan.
We then clean the Word file as we would from a typical source Word file, which means,
We're at the same exact spot we would have been, if a client had walked in the door with a Word file to begin with.

That's what we do. We've tried EVERY possible automated process, from those suggested by others, to some we've devised and created ourselves. This is the fastest, most accurate way we've found. I wish it weren't so, but this is the bottom line.

FWIW. I know it's not what you wanted to hear, but...there it is.

Hitch

kso · 11-26-2018, 08:55 AM

Why don't you try pdftotext, part of xpdf, and a standard application on linux (and probably others). It extracts whatever text is in the pdf and writes it to a plain text file avoiding the OCR/proofreading steps. You can even specify a crop area by giving it top/left coordinate and a width and height of the crop area to work on.
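The crop area described above could be wired up like this. It is a sketch that assumes the poppler build of pdftotext, whose `-x`, `-y`, `-W` and `-H` options give the top-left corner and size of the crop box; other builds may use different options, and the helper names and numbers are made up for the example. Building the command as a list keeps it easy to inspect before running it.

```python
import subprocess


def pdftotext_crop_cmd(pdf_path, txt_path, x, y, width, height):
    """Build a pdftotext command line that extracts text from a crop box.

    Assumption: poppler's pdftotext, with -x/-y as the top-left corner
    and -W/-H as the width and height of the crop area.
    """
    return [
        "pdftotext",
        "-x", str(x), "-y", str(y),
        "-W", str(width), "-H", str(height),
        pdf_path, txt_path,
    ]


def run_crop(pdf_path, txt_path, x, y, width, height):
    """Run the command; requires pdftotext to be installed."""
    cmd = pdftotext_crop_cmd(pdf_path, txt_path, x, y, width, height)
    subprocess.run(cmd, check=True)


# Hypothetical values: skip a 60-unit strip at the top of a 612x792 page
cmd = pdftotext_crop_cmd("book.pdf", "book.txt", 0, 60, 612, 672)
```

Cropping away the top strip is one way to leave running headers out of the extracted text, provided the PDF actually has a text layer.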

klaus

Hitch · 11-26-2018, 09:23 AM

Quote:

Originally Posted by kso

Why don't you try pdftotext, part of xpdf, and a standard application on linux (and probably others). It extracts whatever text is in the pdf and writes it to a plain text file avoiding the OCR/proofreading steps. You can even specify a crop area by giving it top/left coordinate and a width and height of the crop area to work on.

klaus

For the very reasons you mentioned--it exports plain text. I suppose if we received a simple PDF that was relatively plain text, and I didn't mind investing all the time needed to then go in and recode all the text formatting, that might be a way forward. But in our experience--and we've done quite literally thousands of PDF-->ePUB jobs--it takes longer to proof a PDF, line-by-line, and add back in the text formatting, than it does to Scan/OCR the file in the first place and do the work in the order we do it.

In other words, you do not avoid the proofreading step--you actually make it longer/worse, because you have to proof line-by-line, to find italics, bold, underscored text, blockquotes, etc. It's faster and easier to run two PDF Compare functions, to find differences between two PDFs, than it is to have to manually read the source PDF against the (now reformatted) text, to find and replace all text formatting. Laboriously long and tedious work to replace all the formatting, in term of the proofing.

And that assumes that it's something simple, like a novel. Once you move past novels, of course, it gets arithmetically worse.

As I stated in my post, we've tried pretty much every variant. We've tried "save to Word" from within Acrobat. We've tried a few of those "save your PDF to Word!" websites. We've tried many, if not all, of the "PDF2XXXX" programs or apps out there. All of them "work" to some extent or the other, but the bottom line is, for the level of accuracy that we need, as commercial formatters, and the amount of time, the scanning/OCR method still works best, both in terms of time expended and quality of result.

If we only had to do one, once in a while, then doing something like you suggest I suppose makes sense. But we probably have 50-100 PDF-to-ePUB/MOBI projects in production as I type this, and as I said, in our experiments, that's not been viable for us.

Hitch

Toxaris · 11-26-2018, 11:54 AM

Not only that, Hitch, but pdftotext only works for PDFs that already contain the text. A lot only contain scans of text, and for those PDFs the result will be nothing.

Hitch · 11-26-2018, 02:59 PM

Quote:

Originally Posted by Toxaris

Not only that, Hitch, but pdftotext only works for PDFs that already contain the text. A lot only contain scans of text, and for those PDFs the result will be nothing.

Yes, that too, you are absolutely right. I wish I saw fewer of those!

Hitch

kso · 11-27-2018, 09:31 AM

Quote:

Originally Posted by Hitch

... I suppose if we received a simple PDF that was relatively plain text, and I didn't mind investing all the time needed to then go in and recode all the text formatting, that might be a way forward...
Hitch

This is probably a very naive question: but why do people give you PDFs to work from, don't they have their own copy (or copy they have rights to) in a "reasonable" file format?

klaus

Hitch · 11-27-2018, 11:14 AM

Quote:

Originally Posted by kso

This is probably a very naive question: but why do people give you PDFs to work from, don't they have their own copy (or copy they have rights to) in a "reasonable" file format?

klaus

Nah, not naive. I'd have asked it too, a decade ago before I learned the (very) hard way. Authors have pdfs, that previous publishers made for them, or Authorhouse, or Createspace, or, or or. Or they paid a print layout person, who only gave them PDFs as the final product, and didn't give them the INDD package files. Or they were made with Quark--I still see this, today, frequently from Ireland, as it happens, and eastern Europe. Or, they created a pdf using Powerpoint, or AI, or whatever. Using the native Powerpoint files, or adobe illustrator, isn't going to be better than using a PDF. I see these types of files daily.

OR, (this slays me), they had a book in print, and they had it scanned by some bozo--a friend, or a copyshop, and you'd be appalled at how many times, TODAY, we get COPIES, not scans, so there's no text layer. I will bet you I see this 1-2x monthly.

Example: I had a client, who was trade pubbed. She wanted a quote to have her 3 books scanned, OCRed, cleaned, put into print layouts and eBooks. We gave her a quote which IMHO, was fairly cheap. (I mean...if you include competent scanning, proofing, cleanup). She decided it was too expensive. She had a local Kinkos, or whatever, "scan" the pages--and they were copies, not scans, albeit, saved to PDF with a text layer. (Truly, one of the MOST appalling scanning jobs I've ever seen. She kept going on and on about what a GREAT job they'd done, and I finally told her, "look, I'm sorry, I'm glad that you're happy with those folks and their efforts, but this scanning job is dreadfully poor quality." I'm 99% sure that at the time, she didn't believe me, thought I was trying to have her spend money she didn't need to with a high-quality scanner like Golden Images.) The quality was horrendous, and I mean, blurry, crooked, etc.

She decided she was going to DIY, right? So she started copy-pasting the text, from the "PDF" copy to Word. Within a very short time, she was in TEARS--missing text, weird formatting, all the usual crap, right?

And, of course, EVERY line ends in a soft return, as well. She can't figure out why she can't make the text justify! She finally gave up and sent us the "pasted" Word files, which we're cleaning up and formatting/laying out. NOW, she thinks that what she was quoted was cheap--it took her MONTHS (from the 3rd week in July, until 10 days ago--about the 15th of November) to paste, review, etc., ONE BOOK. One book! Nearly 110 days, to do ONE. Now, what she was quoted looks pretty damn reasonable. Funny how that works.

But, you can't tell people what's involved with it; they simply don't believe you. I mean, it's all done FOR you, right? You just "save to Word" and Bob's-yer-uncle, nothin' hard 'bout that! (Like people that think that they can upload any POS to the KDP and "it makes an eBook for you!")

You can't tell people, they have to try it themselves to figure out how hard it is to actually DO and do right. {shrug}. As with all things, it looks easy to anyone who's never had to do it.

And of course--there are also those people whom you stare at, wondering "have you ever SEEN a book?" They send you their DIY efforts, saying "I just need you to add a TOC," and the thing is godawful. (n.b.: we don't do that type of work, either working "in" Word or fixer-uppers.) For print, no running heads, no page numbers, unjustified, spacing between first-line-indented paragraphs, and so on and so on. Mind-boggling, really. Don't people bother to even read blog posts, from places like TheBookDesigner, to see what a book should LOOK LIKE, before they hit the publish button? BOGGLING.

Hitch

kso · 11-27-2018, 12:13 PM

Quote:

Originally Posted by Hitch

Nah, not naive. I'd have asked it too, a decade ago before I learned the (very) hard way...

Hitch

Bet you could write a book about it

Just today my wife told me, she has to send her and her siblings' bank details to a solicitor (lawyer) here, because of some inheritance. She was told, the details cannot, I repeat, cannot be stated in the body of the email. They must be in a document, such as word, sent as an attachment.

Sometimes it's really frightening how people deal with basic IT. Maybe the gene pool needs cleansing?

klaus

Hitch · 11-27-2018, 12:22 PM

Quote:

Originally Posted by kso

Bet you could write a book about it

Just today my wife told me, she has to send her and her siblings' bank details to a solicitor (lawyer) here, because of some inheritance. She was told, the details cannot, I repeat, cannot be stated in the body of the email. They must be in a document, such as word, sent as an attachment.

Sometimes it's really frightening how people deal with basic IT. Maybe the gene pool needs cleansing?

klaus

I hate to confess this, but I also tell my clients to NEVER send me passwords, login details, etc., in email, and if they MUST, to attach it in an email. It's not secure. Is it "better" to send it in an attachment? Not really, but at least I keep them from blasting it all across the globe in the body of the email. {shrug}.

I mean, otherwise, you end up inveigling all these scenarios, and with clients like mine--it's either call them on the phone (NOOOOOOOO!), or come up with some other thing. I can't say that I blame the attorney. I mean...my clients can't download from browsers, typically. They can't find their download folders, and most of them don't even know that they HAVE download folders. Hell, given that, how would you handle it? (Seriously.)

Hitch
