11-18-2018, 09:59 AM | #1 |
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
|
Pandoc and Tesseract to keep images and TOC
I managed to convert a pdf book to text in two ways:
1. - use ghostscript to transform it to tif
- use tesseract to OCR the tif to txt
- use pandoc to convert the txt to epub
gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit
tesseract mybook.tif mybook -l eng
or:
2. - use k2pdfopt to transform the pdf into a pdf formatted for e-readers
options: -ocr t -ocrhmax 1.5 -ocrvis st
Version 1 recognized the written text better, but lost the images.
Version 2 kept the images somehow, but the characters are rendered a bit noisy.
With neither approach can I get a TOC file, i.e. a hyperlinked index of the book.
If possible, I would also like to remove the words in the page headers, like the word "introduction" printed at the top of every page of the chapter "introduction" in the paper book.
I would like to reflow the pdf to epub so as to:
- KEEP IMAGES
- be able to intervene on the TOC to create an index
- possibly remove the words in the page headers (could do it with regex, if need be, or manually)
- reflow the text to epub
- finally use calibre to handle epub > kindle / e-readers
Could you advise what I am missing? How could I complete / edit the two approaches to get the desired result? |
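The regex idea for the running headers could look roughly like the sketch below. It is only an illustration under assumptions: that the OCR text marks page breaks with a form feed character (check your own output), and that a running header is the first non-blank line of a page, optionally with a page number next to it. The function names and the header word list are made up for the example.

```python
import re


def split_pages(text):
    """Split OCR text into pages, assuming a form feed marks each page break."""
    return text.split("\f")


def strip_running_headers(pages, header_words):
    """Remove the first non-blank line of a page when it is only a running header.

    A line counts as a header if it consists of one of header_words,
    optionally preceded or followed by a page number.
    """
    words = "|".join(re.escape(w) for w in header_words)
    pattern = re.compile(
        r"^\s*(?:\d+\s+)?(?:" + words + r")(?:\s+\d+)?\s*$", re.IGNORECASE
    )
    cleaned = []
    for page in pages:
        lines = page.splitlines()
        for i, line in enumerate(lines):
            if line.strip():
                # Only the first non-blank line is a header candidate.
                if pattern.match(line):
                    del lines[i]
                break
        cleaned.append("\n".join(lines))
    return cleaned
```

Note that a real chapter heading consisting of the same word would be removed as well, so the result still needs a manual check.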
11-18-2018, 03:30 PM | #2 |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Have you tried using asciidoc (or the compatible asciidoctor) to convert the text into epub?
You have to apply some very light markup to the text to designate chapters. In return you automatically get a hyperlinked table of contents. |
11-19-2018, 09:40 AM | #3 | ||
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
|
Hi Jps,
Quote:
Quote:
Could you tell me which is the right markup? I would like to KEEP IMAGES from ghostscript, like k2pdfopt attempts to do. Does tesseract allow keeping images (or have some option to detect images and skip them during OCR processing)? |
||
11-23-2018, 06:00 PM | #4 |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Sorry for not being clear and not having time to elaborate until now.
asciidoc is a standalone python script that converts a very lightly marked up plain text file straight to either HTML, EPUB, or PDF with a single command each. Basically, you put an "=" character at the front of the line with the title, "==" in front of each chapter heading, "===" in front of section titles, etc. Links, references, index, embedding and linking to images are all easy. A table of contents, if desired, is automatically generated.
The rationale for asciidoc is at: https://asciidoctor.org/docs/what-is-asciidoc/
A reference for asciidoc markup is at: https://asciidoctor.org/docs/asciido...ick-reference/
I think the above is also suitable as a tutorial, but I have also just found http://www.vogella.com/tutorials/AsciiDoc/article.html which I think is relatively new; I had not seen it before.
asciidoc writer's guide: https://asciidoctor.org/docs/asciidoc-writers-guide/
(asciidoctor is a ruby utility that converts asciidoc markup. I use whichever I prefer at the moment and sometimes switch back and forth. asciidoctor has pretty much taken over stewardship of asciidoc syntax.)
If you have a PDF with a text layer, extract that without using OCR. If there is no text layer, then you just need OCR to get plain text. Formatting would just get in the way. |
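As a rough sketch of the markup described above, the following adds "=" and "==" prefixes to plain OCR text. The heading rule (lines such as "Chapter 3 ..." or "Introduction") and all names here are assumptions made for illustration; a real book will need its own rule or manual marking. The `:toc:` line is the AsciiDoc document attribute that asks for a generated table of contents.

```python
import re

# Hypothetical rule for what counts as a chapter heading; adjust per book.
CHAPTER_RE = re.compile(
    r"^(chapter\s+\d+\b.*|introduction|preface|epilogue)$", re.IGNORECASE
)


def to_asciidoc(title, text, chapter_re=CHAPTER_RE):
    """Prefix the title with '= ' and every detected chapter heading with '== '."""
    out = ["= " + title, ":toc:", ""]
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and chapter_re.match(stripped):
            # Headings need a blank line before and after them.
            out.extend(["", "== " + stripped, ""])
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```

The returned string can be saved as, say, mybook.adoc and then handed to asciidoc or asciidoctor.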
11-24-2018, 07:33 AM | #5 |
Junior Member
Posts: 7
Karma: 42206
Join Date: Nov 2018
Device: Kindle 8
|
Thank you, j.p.s, your reference will be useful.
I am still missing a step. I am not *creating* a document to turn into an epub; I want to *convert* a pdf to an epub, and have as a final result:
- text that is rendered sharp (no OCR layer) and that can be selected, highlighted and zoomed in the e-reader
- images such as photos, graphs and tables
I am doing the following:
1. I process a pdf of scanned images with ghostscript and convert it to tiff
gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit
2. Apply tesseract to obtain txt
tesseract mybook.tif mybook -l eng
Or apply tesseract to obtain a searchable pdf
Pros and Cons
With a txt I get the desired result for the text and I can use asciidoctor for the markup, but I am missing the extracted images.
With a searchable pdf, I see it contains the desired images, but the text is rendered as OCR, with the page images containing the text in the background. I can't edit it to apply markup, the text won't support those features in the e-reader, and it is not sharp (it looks like a rendered image).
I looked at the documentation of asciidoctor and have not found a way to process a searchable pdf *to* an epub, while it is clear I can process a txt to an epub. Manually creating and referencing images in a txt and then applying Asciidoctor is less than ideal, so I am missing a step.
Can I use the tools you suggest to *extract* text AND an image catalog from the original file (tiff or searchable pdf)?
I would need the images (photos, graphs, and tables) to be referenced in the output text, as per: https://asciidoctor.org/docs/asciido...ng-with-images
If asciidoc or asciidoctor won't extract images, which steps would you suggest to obtain the final result: a txt file with references to extracted images, which I could then finalise with asciidoctor? |
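One possible shape for that missing step, sketched under assumptions: the images have already been extracted by some other tool (poppler's pdfimages is one candidate) and you know which page each file came from. The function name and the page-to-files mapping are hypothetical; only the `image::file[]` block macro is AsciiDoc syntax. The sketch appends the references at the end of each page's text, so their exact position still has to be adjusted by hand.

```python
def add_image_refs(pages, images_by_page):
    """Append an AsciiDoc block image macro for every image found on a page.

    pages: list of page texts, in order (page 1 first).
    images_by_page: dict mapping a 1-based page number to a list of image files.
    """
    out = []
    for number, text in enumerate(pages, start=1):
        parts = [text.rstrip()]
        for filename in images_by_page.get(number, []):
            # A block image macro must sit on its own line.
            parts.append("image::%s[]" % filename)
        out.append("\n\n".join(parts))
    return "\n\n".join(out) + "\n"
```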
11-24-2018, 09:56 AM | #6 |
Grand Sorcerer
Posts: 5,278
Karma: 98804578
Join Date: Apr 2011
Device: pb360
|
Hi gg4u,
In addition to marking the title and chapter heading with "=" characters, it is necessary to insert references to images yourself. asciidoc is best for quickly making a nice finished document in multiple formats starting from nothing. It was my thought that it could help with a part of your process. Discussions of all kinds of subjects on mobileread can be very contentious, but across all the various forums on mobileread there is widespread agreement that conversion from PDF to any other format has all kinds of problems and that there is no good way to automate it. |
11-24-2018, 11:34 AM | #7 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
And that's the bottom line. There is, quite simply, NO GOOD WAY to automate conversion from PDF. We do this professionally--let me tell you what we do, after hundreds of experiments and thousands of books:
That's what we do. We've tried EVERY possible automated process, from those suggested by others, to some we've devised and created ourselves. This is the fastest, most accurate way we've found. I wish it weren't so, but this is the bottom line. FWIW. I know it's not what you wanted to hear, but...there it is. Hitch |
|
11-26-2018, 08:55 AM | #8 |
Enthusiast
Posts: 47
Karma: 10
Join Date: Jun 2018
Location: UK
Device: Android, iPad, iPod, kindle {keyboard,fire7,hdx8.9} kobo, Sony PRS 600
|
Why don't you try pdftotext, part of xpdf, and a standard application on linux (and probably others)? It extracts whatever text is in the pdf and writes it to a plain text file, avoiding the OCR/proofreading steps. You can even specify a crop area by giving it the top/left coordinate and the width and height of the crop area to work on.
klaus |
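A minimal sketch of the crop-area call, assuming the poppler build of pdftotext, where -x and -y give the top-left corner of the crop area, -W and -H its width and height, and -f / -l the first and last page. Other builds may name these options differently, so check the man page first. The helper names and the numbers in the usage example are made up.

```python
import subprocess


def pdftotext_crop_command(pdf_path, txt_path, x, y, width, height,
                           first_page=None, last_page=None):
    """Build the argument list for a pdftotext call restricted to a crop area."""
    cmd = ["pdftotext", "-x", str(x), "-y", str(y),
           "-W", str(width), "-H", str(height)]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, txt_path]
    return cmd


def run_pdftotext(cmd):
    """Run the command; raises CalledProcessError if pdftotext fails."""
    subprocess.run(cmd, check=True)
```

For example, `pdftotext_crop_command("book.pdf", "book.txt", 0, 60, 612, 700)` starts the crop 60 units below the top of the page, which is one way to leave a running header out of the extracted text.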
11-26-2018, 09:23 AM | #9 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
In other words, you do not avoid the proofreading step--you actually make it longer/worse, because you have to proof line-by-line, to find italics, bold, underscored text, blockquotes, etc. It's faster and easier to run two PDF Compare functions, to find differences between two PDFs, than it is to have to manually read the source PDF against the (now reformatted) text, to find and replace all text formatting. Laboriously long and tedious work to replace all the formatting, in terms of the proofing. And that assumes that it's something simple, like a novel. Once you move past novels, of course, it gets arithmetically worse.
As I stated in my post, we've tried pretty much every variant. We've tried "save to Word" from within Acrobat. We've tried a few of those "save your PDF to Word!" websites. We've tried many, if not all, of the "PDF2XXXX" programs or apps out there. All of them "work" to some extent or the other, but the bottom line is, for the level of accuracy that we need, as commercial formatters, and the amount of time, the scanning/OCR method still works best, both in terms of time expended and quality of result.
If we only had to do one, once in a while, then doing something like you suggest I suppose makes sense. But we probably have 50-100 PDF-to-ePUB/MOBI projects in production as I type this, and as I said, in our experiments, that's not been viable for us. Hitch |
|
11-26-2018, 11:54 AM | #10 |
Wizard
Posts: 4,520
Karma: 121692313
Join Date: Oct 2009
Location: Heemskerk, NL
Device: PRS-T1, Kobo Touch, Kobo Aura
|
Not only that, Hitch, but pdftotext only works for PDFs that already contain the text. A lot only contain scans of text, and for those PDFs the result will be nothing.
|
11-26-2018, 02:59 PM | #11 |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
|
11-27-2018, 09:31 AM | #12 | |
Enthusiast
Posts: 47
Karma: 10
Join Date: Jun 2018
Location: UK
Device: Android, iPad, iPod, kindle {keyboard,fire7,hdx8.9} kobo, Sony PRS 600
|
Quote:
klaus |
|
11-27-2018, 11:14 AM | #13 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
OR, (this slays me), they had a book in print, and they had it scanned by some bozo--a friend, or a copyshop, and you'd be appalled at how many times, TODAY, we get COPIES, not scans, so there's no text layer. I will bet you I see this 1-2x monthly.
Example: I had a client, who was trade pubbed. She wanted a quote to have her 3 books scanned, OCRed, cleaned, put into print layouts and eBooks. We gave her a quote which IMHO, was fairly cheap. (I mean...if you include competent scanning, proofing, cleanup). She decided it was too expensive. She had a local Kinkos, or whatever, "scan" the pages--and they were copies, not scans, albeit, saved to PDF with a text layer. (Truly, one of the MOST appalling scanning jobs I've ever seen. She kept going on and on about what a GREAT job they'd done, and I finally told her, "look, I'm sorry, I'm glad that you're happy with those folks and their efforts, but this scanning job is dreadfully poor quality." I'm 99% sure that at the time, she didn't believe me, thought I was trying to have her spend money she didn't need to with a high-quality scanner like Golden Images.) The quality was horrendous, and I mean, blurry, crooked, etc.
She decided she was going to DIY, right? So she started copy-pasting the text, from the "PDF" copy to Word. Within a very short time, she was in TEARS--missing text, weird formatting, all the usual crap, right? And, of course, EVERY line ends in a soft return, as well. She can't figure out why she can't make the text justify! She finally gave up and sent us the "pasted" Word files, which we're cleaning up and formatting/laying out. NOW, she thinks that what she was quoted was cheap--it took her MONTHS (from the 3rd week in July, until 10 days ago--about the 15th of November) to paste, review, etc., ONE BOOK. One book! Nearly 110 days, to do ONE. Now, what she was quoted looks pretty damn reasonable. Funny how that works. But, you can't tell people what's involved with it; they simply don't believe you.
I mean, it's all done FOR you, right? You just "save to Word" and Bob's-yer-uncle, nothin' hard 'bout that! (Like people that think that they can upload any POS to the KDP and "it makes an eBook for you!") You can't tell people, they have to try it themselves to figure out how hard it is to actually DO and do right. {shrug}. As with all things, it looks easy to anyone who's never had to do it. And of course--there are also those people whom you stare at, wondering "have you ever SEEN a book?" They send you their DIY efforts, saying "I just need you to add a TOC," and the thing is godawful. (n.b.: we don't do that type of work, either working "in" Word or fixer-uppers.) For print, no running heads, no page numbers, unjustified, spacing between first-line-indented paragraphs, and so on and so on. Mind-boggling, really. Don't people bother to even read blog posts, from places like TheBookDesigner, to see what a book should LOOK LIKE, before they hit the publish button? BOGGLING. Hitch |
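The "every line ends in a soft return" problem also shows up in OCR and pdftotext output, and a first cleanup pass can be sketched as below. It assumes plain text in which every visual line ends with a newline and paragraphs are separated by blank lines; it does not rejoin words hyphenated across lines and it cannot recover italics or any other formatting. The function name is made up for the example.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into one line per paragraph.

    Paragraphs are assumed to be separated by at least one blank line.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for para in paragraphs:
        # Drop the line breaks inside a paragraph and normalise the spacing.
        lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)
```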
|
11-27-2018, 12:13 PM | #14 | |
Enthusiast
Posts: 47
Karma: 10
Join Date: Jun 2018
Location: UK
Device: Android, iPad, iPod, kindle {keyboard,fire7,hdx8.9} kobo, Sony PRS 600
|
Quote:
Just today my wife told me she has to send her and her siblings' bank details to a solicitor (lawyer) here, because of some inheritance. She was told the details cannot, I repeat, cannot be stated in the body of the email. They must be in a document, such as Word, sent as an attachment. Sometimes it's really frightening how people deal with basic IT. Maybe the gene pool needs cleansing? klaus |
|
11-27-2018, 12:22 PM | #15 | |
Bookmaker & Cat Slave
Posts: 11,462
Karma: 158448243
Join Date: Apr 2010
Location: Phoenix, AZ
Device: K2, iPad, KFire, PPW, Voyage, NookColor. 2 Droid, Oasis, Boox Note2
|
Quote:
I mean, otherwise, you end up inveigling all these scenarios, and with clients like mine--it's either call them on the phone (NOOOOOOOO!), or come up with some other thing. I can't say that I blame the attorney. I mean...my clients can't download from browsers, typically. They can't find their download folders, and most of them don't even know that they HAVE download folders. Hell, given that, how would you handle it? (Seriously.) Hitch |
|
|