OCRing + EPUBing my first book: Tips? - Page 2

Shohreh · 07-14-2020, 08:33 AM

Thanks for the great input.

For some reason, the cover doesn't appear in the EPUB when read on my e-reader.

It's 322px wide and 500px tall. I tried JPG and PNG to no avail.

Is it something in LibreOffice?

Quoth · 07-14-2020, 09:41 AM

I add the cover to Calibre and use it to convert Docx to epub.
Then epub or mobi, azw or whatever.
I edit in LO Writer, saving/editing in odt format. I do an EXTRA save in docx for Calibre, and never open that in LO Writer as Writer will ALWAYS convert on load any non-odt file.

The LO Writer (or earlier years, MS Word) never ever has the cover in it.

I edit covers in The Gimp, in layers, stored native at about x4 the resolution for an ebook. I export various resolutions of jpg and png files for different purposes:
Upload to Amazon / Smashwords etc and thence to Kobo, Apple, B&N, Tolino etc. The uploaded ebook has a lower resolution cover added by Calibre.

Then a different jpg might be used for our blog or other promotional material
A paper version will use 300dpi, 400 dpi or 600dpi depending on process/quality and thus a larger book format needs a larger image as the DPI has to be the same.

The same Epub2 is uploaded to Amazon KDP and Smashwords, but Smashwords also gets a dual mobi (because they can't tell what Kindle their customers have) as well as maybe a .doc for additional formats.
Amazon does their own conversion into all their formats from the epub2, including fully enhanced typeset KFX (there is no reason why a Kindle can't have a FW update so azw renders the same). KFX is really about delivery and DRM.

The goal, usually achieved, is that azw, kfx, epub2 should all look about the same and the same as the view in LO Writer. Old Mobi should have at least serif, sans, mono all in normal, bold, italic, bold-italic, larger headings, correct justification, relatively similar relative offsets to non-body margins, TOC, page breaks and links corresponding to the epub2/azw.

Calibre does a good job, but it needs to be fed a docx where the styles and TOC are done correctly.
I auto create the index to level 2 (the headings are all level1 or level2 and EVERYTHING not in the index / TOC is body level), copy to a plain text editor, paste back and format.
I put anchors ONLY at the start of a paragraph (each heading is also a paragraph) and then select each line of the text index and edit link. The anchor is entered just in the URL box, not via document browse, just putting # prefix. The anchors are all lowercase with no punctuation, spaces or accents, typically ch2, ch3 etc. Then Calibre makes the ebook NCX from that correctly formatted user index, which is also inline in the ebook.
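
The anchor naming rule above (all lowercase, no punctuation, spaces or accents) can be checked mechanically. A minimal sketch in plain shell, assuming a hypothetical helper name `is_valid_anchor` and sample names that are only for illustration:

```shell
#!/bin/sh
# is_valid_anchor NAME: succeed only if NAME is non-empty and made up of
# nothing but ASCII lowercase letters and digits (for example ch2, ch3).
# The characters are listed explicitly so the check does not depend on locale.
is_valid_anchor() {
    case "$1" in
        '' | *[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Sample names (hypothetical): the first two follow the rule, the rest do not.
for name in ch2 ch3 "Chapter 2" "été" ch-4; do
    if is_valid_anchor "$name"; then
        echo "ok:  #$name"
    else
        echo "bad: #$name"
    fi
done
```

This only validates names; the anchors themselves are still placed by hand in Writer as described.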

Shohreh · 07-14-2020, 10:39 AM

Thanks for the tip.

For some reason, when Calibre converts the ODT into EPUB, the cover appears twice in the EPUB as displayed on the computer with SumatraPDF, one page after the other, but only once as expected on the e-reader. Oh, well.

Tex2002ans · 07-14-2020, 11:01 AM

Quote:

Originally Posted by Shohreh

For some reason, the cover doesn't appear in the EPUB when read on my e-reader.

How are you converting it from LibreOffice to EPUB?

Are you using Calibre? Or trying to use LibreOffice's built-in Export As > Export As EPUB?

* * *

If you just want a quick conversion "that just works":

Save a DOCX copy, then use Calibre to convert DOCX->EPUB.

Calibre should detect and convert the first image in the document as a cover.

Note: DOCX->EPUB works a little bit cleaner than ODT->EPUB. Of course, keep your source document as ODT, but only save as DOCX temporarily for the conversion.

Quote:

Originally Posted by Shohreh

Is it something in LibreOffice?

LibreOffice covers should also be working fine. If you press Export As > Export As EPUB, do you see this?

[Attached image: LibreOffice.6.4.4.EPUB.Cover.png]

Which version of LibreOffice do you have?

Shohreh · 07-14-2020, 11:45 AM

Directly from the ODT file.

Also, the ToC is totally different:

I'll try the ODT → DOCX → EPUB alternative.

Quoth · 07-15-2020, 10:50 AM

Quote:

Originally Posted by Shohreh

Thanks for the tip.

For some reason, when Calibre converts the ODT into EPUB, the cover appears twice in the EPUB as displayed on the computer with SumatraPDF, one page after the other, but only once as expected on the e-reader. Oh, well.

Don't ever import odt to Calibre. Do an EXTRA save as to docx. Import docx to Calibre.
There is a setting to NOT detect covers in docx, I think; otherwise the first image, whatever it is, will replace the cover set manually in Metadata Browse for cover!
Don't include the cover in the actual wordprocessor file!

The plugin (older LO) or built in epub export in LO Writer is very poor compared to an extra Save As docx, import to Calibre.
Make sure page setup image properties are 'Tablet' to avoid resizing images.
Convert to epub2.
Do any other formats from the epub2.
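
For anyone who prefers the command line to the Calibre GUI, the steps above map roughly onto Calibre's `ebook-convert` tool. This is only a sketch: the function name `docx_to_epub2` and the file names are hypothetical, and it assumes the GUI's 'Tablet' setting corresponds to the `tablet` output profile.

```shell
#!/bin/sh
# docx_to_epub2 DOCX COVER: convert DOCX to an EPUB 2 with the same base
# name, using COVER as the cover image. Both paths come from the caller.
docx_to_epub2() {
    docx=$1
    cover=$2
    epub="${docx%.docx}.epub"
    # --docx-no-cover   do not turn the first image in the document into the cover
    # --cover           use this image file as the cover instead
    # --epub-version 2  write EPUB 2 rather than EPUB 3
    # --output-profile tablet  assumed equivalent of the 'Tablet' profile
    ebook-convert "$docx" "$epub" \
        --docx-no-cover \
        --cover "$cover" \
        --epub-version 2 \
        --output-profile tablet
}

# Example call with hypothetical file names:
# docx_to_epub2 mybook.docx cover.jpg
```

Other formats can then be produced from the resulting EPUB with further `ebook-convert` calls.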

Sarmat89 · 07-16-2020, 12:39 AM

Please, never-never-never use Tesseract or other headless OCR systems for books. All text must be proofed interactively. Also, that frontend is very primitive and it will mess up the text formatting.

roger64 · 07-16-2020, 07:11 AM

Quote:

Originally Posted by Sarmat89

Please, never-never-never use Tesseract or other headless OCR systems for books. All text must be proofed interactively. Also, that frontend is very primitive and it will mess up the text formatting.

After over one year of exclusive use of Tesseract (about 50 ebooks), I strongly disagree.

Tesseract 4.11, coupled with the latest tessdata 2.4 (ENG and FRA tested), is quite capable of OCRing any book efficiently.

With a good quality scan, you can even OCR a full book directly (about 30 pages a minute) and save it in text format. The graphical interface (gImageReader-qt5) is quite clean.
- first, you can proofread your text line by line
- with a click, the text is turned into paragraphs separated by empty ones.
Roughly, I would say you can expect about one mistake per page on average (including accents and punctuation).

Cons

No italics and no anchors: both need to be set up manually.
Garbage for full white pages (?)

Free tip

If you have white text on a black background, Tesseract will give you a blank page. So, open a terminal and use ImageMagick first with this command (adapt as needed), then proceed as usual.

Code:

convert name-image.jpg -channel RGB -negate output.jpg

Shohreh · 07-18-2020, 04:50 PM

This time, I'm trying to convert a PDF to EPUB.

Lucky me, gImageReader says: "PDFs with text: These PDF files already contain text".

Indeed, when opening the file in Windows, the text is copy/pastable with the mouse, so it's not scanned images.

FWIW, here's what cpdf says about it:

Code:

XMP pdf:Producer: Adobe Acrobat 10.0 Paper Capture Plug-in with ClearScan
XMP xmp:CreatorTool: Canon

What would you recommend I do to turn it into an EPUB?

Tex2002ans · 07-18-2020, 11:27 PM

Quote:

Originally Posted by Shohreh

This time, I'm trying to convert a PDF to EPUB.

[...]

Code:

XMP pdf:Producer: Adobe Acrobat 10.0 Paper Capture Plug-in with ClearScan

What would you recommend I do to turn it into an EPUB?

ClearScan is just one of Adobe's technologies to clean a scan by replacing the actual bitmaps with generated "custom fonts". It may look like a purely digital file, but in reality it's still a scanned document.

For more info, see: https://blogs.adobe.com/acrolaw/2009...rscan_is_smal/

All OCR errors and usual PDF->EPUB recommendations still apply.

Shohreh · 07-19-2020, 05:56 AM

That's why it looked like scanned pages, but the text is still selectable like text PDF.

I'll play with Sigil and see if it's more convenient to build an EPUB than LibreOffice Writer. The mid-page carriage returns are especially annoying.

BetterRed · 07-19-2020, 06:55 AM

Quote:

Originally Posted by Shohreh

The mid-page carriage returns are especially annoying.

Transtools Unbreaker tool (Word Add in) can fix most of those, including when they're in tables

BR

Shohreh · 07-19-2020, 07:06 AM

Thanks. Interesting that there's no free alternative for, e.g., LibreOffice. I guess it's harder than it looks.

http://www.translatortools.net/produ...ools/unbreaker

--
Edit: Opening the PDF in Abbyy FineReader does a pretty good job. Gone are the mid-sentence linebreaks (on a few test pages, at least).

An AutoIT script might come in useful to automate the process.
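
For the mid-sentence line breaks discussed above, a rough free approximation is possible once the text is in a plain-text file. This is a sketch, not a replacement for Unbreaker or FineReader: the function name `unbreak` is hypothetical, and it assumes paragraphs are separated by truly empty lines. It does not rejoin words hyphenated at a line end and knows nothing about tables.

```shell
#!/bin/sh
# unbreak: read text on stdin, join the lines inside each paragraph with
# single spaces, and print the paragraphs separated by one empty line.
unbreak() {
    # RS = "" puts awk in paragraph mode: records are split on empty lines.
    awk 'BEGIN { RS = ""; ORS = "\n\n" }
         { gsub(/[ \t]*\n[ \t]*/, " "); print }'
}

# Example call with hypothetical file names:
# unbreak < extracted.txt > rejoined.txt
```

Lines that are "blank" but contain spaces or tabs are not treated as paragraph separators, so those may need cleaning first.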

Shohreh · 07-19-2020, 04:12 PM

Incidentally, if a PDF contains two layers (scanned pages as bitmaps, and OCRed text), is there an application that can extract just the text layer, so I can open it in Sigil or LibreOffice?

I checked cpdf, mutool, and qpdf, but saw no obvious command, even just to list layers.

j.p.s · 07-19-2020, 05:28 PM

Quote:

Originally Posted by Shohreh

Incidentally, if a PDF contains two layers (scanned pages as bitmaps, and OCRed text), is there an application that can extract just the text layer, so I can open it in Sigil or LibreOffice?

I checked cpdf, mutool, and qpdf, but saw no obvious command, even just to list layers.

pdftotext:
https://en.wikipedia.org/wiki/Pdftotext

Also, k2pdfopt, documented in the PDF forum at mobileread.
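
A minimal sketch of the pdftotext route, assuming the Poppler or Xpdf version of the tool is installed. The function name `extract_text` and the file name are hypothetical. pdftotext reads only the text objects in the PDF, so the scanned bitmaps are left out of the output.

```shell
#!/bin/sh
# extract_text PDF: write the text layer of PDF to a .txt file with the
# same base name.
extract_text() {
    pdf=$1
    txt="${pdf%.pdf}.txt"
    # -enc UTF-8  write the output as UTF-8
    # -nopgbrk    leave out the form-feed characters between pages
    pdftotext -enc UTF-8 -nopgbrk "$pdf" "$txt"
}

# Example call with a hypothetical file name:
# extract_text scanned-book.pdf
```

The resulting text file still carries whatever OCR errors and hard line breaks the text layer had.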



