Old 07-22-2023, 03:38 AM   #24
Tex2002ans
Quote:
Originally Posted by Karellen
1. So hunting around for new WIA compliant scanner software, I found this...
https://www.naps2.com/
Yes, NAPS2 is awesome. I've been using that for the past few years as well.

It:
  • helps gather/reorder images
  • can create a rough PDF from them
  • can even produce a quick OCR for you (using Tesseract).

I use it when I need to create a rough PDF from an actual scanner, or to quickly crop/edit photos taken with a camera.

Like if my family gives me a small/short document to scan, I just use NAPS2 instead of busting out the full-blown editing + OCR tools!

Quote:
Originally Posted by Karellen
I realise now I attempted this without the right knowledge and tools. So thanks for all the great pointers!!

If there is anything in my workflow that could be improved, please let me know.
Read the linked threads. There are years and years of knowledge buried in there about every step of the workflow.

Edit in Word/LibreOffice (DOCX) or Sigil/Calibre (EPUB)?

In the DOCX stage, if that's where you prefer to do your edits...

LibreOffice supports Regular Expressions, so once you've mastered those, you can do lots of mass corrections in there.

LibreOffice's Regex is SO MUCH better than Word's Wildcards... but it still has limitations. So...

Personally, I do all edits in Sigil/Calibre, because you have full access to:

And since you're working directly in HTML, nothing can hide from you.
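To give a rough feel for what a Regex "mass correction" looks like, here's a minimal Python sketch. The patterns and the `clean_text` name are hypothetical examples I made up for illustration, not a list taken from any of the tools above, and you'd want to tune them to the errors your own OCR actually produces:

```python
import re

# Hypothetical cleanup rules (assumptions for illustration only).
# Each entry is (compiled pattern, replacement).
FIXES = [
    # Remove stray spaces/tabs before common punctuation: "door ," -> "door,"
    (re.compile(r"[ \t]+([,;:!?])"), r"\1"),
    # Collapse runs of spaces/tabs into a single space.
    (re.compile(r"[ \t]{2,}"), " "),
    # A typical OCR misread: "tbe" as a whole word -> "the"
    (re.compile(r"\btbe\b"), "the"),
]


def clean_text(text: str) -> str:
    """Apply every rule in FIXES, in order, to the given text."""
    for pattern, replacement in FIXES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(clean_text("He opened tbe door , then  left !"))
```

The same patterns should carry over to Sigil/Calibre/LibreOffice Find & Replace with little or no change, but always test on a copy first, since a greedy Regex can quietly damage good text.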

For more on Regex + Spellcheck Lists, and even how to take advantage of some of this stuff in LibreOffice... see my post in:

If you follow the pyramid of links, it'll:
  • Summarize how/why they're helpful.
  • + link to many other MobileRead topics where I've written about it.

My Current PDF->EPUB Workflow

I settled on:
  • PDF -> Finereader to OCR
  • -> DOCX
  • -> Word / Toxaris's EPUB Tools
  • -> EPUB.

where:
  • PDF -> Finereader gives me fantastic OCR.
  • Finereader -> DOCX carries over most of the text/formatting.
    • Note: Finereader -> EPUB, at least in 12, was a little buggy, so you could lose chunks of text/footnotes. Maybe things got better in 15+.
  • Toxaris's EPUB Tools are specifically built to fix lots of OCR/Finereader's quirks.
    • Merging split pages, fixing lists, fixing hard/soft hyphens, normalizing fonts/font sizes, removing font colors, [...].
    • (This saves TONS OF TIME from manual cleanup of simple OCR/formatting errors.)
  • Toxaris -> EPUB = incredibly clean HTML, carrying over the Styles + leaving you with barebones formatting (<h1>s + <i> + <b>).
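To show what one of those cleanup steps means in practice (fixing hard/soft hyphens), here's a minimal Python sketch. This is NOT how Toxaris's EPUB Tools does it; the `fix_hyphens` name and the rules are my own assumptions, and the line-break rule is deliberately naive:

```python
import re

SOFT_HYPHEN = "\u00ad"  # invisible "may break here" hint that OCR/Word output often contains


def fix_hyphens(text: str) -> str:
    """Strip soft hyphens and rejoin words split by a hyphen at a line break."""
    # Soft hyphens are never needed in reflowable text, so drop them all.
    text = text.replace(SOFT_HYPHEN, "")
    # Join "implemen-\ntation" -> "implementation".
    # Assumption: lowercase letter, hyphen, newline, lowercase letter means a split word.
    # This will wrongly merge real compounds like "well-\nknown", so review the results.
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


if __name__ == "__main__":
    print(fix_hyphens("an inter\u00adnational implemen-\ntation"))
```

A real tool would check each joined word against a dictionary before deciding whether to keep the hyphen, which is exactly the kind of tedious work that's nice to have automated.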

This gives me extremely clean HTML code—with almost all the trash removed—so when I begin editing EPUB, I can focus purely on:
  • fixing the text
  • + reintroducing actual formatting I wanted to maintain, like blockquotes.

This drastically cuts down on all the wasted in-between cleanup/repair time.

- - -

Side Note: Sadly, Toxaris's EPUB Tools is now abandoned + will not be getting support (or the much-anticipated version 2 release).

I did recover and share one of the final versions of EPUBTools (v1.27.1) in:

You could also still read Toxaris's original "EPUBTools" MobileRead thread or visit his (now-dead) website via Archive.org:

The instant I finally gave in and began using this, it fully converted me. It was just SO MUCH BETTER than the manual cleanup I was doing before.

And the "Dialogue Checker" alone is the best dang thing since sliced bread:

To even APPROXIMATE that same type of "find the mismatching quotation marks" functionality... these are the kinds of steps + Regexes you'd need to use:

and that still doesn't even get close to what Toxaris solved with his amazing cleanup tool.
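Just to illustrate the basic idea of a mismatched-quote check (and nothing more), here's a minimal Python sketch. It is not the Dialogue Checker's algorithm, which I can't reproduce here; the function names are hypothetical, and it assumes the text already uses curly double quotes, one paragraph per string:

```python
OPEN_QUOTE = "\u201c"   # left curly double quote
CLOSE_QUOTE = "\u201d"  # right curly double quote


def quotes_balanced(paragraph: str) -> bool:
    """True if every opening curly quote is followed by a closing one, with no nesting."""
    inside = False
    for ch in paragraph:
        if ch == OPEN_QUOTE:
            if inside:          # second opener before a closer
                return False
            inside = True
        elif ch == CLOSE_QUOTE:
            if not inside:      # closer with no opener
                return False
            inside = False
    return not inside           # still open at the end -> unbalanced


def flag_paragraphs(paragraphs):
    """Return the indexes of paragraphs that need a manual look."""
    return [i for i, p in enumerate(paragraphs) if not quotes_balanced(p)]


if __name__ == "__main__":
    sample = [
        "\u201cHello,\u201d she said.",
        "\u201cWhere are you going? he asked.",
        "No dialogue here.",
        "I\u2019m fine,\u201d he said.",
    ]
    print(flag_paragraphs(sample))
```

Note this will also flag legitimate cases, like a speech that continues across several paragraphs (opening quote on each, closing quote only on the last), so treat the output as a list of places to check by eye.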

- - -

Side Note #2: If you want more random EPUB productivity tips, also see my posts in:

Last edited by Tex2002ans; 07-23-2023 at 01:48 AM.