MobileRead Forums - View Single Post

Tex2002ans · 03-01-2021, 02:43 PM

Quote:

Originally Posted by FDPuthuff

Tex2002ans, Thank you sir for trying to let cooler heads prevail.

Quote:

Originally Posted by FDPuthuff

I have been trying to wrap my head around regex and was just curious how it might be used to save time removing all the page numbers and header stuff which does not get converted into proper header info in the Word file.

Once you understand the WHY/HOW of regex + thinking in "search by pattern"... you'll be able to work much more efficiently.

For example, without regex you'd be stuck doing dozens of individual searches:

Find "Page 123"
Find "Page 124"
[...]
Find "Page 256"

Or a plain-text replace that hits far more than you wanted:

Find "Tom" and change to "Smith"
- Now you'll get "Tomorrow" -> "Smithorrow"!!!
- And "Tomas's" -> "Smithas's"

Instead, regex lets you search for patterns/categories:

Find the word "Page" + followed by any numbers.
- Regex Search: Page \d+
- Page = find "Page"
- \d = any single digit (0-9)
- + = "one or more"
Find "Tom" + with or without an apostrophe s.
- Regex Search: \bTom'*s*\b
- \b = make sure this is the edge of a word
- Tom = find "Tom"
- ' = find the apostrophe
- * = "zero or more"
- s = find the "s"
- * = "zero or more"
- \b = make sure this is the edge of a word
  - So now it'll hit "Tom" + "Tom's" (and also "Toms", since the apostrophe and the s are each optional)
  - and NOT "Tomorrow" + "Tomas"
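
If you want to poke at those two patterns outside an editor, here is a minimal sketch in Python. The sample sentence is made up, Python's `re` module is just my choice of test bench (not something Calibre/Sigil/Word require), and the parentheses around `'*s*` are my addition so the replacement can carry the "'s" over.

```python
import re

# Made-up sample text, only for trying the patterns.
text = "Page 123 Tom said Tom's dog will come Tomorrow with Tomas. Page 124"

# "Page", a space, then one or more digits.
page_hits = re.findall(r"Page \d+", text)
print(page_hits)

# "Tom" with an optional apostrophe and/or s, bounded by word edges.
tom_hits = re.findall(r"\bTom'*s*\b", text)
print(tom_hits)

# Same pattern, but with the optional ending captured as group 1
# so that "Tom's" becomes "Smith's" instead of losing the ending.
renamed = re.sub(r"\bTom('*s*)\b", r"Smith\1", text)
print(renamed)
```

"Tomorrow" and "Tomas" are left alone because the closing \b fails when another letter follows "Tom".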

Word's regex uses slightly different symbols from Calibre/Sigil, but all the same concepts apply.

Quote:

Originally Posted by FDPuthuff

Something my website does not mention, yet, is that I also do proof-listening for audio-books. I was sent audio files for a couple of chapters of the book, along with PDF files to check the audio against.

Fantastic, fantastic.

You may also want to check out these audiobook talks given at ebookcraft 2019 (a yearly ebook conference):

(I'll get around to summarizing all the info from these talks one of these days... lol.)

Quote:

Originally Posted by FDPuthuff

I appreciate the concern, but it is truly just a question about something I figure I will run into as I am streamlining my process of converting PDFs.

If you only care about the text... k2pdfopt can crop headers/footers right out of the PDF.

Willus is the master there...

See his program/thread: "k2pdfopt: optimizes PDFs for viewing on e-readers".

He's extremely helpful/responsive, and has helped hundreds (thousands?) of people crop their PDFs.

You could also see some slightly related discussion/tangents in this thread:

"Optimize PDFs from archive.org for E-Ink devices"

But I'd learn the regex method. It'll be far more efficient in the long run (and applicable to actual editing/copyediting too!).
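
A rough sketch of what the regex route could look like for the page-number/header cleanup, assuming the PDF text has already been dumped to plain text. The sample text and the "My Book Title" running header are hypothetical placeholders (swap in whatever your file actually repeats), and Python is only a convenient place to test the pattern.

```python
import re

# Hypothetical text pasted from a PDF. "My Book Title" stands in for
# whatever running header the real file repeats on every page.
raw = (
    "My Book Title\n"
    "It was a dark and stormy night.\n"
    "Page 12\n"
    "My Book Title\n"
    "The rain fell in torrents.\n"
    "Page 13\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of each line,
# so only lines that consist of nothing but the junk are removed.
# The trailing \n? swallows the line break so no blank line is left.
junk_line = re.compile(r"^(?:Page \d+|My Book Title)[ \t]*$\n?", re.MULTILINE)

cleaned = junk_line.sub("", raw)
print(cleaned)
```

Because the pattern is anchored to whole lines, a sentence in the body that merely mentions "Page 12" in the middle of other text is not touched.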

Quote:

Originally Posted by JSWolf

But why the hassle of converting from PDF when there is already a version of this eBook in reflowable format that you can get from Amazon?

PDF is usually the proof copy.

As another example:

I JUST completed Book #5 for an author... It was supposed to include 1 chapter from each of his previous 4 books (+ new Foreword/Intro).

Somewhere along the line, the ebook vs. print for #1-4 became wildly out of sync.

I call this the great "bifurcation". See my posts in:

The author only proofed the PDFs, and gave lots of wording/format changes here and there.

When I opened the EPUB/MOBIs, they were missing italics, em dashes, commas, bold labels in the captions, etc.

So what should've been a simple copy/paste of each HTML chapter from #1-4... became a mess.

And the only way to untangle it was to redo everything from the other formats.