10-14-2024, 10:17 PM   #20
Tex2002ans
Quote:
Originally Posted by mikapanja
Is there a way to find (and possibly highlight) repeated word groups in ePubs?

[...]

I'd like it to find non-adjacent repeated word groups, i.e. scattered throughout the text.

Not directly...

But like Doitsu pointed out, what you want to look for is called:
  • n-grams
    • These are "X number of words in a row".

So:
  • 4-grams = 4 words in a row
  • 3-grams = 3 words in a row
  • 2-grams = 2 words in a row
  • 1-gram = 1 word in a row
    • This is just Spellcheck Lists! A list of every single word (+ its # of hits) in the book!
      • In Calibre: Tools > Check Spelling (Alt+F7)
      • In Sigil: Tools > Spellcheck > Spellcheck (Ctrl+Alt+Q)
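
If you're curious what a 1-gram list boils down to, here's a rough Python sketch. This is NOT how Calibre or Sigil build their lists internally; the function name word_hits and the "what counts as a word" regex are my own assumptions for illustration.

```python
import re
from collections import Counter

def word_hits(text):
    """Return a Counter mapping each word to its number of hits."""
    # Assumption: a "word" is a run of letters, optionally joined by
    # apostrophes or hyphens (so "Co-operate" stays in one piece).
    words = re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text)
    return Counter(words)

sample = "Co-operate or cooperate? Co-operate!"
hits = word_hits(sample)

# Sorting alphabetically puts spelling variants next to each other.
for word, count in sorted(hits.items(), key=lambda kv: kv[0].lower()):
    print(count, word)
```

Seeing "cooperate" and "Co-operate" side by side in the sorted list is the kind of thing that makes inconsistent hyphenation easy to spot.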

You can also use Calibre to temporarily convert your book to a TXT, and then there are plenty of "n-gram" tools out there to try and test out.

- - -

Side Note: I've written about "List-Based Spellchecking" + n-grams in detail, and have been using this to rip apart + edit books... for over 10 years now.

For some of my recent posts, see:

I cover stuff like how I use Spellcheck Lists to catch:
  • Typos
  • All "Foreign Words"
  • Mismatching Accents
  • Misspelled Names
  • Inconsistent Hyphenation

then how I use n-grams to catch repetitious repetitions throughout the books!

I also use Regular Expressions to quickly catch/refine/clean up a lot of this repetitious crap too!

- - -

Side Note #2: I even gave a talk about this last year in the:

- - -

Side Note #3: If you're interested, just last week I wrote an "article" on how I use n-grams.

This past month, I've been working on (conversion+proofing of) a 450k word beast of an ebook...

The author wanted me to copyedit/proofread, so I:
  • generated an n-grams spreadsheet
  • + wrote up a breakdown of how I use n-grams (with real-life examples from the book).

Here's a little sample:

- - - - - - - - - -

N-grams

These show you how many times you "repeat a phrase"/"chunk of words".

So a list of "3-grams" would show you every "chunk of 3 words in a row".

If you took the sentence:
  • Show an example sentence with an example sentence.

and ran 3-grams on it, the output would show:
  • 2 an example sentence
  • 1 Show an example
  • 1 example sentence with
  • 1 sentence with an
  • 1 with an example

You repeated "an example sentence" twice!

When you run this across the entire book, these "repetitive patterns" pop right out!
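
For the curious, the counting itself can be sketched in a few lines of Python. This is only a minimal illustration, not the tool I actually used: the function name ngram_counts and the simple word-splitting regex are assumptions, and it keeps capitalization as-is.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive words in text."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[\w']+", text)
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

sentence = "Show an example sentence with an example sentence."

# Print each 3-gram with its hit count, most frequent first.
for phrase, hits in ngram_counts(sentence, 3).most_common():
    print(hits, phrase)
```

Run on the example sentence, "an example sentence" comes out with 2 hits and the other four 3-grams with 1 each. For a whole book, you'd feed in the TXT export instead of a single sentence.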

How I Use Them

1. I start with the biggest n-grams first...
  • Then work my way down.
  • 6-grams, 5-grams, 4-grams, ...
2. When I find an interesting phrase + high number...
  • I search the entire book for it.
3. I read the sentence...
  • Use this to chop/refine!
  • Fix/reword sentences as needed.
4. Repeat Step 2 in passes.
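
The steps above can be roughed out in Python like this. It's a sketch under my own assumptions, not the actual spreadsheet workflow: the names repeated_phrases and sentences_containing, the min_hits cutoff, lowercasing everything, and the naive sentence splitting are all made up for illustration.

```python
import re
from collections import Counter

WORD = r"[\w']+"  # assumption: words are runs of letters/digits/apostrophes

def repeated_phrases(text, sizes=(6, 5, 4, 3, 2), min_hits=2):
    """Yield (n, phrase, hits), starting from the biggest n-grams."""
    words = re.findall(WORD, text.lower())
    for n in sizes:
        grams = Counter(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
        for phrase, hits in grams.most_common():
            if hits < min_hits:
                break  # everything after this is rarer; go to the next size
            yield n, phrase, hits

def sentences_containing(text, phrase):
    """Return the sentences that contain phrase, ignoring case/punctuation."""
    # Naive split: a sentence ends at . ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    target = f" {phrase.lower()} "
    return [
        s for s in sentences
        if target in f" {' '.join(re.findall(WORD, s.lower()))} "
    ]

book = ("He nodded slowly at the end of the day. She laughed. "
        "At the end of the day, nothing changed.")

for n, phrase, hits in repeated_phrases(book, sizes=(6, 5)):
    print(f"{n}-gram x{hits}: {phrase}")
    for sentence in sentences_containing(book, phrase):
        print("    ", sentence)
```

The reading and rewording (Step 3) is still a human job. The script only gets you to the right sentences faster.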

In your case, we can skip the 7-grams and 6-grams (they're mostly just super-long titles like "Chairman of the Joint Chiefs of Staff").

5-grams is where we start seeing really interesting patterns.

[... It then goes through 5-grams, 4-grams, 3-grams, 2-grams... showing the types of things/patterns that can be found with each. ...]

Last edited by Tex2002ans; 10-14-2024 at 11:04 PM.