Quote:
Originally Posted by mikapanja
Is there a way to find (and possibly highlight) repeated word groups in ePubs?
[...]
I'd like it to find non-adjacent repeated word groups, i.e., scattered throughout the text.
Not directly...
But like Doitsu pointed out, what you want to look for is called:
- n-grams
  - These are "X number of words in a row".
So:
- 4-grams = 4 words in a row
- 3-grams = 3 words in a row
- 2-grams = 2 words in a row
- 1-gram = 1 word in a row
  - This is just Spellcheck Lists! A list of every single word (+ its # of hits) in the book!
    - In Calibre: Tools > Check Spelling (Alt+F7)
    - In Sigil: Tools > Spellcheck > Spellcheck (Ctrl+Alt+Q)
You can also use Calibre to temporarily convert your book to a TXT, and then there are plenty of "n-gram" tools out there to try and test out.
- - -
Side Note: I've written about "List-Based Spellchecking" + n-grams in detail, and have been using this to rip apart + edit books... for over 10 years now.
For some of my recent posts, see:
I cover stuff like how I use Spellcheck Lists to catch:
- Typos
- All "Foreign Words"
- Mismatching Accents
- Misspelled Names
- Inconsistent Hyphenation
then how I use n-grams to catch repetitious repetitions throughout the books!
I also use Regular Expressions to quickly catch/refine/clean up a lot of this repetitious crap too!
- - -
Side Note #2: I even gave a talk about this last year in the:
- - -
Side Note #3: If you're interested, just last week I wrote an "article" on how I use n-grams.
This past month, I've been working on (conversion+proofing of) a 450k word beast of an ebook...
The author wanted me to copyedit/proofread, so I:
- generated an n-grams spreadsheet
- wrote up a breakdown of how I use n-grams (with real-life examples from the book).
Here's a little sample:
- - - - - - - - - -
N-grams
These show you how many times you "repeat a phrase"/"chunk of words".
So a list of "3-grams" would show you every "chunk of 3 words in a row".
So if you took:
- Show an example sentence with an example sentence.
and ran 3-grams on it, the output would show:
- 2 an example sentence
- 1 Show an example
- 1 example sentence with
- 1 sentence with an
- 1 with an example
You repeated "an example sentence" twice!
When you run this across the entire book, these "repetitive patterns" pop right out!
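If you'd like to try this yourself, here is a minimal Python sketch of the counting described above. It is not the tool used for the spreadsheet, just one way to do it. The function name `ngram_counts` and the word-splitting regex are my own assumptions, and it lowercases everything first, so "Show an example" comes out as "show an example".

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive words in text (hypothetical helper)."""
    # Assumption: a "word" is letters/digits, optionally joined by an
    # internal apostrophe or hyphen (don't, well-known). Case is ignored.
    words = re.findall(r"\w+(?:['’-]\w+)*", text.lower())
    # Slide a window of n words across the list and join each window.
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

counts = ngram_counts("Show an example sentence with an example sentence.", 3)

# Print "hits phrase", most frequent first, like the list above.
for phrase, hits in counts.most_common():
    print(hits, phrase)
```

Swap in the text of a whole book (for example, the TXT conversion) and change `n` to get 2-grams, 4-grams, and so on.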
How I Use Them
1. I start with the biggest n-grams first...
• Then work my way down.
• 6-grams, 5-grams, 4-grams, ...
2. When I find an interesting phrase + high number...
• I search the entire book for it.
3. I read the sentence...
• Use this to chop/refine!
• Fix/reword sentences as needed.
4. Repeat Steps 2 and 3 in passes.
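The steps above could be roughly sketched in Python like this. It is only an illustration under my own assumptions: the helper names (`repeated_ngrams`, `sentences_containing`), the `min_hits` cutoff, the naive sentence splitting, and the three-sentence sample text are all made up for the example. Deciding which phrases are "interesting" and how to reword them is still a human job.

```python
import re
from collections import Counter

def words_of(text):
    # Assumption: lowercase words, allowing internal apostrophes/hyphens.
    return re.findall(r"\w+(?:['’-]\w+)*", text.lower())

def repeated_ngrams(text, n, min_hits=2):
    """Step 1: list n-grams that occur at least min_hits times."""
    words = words_of(text)
    counts = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(phrase, hits) for phrase, hits in counts.most_common() if hits >= min_hits]

def sentences_containing(text, phrase):
    """Steps 2-3: pull out the sentences that contain a phrase, for review."""
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    target = words_of(phrase)
    found = []
    for sentence in sentences:
        w = words_of(sentence)
        # Compare word by word so punctuation and case do not matter.
        if any(w[i:i + len(target)] == target for i in range(len(w) - len(target) + 1)):
            found.append(sentence)
    return found

# Made-up sample text, standing in for a whole book.
book_text = (
    "He looked at her for a long moment. "
    "She looked at her for a long moment too. "
    "Then he left."
)

# Biggest n-grams first, then work down.
for n in range(6, 1, -1):
    for phrase, hits in repeated_ngrams(book_text, n):
        print(f"{n}-gram x{hits}: {phrase}")
        for sentence in sentences_containing(book_text, phrase):
            print("    ", sentence)
```

One caveat of this sketch: n-grams are counted across the whole text, so a phrase that straddles two sentences will show up in the counts but not in the sentence search.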
In your case, we can skip the 7-grams and 6-grams (they're mostly just super-long titles like "Chairman of the Joint Chiefs of Staff").
5-grams is where we start seeing really interesting patterns.
[... It then goes through 5-grams, 4-grams, 3-grams, and 2-grams, showing the types of things/patterns that can be found with each. ...]