MobileRead Forums - View Single Post

Tex2002ans · 05-03-2014, 08:48 PM

Quote:

Originally Posted by DebbyS

[...] but will also check into "regex" to see what it is and if I can use it as well

"Regex" = shorthand for "Regular Expressions".

It is a way to do "variable searches". So you can do things like:

Search: ([0-9])-([0-9])
Replace: \1–\2

Which says "Look for a digit 0 through 9 and 'capture it' in \1 + hyphen + a digit 0 through 9 and 'capture it' in \2."

Replace it with "the number that was captured in \1 + EN DASH + whatever number was captured in \2".
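If you ever want to try the same idea outside of an editor, here is a rough sketch in Python. The function name and the sample text are made up for illustration, and Python's regex flavor may differ slightly from whatever your editor uses, so treat it as a demonstration of the capture groups rather than a finished tool:

```python
import re

# EN DASH character (U+2013), the same one used in the Replace field above.
EN_DASH = "\u2013"

def hyphen_to_en_dash(text):
    """Replace a hyphen sitting between two digits with an en dash.

    ([0-9]) captures the digit before the hyphen as group 1 and
    ([0-9]) captures the digit after it as group 2, so both digits
    are put back unchanged around the en dash.
    """
    return re.sub(r"([0-9])-([0-9])", r"\1" + EN_DASH + r"\2", text)

# Hypothetical sample line: only the hyphen between digits is touched.
print(hyphen_to_en_dash("See pages 12-15 of this well-known book."))
```

Note that the hyphen in "well-known" is left alone, because it does not have a digit on both sides.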

Or I also use:

Search: [ ][b-z][ ]

Which says "Look for a SPACE + a single lowercase letter 'b' through 'z' + SPACE".

Typically in English, the only lowercase letter that stands by itself is the word "a". Anything else is most likely an OCR error.
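As a rough Python sketch of that same search (the function name and sample sentence are my own invention, purely for illustration), this only lists the suspicious spots and changes nothing, which matches how I use it:

```python
import re

def find_stray_letters(text):
    """List (position, match) pairs for a lone lowercase b-z between spaces.

    The letter 'a' is deliberately excluded from the character class,
    since it is a legitimate one-letter word.
    """
    return [(m.start(), m.group()) for m in re.finditer(r"[ ][b-z][ ]", text)]

# Hypothetical OCR output with a stray "c" in it.
for position, match in find_stray_letters("It was c dark night and a cold one."):
    print(position, repr(match))
```

Each hit still has to be checked by eye; the search only tells you where to look.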

Or I also use this one:

Search: [0-9]{5,}

Which says "look for 5 or more digits in a row".

Usually only Zip Codes are 5 digits or more. In most other cases, it is a large number that lost a punctuation mark in the OCR. For example "20000" -> "20,000".
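Here is a minimal Python sketch of the same search, again with a made-up function name and sample text. It just collects the long digit runs so you can look at each one and decide whether it is a Zip Code or a number missing its comma:

```python
import re

def find_long_digit_runs(text):
    """Return every run of 5 or more digits in a row.

    {5,} means "at least 5 of the preceding item", with no upper limit.
    """
    return re.findall(r"[0-9]{5,}", text)

# Hypothetical sample: "20000" is flagged, the 4-digit year is not.
print(find_long_digit_runs("He paid 20000 dollars back in 1999."))
```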

With Regex, you typically want to be VERY careful, and never press "Replace All" (unless you know EXACTLY what you are doing). I always do single "Find/Replace", and undo/redo, just to double-check and make sure that it is doing what I want.

And with many of these Regex, I just use them to help point out places that have very common errors (like those single lowercase b-z).

This is what I mean when I say using Regex to proofread is a lot faster, and it helps drastically cut down the number of errors you would have to find/fix on your own.

Here is a great resource to learn Regex: http://www.regular-expressions.info/tutorial.html

There is also this topic on the Sigil forum where they gathered a lot (although be aware, some of these are quite arcane): https://www.mobileread.com/forums/sho...d.php?t=167971

I am not too familiar with whatever Regex is used in Microsoft Word (I don't use Microsoft Word), but as I stated, the "idea" behind many of them is the same. For example, instead of using the symbol '^' for NOT, Word might use '!' instead.

Here is one of the first things that popped up when searching Microsoft Office Regex: https://office.microsoft.com/en-us/h...001087305.aspx

Quote:

Originally Posted by DebbyS

For my current project, I did a search for "any digit"o [any digit + oh] so I could see if the "o" should be "0" (zero).

Yep yep, it sounds like you are already tackling something similar in Word (just don't forget to take into account CAPITAL letter 'O' as well). Now you just have to step the complexity level one step up and save yourself more work!
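In regular Regex (not Word's own search syntax, which I can't vouch for), covering both the lowercase and capital letter could look something like this Python sketch. The function name and sample text are hypothetical:

```python
import re

def find_digit_then_oh(text):
    """Return every digit that is immediately followed by 'o' or 'O'.

    [oO] matches either case of the letter, so both "1o" and "2O"
    are flagged as places where a zero may have been misread.
    """
    return re.findall(r"[0-9][oO]", text)

# Hypothetical OCR output where zeros were read as the letter oh.
print(find_digit_then_oh("Room 1o4 is next to room 2O."))
```

As with the other searches, this only points out candidates. Something like "2 oz" written without the space would also be flagged, so each one needs a look.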

Quote:

Originally Posted by DebbyS

The OCR was also italicizing words it shouldn't have, but it was largely extending italicized words in the Huichol and Spanish languages to the next few English words,

Hmmm... in this book, is it typically only ONE Huichol or Spanish word that is in italics, or is it a whole Huichol and Spanish phrase, followed by English words?