
04-01-2013, 11:18 PM   #1
Gregg Bell
Gregg Bell

Posts: 800
Karma: 3218924
Join Date: Jan 2013
Location: Itasca, Illinois
Device: none
search question

Hey. I wanted to search for id (as in 'the cop flashed him his ID') in Sigil. (Other "Find" tools have something like 'find whole word only.') Anyway, I couldn't find anything like that in Sigil, so I got every id in every word, e.g. braid, staid, etc. Any way to find the exact word? Thanks!
04-02-2013, 12:34 AM   #2
Turtle91
Guru

Posts: 669
Karma: 3807234
Join Date: Dec 2012
Location: Shannon, Ireland today
Device: iPhone 5/iPad 1&2/Surface Pro/Kindle PW
Try caps only "ID" and/or put a space before: " ID"
04-02-2013, 12:39 AM   #3
GrannyGrump
Persnickity Nitpicker

Posts: 552
Karma: 2587520
Join Date: May 2011
Location: JAPAN (US expatriate)
Device: Sony PRS-T2, ADE on PC
I know the regex gurus will have some wonderful black magic to do this, but the simplest way I've found is to include the space before ID. As in, when you are in the find box, hit spacebar, then type ID. If it is always in caps, type it that way, and on the drop-down list, choose "case sensitive."

Now if you have several hundred words beginning with the letters "ID", you might have to wait for some regex magic. But meanwhile, you could use the "replace/find" to go through your document one word at a time, and if you strike one that isn't the ID you want to change, just click the "find" button to jump ahead.

EDIT --- I see the Mighty Turtle beat me, I am a slow typist.... The Tortoise and the Hare?

Last edited by GrannyGrump; 04-02-2013 at 01:01 AM.

04-02-2013, 03:23 AM   #4
Tex2002ans
Evangelist

Posts: 424
Karma: 360129
Join Date: Jul 2012
Device: Nook
Quote:
 Originally Posted by grannyGrumpy I know the regex gurus will have some wonderful black magic to do this, [...]
No black magic, just a "word boundary" (\b). More explanation can be found here:

http://www.regular-expressions.info/wordboundaries.html

This regex will match "ID" only where it stands as a whole word (so not the "id" buried inside "braid" or "staid"):

Code:
\bID\b
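For anyone who wants to try the same pattern outside Sigil, here is a minimal sketch using Python's re module. Sigil's Find & Replace uses its own regex engine, so treat this only as an illustration of how \b behaves. The sample sentence and variable names are made up for this example, and re.IGNORECASE stands in for leaving "case sensitive" switched off.

```python
import re

# Made-up sample text: "id" also appears inside "braid" and "staid".
text = "The cop flashed his ID. Her braid was staid, but the id was real."

# Plain substring search: finds "id" everywhere, ignoring case.
plain = re.findall(r"id", text, flags=re.IGNORECASE)

# Whole-word search: \b only allows a match at a word boundary.
whole = re.findall(r"\bid\b", text, flags=re.IGNORECASE)

print(len(plain))  # 4
print(whole)       # ['ID', 'id']
```

The plain search also hits "braid" and "staid"; the \b version skips them.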
I would also recommend using the great Spellcheck that was added into Sigil 0.7.0, which can list every single word in the EPUB.

Tools - Spellcheck - Spellcheck (Alt+Q)

Once at the Spellcheck screen, you can put a checkbox in "Show All Words". Then feel free to find whatever word you are looking for in the list, OR search for it by using the "Filter" box at the top.

You can then double click on the word in the list in order to jump to its position throughout the book.

04-02-2013, 02:38 PM   #5
Gregg Bell
Gregg Bell

Posts: 800
Karma: 3218924
Join Date: Jan 2013
Location: Itasca, Illinois
Device: none
thank you!

Quote:
 Originally Posted by Turtle91 Try caps only "ID" and/or put a space before: " ID"
Thanks Dion. It worked great.

Quote:
 Originally Posted by grannyGrumpy I know the regex gurus will have some wonderful black magic to do this, but the simplest way I've found is to include the space before ID. As in, when you are in the find box, hit spacebar, then type ID. If it is always in caps, type it that way, and on the drop-down list, choose "case sensitive." Now if you have several hundred words beginning with the letters "ID", you might have to wait for some regex magic. But meanwhile, you could use the "replace/find" to go through your document one word at a time, and if you strike one that isn't the ID you want to change, just click the "find " button to jump ahead. EDIT --- I see the Mighty Turtle beat me, I am a slow typist.... The Tortoise and the Hare?
Thanks granny. I was looking for lower case id. Your way worked perfectly.

Quote:
 Originally Posted by Tex2002ans No black magic, just using a "word boundary" (\b), more explanation can be found here: http://www.regular-expressions.info/wordboundaries.html This regex will match all cases of "ID": Code: \bID\b I would also recommend using the great Spellcheck that was added into Sigil 0.7.0, which can list every single word in the EPUB. Tools - Spellcheck - Spellcheck (Alt+Q) Once at the Spellcheck screen, you can put a checkbox in "Show All Words". Then feel free to find whatever word you are looking for in the list, OR search for it by using the "Filter" box at the top. You can then double click on the word in the list in order to jump to its position throughout the book.
Thanks Tex. Love the spell check feature and "show all words." And the regex thing \btext\b is just awesome. And thanks for the link. I saved it and will explore it further. This is my very first experience using Regex (I've always been a little intimidated by it) and thanks to you it's a great one. Appreciate it!

04-02-2013, 07:50 PM   #6
Tex2002ans
Evangelist

Posts: 424
Karma: 360129
Join Date: Jul 2012
Device: Nook
Quote:
 Originally Posted by Gregg Bell Thanks Tex. Love the spell check feature and "show all words."
Works even better than I first imagined/requested. I remember having to program a "word count" program along those lines back in high school as one of our introductory programs to learn about string input, arrays, and using loops. I always imagined it would be incredibly useful as a spellcheck... and it is!

I have zero idea why your typical word processor does not have spellcheck functionality anywhere near what is in Sigil currently.

Quote:
 Originally Posted by Gregg Bell And the regex thing \btext\b is just awesome.
Yeah, I learned about \b from someone on these forums (I forget the user), but once I saw it in usage, it was genius. (Perhaps it was in the sticky: Regex Examples.)

Quote:
 Originally Posted by Gregg Bell This is my very first experience using Regex (I've always been a little intimidated by it) and thanks to you it's a great one. Appreciate it!
It is EXTREMELY powerful, and you have to be very careful sometimes to make sure you do not delete important information, especially when using symbols such as "+" (one or more of the preceding item) or "*" (zero or more of the preceding item).

I learned most of my stuff from the Regex Tutorial:

http://www.regular-expressions.info/tutorial.html

I have a big Regex collection that I use all the time, most notably:

- A Sigil "group" to clean up all the ABBYY Finereader cruft
- Swapping footnotes from superscript footnotes -> [#] format
- Combining broken paragraphs (happens VERY often in OCR)
- "Fixing" the TOC code from Sigil (auto changing the Sigil format to match my "toc" classes in my CSS).

This is one that I use quite often to fix "en dashes" (See https://en.wikipedia.org/wiki/Dash#En_dash):

Search:

Code:
([0-9])-([0-9])
Replace:

Code:
\1–\2
This Regex will look for two numbers separated by a hyphen, and replace the hyphen with an en dash. I step through and replace one by one to make sure that the en dash belongs. Very helpful for adding them between page numbers/years.
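The search/replace pair above can also be sketched in Python's re module, which uses the same \1 and \2 backreferences in the replacement. This is only an illustration under assumptions: the function name and the sample sentence are invented, and re.sub changes every match at once (the equivalent of "Replace All"), whereas in Sigil I step through one at a time.

```python
import re

# Same search pattern as above: a digit, a hyphen, a digit.
pattern = re.compile(r"([0-9])-([0-9])")

def hyphen_to_en_dash(text: str) -> str:
    # \1 and \2 put back the digits captured on either side of the hyphen.
    return pattern.sub(r"\1–\2", text)

sample = "See pages 10-12, written in 1914-1918 by a well-known author."
print(hyphen_to_en_dash(sample))
# See pages 10–12, written in 1914–1918 by a well-known author.
```

Hyphens between letters ("well-known") are left alone because the pattern requires a digit on both sides.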

04-02-2013, 11:10 PM   #7
GrannyGrump
Persnickity Nitpicker

Posts: 552
Karma: 2587520
Join Date: May 2011
Location: JAPAN (US expatriate)
Device: Sony PRS-T2, ADE on PC
Tex2002ans, thank you for links and samples. Most helpful.

04-02-2013, 11:34 PM   #8
Gregg Bell
Gregg Bell

Posts: 800
Karma: 3218924
Join Date: Jan 2013
Location: Itasca, Illinois
Device: none
thanks Tex

Quote:
 Originally Posted by Tex2002ans I have zero idea why your typical word processor does not have spellcheck functionality anywhere near what is in Sigil currently.
Tex, Thanks for all the links and explanations and the warning about deleting data (I can sense how powerful Regex is).

A couple of quick questions about the Sigil spell check. I haven't been able to add contractions, and I have a lot of them, to the default dictionary. Know of a way?

And how do you find words with accents (not necessarily in Regex)? Words like cafe or elan or facade (with the funky thing on the bottom--I don't even know what it's called.).

Thanks for sharing all this great stuff.

04-03-2013, 01:23 AM   #9
Tex2002ans
Evangelist

Posts: 424
Karma: 360129
Join Date: Jul 2012
Device: Nook
Quote:
 Originally Posted by Gregg Bell Tex, Thanks for all the links and explanations and the warning about deleting data (I can sense how powerful Regex is).
But it is great for catching errors that are impossible to catch using normal search (like missing closing quotation marks), or cleaning up code cruft (whenever I run into Calibre code I fly into a Regex rage).

Quote:
 Originally Posted by Gregg Bell A couple of quick questions about the Sigil spell check. I haven't been able to add contractions, and I have a lot of them, to the default dictionary. Know of a way?
There was a topic a few weeks ago about contractions:

Someone needs to be kind enough to create a Sigil group that everyone can use to easily convert them back and forth, and catch any common missing apostrophes. That would be extremely helpful.

Also, I believe in 0.7.1 (?) the spellcheck dealing with words including apostrophes became broken again. If you "Ignore" the word it still says it is spelled wrong. I would just wait until this bug is fixed again like it was in 0.7.0.

I forget the explanation that was given for the change to "fix apostrophes" (I believe it was to fix a certain foreign language). (If I remember correctly the explanation was given in one of those Sigil release topics).

Quote:
 Originally Posted by Gregg Bell And how do you find words with accents (not necessarily in Regex)?
There are lots of ways to do this. (I can give you the Regex method if you want, although Unicode Regex can get a little ugly (and I don't mess around with it too much so the Regexes will not be time tested by me)).
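As a rough, untested-in-Sigil idea of what such a regex could look like, here is a small Python sketch. It assumes the accented letters are stored as single precomposed characters (é, ç) rather than a plain letter plus a combining accent, and it simply treats "anything outside the ASCII range" as an accented character. The sample sentence is made up.

```python
import re

# Hypothetical sample; in practice this would be the text of the book.
text = "We met at the café. She had élan, and the façade was plain stone."

# A whole word (\w*) that contains at least one non-ASCII character.
accented_word = re.compile(r"\b\w*[^\x00-\x7F]\w*\b")

print(accented_word.findall(text))  # ['café', 'élan', 'façade']
```

Sigil's regex engine may treat \w and Unicode differently from Python, so the pattern would need checking there before relying on it.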

I personally use two ways. I just use the Sigil spellcheck list to spot words containing funky characters.

OR

I use Tools - Reports - "Characters in HTML Files". The "Characters in HTML Files" will show you every single character that is actually used in the files. I then search through this quickly for any funky ones. You can then double click and search through your EPUB to find all instances of it. For example, quite often OCR adds in double angle quotation marks « ». I can easily spot these in the HTML character list, and fix them up.
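The idea behind that report can be sketched in a few lines of Python. This is not how Sigil implements it, just an illustration: the function name and sample text are invented, and it only lists characters outside plain ASCII, using the standard unicodedata module to look up each character's official name.

```python
import unicodedata
from collections import Counter

def non_ascii_report(text: str) -> list:
    """Return (character, Unicode name, count) for every non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in sorted(counts.items())
    ]

sample = "«Voilà», she said at the café. The façade of the café was grey."
for ch, name, n in non_ascii_report(sample):
    print(f"{ch}  U+{ord(ch):04X}  {name}  x{n}")
```

Stray « » from OCR show up in the list straight away, along with their names, so you do not need to know what a character is called before you can find it.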

Quote:
 Originally Posted by Gregg Bell Words like cafe or elan or facade (with the funky thing on the bottom--I don't even know what's called.).
Since most of my work is done from the actual PDF scans, I usually am the one inserting all the foreign characters as I find them in ABBYY Finereader. Most of the time I just look up and copy/paste from Wikipedia/Fileformat.info.

Allow me to Quote myself from that previous topic I helped you in:

Quote:
 Originally Posted by Tex2002ans There are also very nice lists of characters with accents. I constantly keep tabs open in Firefox for the Wikipedia pages for Macron, Grave accent, Acute accent, Diaeresis, Circumflex, Caron, Dagger.

The 'c' with a funky squiggly below it 'ç' is a cedilla (one of the great things of fileformat.info is you can type in any character, and get almost every derivation/symbol of it). For example, I just searched the letter 'c', and figured out what the squiggly was:

http://www.fileformat.info/info/unic...preview=entity

Quote:
 Originally Posted by Gregg Bell Thanks for sharing all this great stuff.
No problem, the goal is to make everyone better at making higher quality books more quickly, and giving everyone the skills for more thorough error checking.

Last edited by Tex2002ans; 04-03-2013 at 02:03 AM.

04-03-2013, 02:31 PM   #10
Gregg Bell
Gregg Bell

Posts: 800
Karma: 3218924
Join Date: Jan 2013
Location: Itasca, Illinois
Device: none
thanks

Quote:
 Originally Posted by Tex2002ans No problem, the goal is to make everyone better at making higher quality books more quickly, and giving everyone the skills for more thorough error checking.
Tex, Thanks for all the info. I looked at all of the links and will go over them later to gather more of what they are saying.

A couple of questions though. I'm in the final stages of proofreading a novel. I was about one third through it. Then, having fallen totally in love with your \btext\b Regex search tool, I started experimenting with it and looking for various things, trying to really get a feel for how it might help me. One of the things I did was put in various punctuation marks, including a straight quotation mark ("), which I often inadvertently put in when editing. Well, I finished up last night and when I came to the mss. in the morning I saw one straight quotation mark right in the very beginning of the book. (As I recall in the code it was not bracketed. It was just plopped down next to the end bracket for Chapter One.)

In that I was doing a final proofread this threw me. I of course wondered if the regex searching had added any other little things. I know you warned about regex deleting things, but can it add things as well? (I really can't think of any other possible way that quotation mark could have got there. And I have started proofreading again from the beginning and I'd say I'm about one-sixth through now and I have not seen any additional things that shouldn't be there.)

And a follow-up question: Perhaps (if indeed Regex can add things) it would be wise to only use Regex in the beginning phases of cleaning a document up?

And I'm also a little concerned about how and what it might delete. (Yes, it seems great but scary! And remember I'm just doing my own books--and really they're pretty clean to begin with. Maybe I should leave Regex to pros like you?)

Thanks!

04-03-2013, 03:43 PM   #11
Tex2002ans
Evangelist

Posts: 424
Karma: 360129
Join Date: Jul 2012
Device: Nook
Quote:
 Originally Posted by Gregg Bell A couple of questions though. I'm in the final stages of proofreading a novel. I was about one third through it. Then, having fallen totally in love with your \btext\b Regex search tool, I started experimenting with it and looking for various things, trying to really get a feel for how it might help me. One of the things I did was put in various punctuation marks, including a straight quotation mark ("), which I often inadvertently put in when editing.
Well now, this is where you can't just go using any Regex under the sun without UNDERSTANDING what it is actually doing first. Regex is extremely powerful.

\btext\b should only be used if you want to find a SPECIFIC WORD. That regex tutorial I linked above uses this example sentence:

Code:
This island is beautiful
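If you did a typical search for "is" in your book, you would get 3 matches: the "is" inside "This", the "is" that begins "island", and the standalone word "is".

If you use the regex "\bis\b", you ONLY get the last one (the EXACT WORD "is").

In English, the "\b" in a regex means:

At this position there is a WORD BOUNDARY: a word character (letter, digit, or underscore) on one side, and on the other side a space OR punctuation mark (!?,."'<> ......) OR pretty much any NON-WORD character, or the very start/end of the text. The "\b" itself does not use up any character; it only checks the position.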

Case 1:

Code:
\bis\b
In English, this says: first check for a WORD BOUNDARY, then look for a lowercase 'i', then look for a lowercase 's', then check for another WORD BOUNDARY.

(In the example sentence above, this matches only the standalone word "is".)

Case 2:

Code:
\bis
In English, this says: first check for a WORD BOUNDARY, then look for a lowercase 'i', then look for a lowercase 's'. In the sentence below, it matches the "is" that begins "island" and the standalone word "is":

Code:
This island is beautiful
Case 3:

Code:
is\b
In English, this says: first look for a lowercase 'i', then look for a lowercase 's', then check for a WORD BOUNDARY. In the sentence below, it matches the "is" that ends "This" and the standalone word "is":

Code:
This island is beautiful
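The three cases can be checked with a short Python sketch. Python's re module is used here only because it is easy to run; the helper name match_starts is made up, and the numbers are character positions counted from 0 in the example sentence.

```python
import re

sentence = "This island is beautiful"

def match_starts(pattern: str) -> list:
    """Return the start position of every match of pattern in the sentence."""
    return [m.start() for m in re.finditer(pattern, sentence)]

print(match_starts(r"is"))      # [2, 5, 12]  plain search: all three
print(match_starts(r"\bis\b"))  # [12]        Case 1: the word "is" only
print(match_starts(r"\bis"))    # [5, 12]     Case 2: "is" at the start of a word
print(match_starts(r"is\b"))    # [2, 12]     Case 3: "is" at the end of a word
```

Position 2 is the "is" in "This", position 5 is the start of "island", and position 12 is the standalone word.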
If you wanted to look for STRAIGHT QUOTES, it gets a little uglier. This is more complicated with Regex because straight quotes are used all over the place in the actual code (class names and other attribute values are surrounded by straight quotes). What would probably be easiest is searching the actual Word document/whatever you typed for the straight quotes, and then going over to Sigil to fix them.

I can go through explaining a straight quote regex for you if you want. But I don't want you running around ruining your book!
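For the curious, here is one possible sketch of such a search, written in Python rather than tested in Sigil. It is a heuristic, not a guaranteed method: it assumes the code is well formed and that no literal "<" or ">" appears in the running text. It only FINDS the quotes; it does not replace anything. The sample line and variable names are invented.

```python
import re

# Made-up line of book code: straight quotes in an attribute AND in the text.
html = '<p class="noindent">He said "stop" and left.</p>'

# A straight quote that is NOT followed by ">" before the next "<",
# i.e. a quote that does not sit inside an HTML tag.
quote_in_text = re.compile(r'"(?![^<>]*>)')

positions = [m.start() for m in quote_in_text.finditer(html)]
print(positions)  # [28, 33]
```

The two quotes around the class name are skipped, and only the two around "stop" are reported.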

Quote:
 Originally Posted by Gregg Bell In that I was doing a final proofread this threw me. I of course wondered if the regex searching had added any other little things. I know you warned about regex deleting things, but can it add things as well?
Yep, if you are not careful you can add things if you don't use the proper Replace (especially if you don't know what you are doing and start using more complex Searches).

Punctuation in Regex gets much uglier (you have to be very careful because many punctuation marks MEAN something in regex). Example of the most common ones:

. = Any single character (except a line break, by default)
+ = One or more of the preceding item
* = Zero or more of the preceding item

What most likely happened is that the punctuation mark you inserted completely changed the meaning of the regex, which began messing some things up.
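A small Python sketch shows how much a single punctuation mark changes a regex. The sample text is made up. The last part uses re.escape, a Python helper that adds the backslashes needed to search for punctuation literally; in Sigil you would type the backslash yourself (for example \. or \?).

```python
import re

text = "Wait... what? Is 2+2=4? Yes."

# Unescaped, "." means "any character", so it matches every character.
print(len(re.findall(r".", text)) == len(text))   # True

# Escaped, "\." means a literal full stop: three in "..." plus the last one.
print(len(re.findall(r"\.", text)))               # 4

# "*" allows zero repetitions, "+" requires at least one.
print(bool(re.fullmatch(r"ab*", "a")))            # True
print(bool(re.fullmatch(r"ab+", "a")))            # False

# re.escape() makes "+" and "?" literal so the text is found as typed.
literal = re.escape("2+2=4?")
print(re.findall(literal, text))                  # ['2+2=4?']
```

Without the escaping, "2+2=4?" would mean "one or more 2s, then 2=, then an optional 4", which does not match the text at all.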

You had better be saving lots of backups before running these regexes; I don't want you accidentally deleting sections and not being able to get them back. ALWAYS save an alternate copy before messing with things.

Quote:
 Originally Posted by Gregg Bell And a follow-up question: Perhaps (if indeed Regex can add things) it would be wise to only use Regex in the beginning phases of cleaning a document up?
I would not recommend Regex unless you know what you are doing, or are EXTREMELY careful (and do very thorough testing). And NEVER "Replace All" unless it is a very time tested Regex and you know EXACTLY what it does.
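One way to do that kind of testing outside Sigil is a "dry run" that lists what each replacement WOULD do without touching the text. The sketch below is an assumption-laden illustration in Python: preview_replacements is a made-up helper, and the sample sentence is invented to show a match you would not want to change.

```python
import re

def preview_replacements(pattern, replacement, text, context=15):
    """List (before, after) snippets for each match without changing text."""
    previews = []
    for m in re.finditer(pattern, text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        before = text[start:end]
        # m.expand() fills \1, \2 ... in the replacement for this one match.
        after = text[start:m.start()] + m.expand(replacement) + text[m.end():end]
        previews.append((before, after))
    return previews

sample = "Chapters 3-4 cover 1914-1918; call 555-0100 for details."
for before, after in preview_replacements(r"([0-9])-([0-9])", r"\1–\2", sample):
    print(f"{before!r}\n  -> {after!r}\n")
```

The preview shows three hits, including the phone number, which is exactly the sort of match that makes "Replace All" risky.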

It is sort of like when you copy/paste commands that you find online to run things on the commandline. You should really KNOW EXACTLY what the command is telling your computer to do BEFORE you run the command. The command CAN be powerful enough to erase every single directory, but since you don't understand it at all, you just copy/paste and run it!!!

As you can see, in Case 1, I ONLY get the exact word "is", in Case 2, I can get every single word that begins with "is", in Case 3, I can get every single word that ends with "is".

The Regexes almost look exactly the same but they are wildly different.

Quote:
 Originally Posted by Gregg Bell And I'm also a little concerned about how and what it might delete. (Yes, it seems great but scary! And remember I'm just doing my own books--and really they're pretty clean to begin with. Maybe I should leave Regex to pros like you?)
Good ol search and replace, and the other easy tools already available (Spellcheck, that "Characters" Report I mentioned, normal Search and Replace, ...), will probably help you a lot more. You are already working with a very clean document, I don't believe there would be too many mistakes in there. It is not like you are working from an OCR which will introduce many errors which need fixing.

If you need someone else to take a look at your book for you (I might be able to catch a few mistakes), feel free to send your book my way.

Feel free to email me at (my username) @gmail.com

04-03-2013, 05:17 PM   #12
meme
Sigil developer

Posts: 1,275
Karma: 1101600
Join Date: Jan 2011
Location: UK
Device: Kindle PW, K4 NT, K3, Kobo Touch
Always remember that replacing is done in Code View and you can easily delete HTML code tags or attributes if you aren't careful.