Regex Question

Archon · 01-25-2011, 11:09 PM

I am working on a document that has had all the italics formatting removed and replaced with _foo_.

There are two categories of words:
Those with characters after the word and no space.
Those with spaces after the word in the middle of a sentence.

I can find all occurrences with:
_(.+)_

I can replace those with characters after the word with:
\\i \1\\i0
(so there is no space after the word for characters like periods and commas)

I can replace those with spaces after the word with:
\\i \1 \\i0
(note the space between the 1 and first backslash so there is still a space there)

Example words:
_You_
(with a space after) and

_genocide_.
with a period after.

OK
So how can I efficiently find all occurrences, yet replace the ones with a space after the word with the corresponding "\\i \1 \\i0" and the ones with a character other than a space after the word with "\\i \1\\i0", in a single pass of the file?

I seem to need something like an "if then else " statement in regex.

Thanks for stopping by.
Free beer tomorrow.
Archon
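One way to get the "if then else" behaviour asked for above is to pass a function as the replacement instead of a string. The sketch below assumes Python's re module (the flavour Calibre uses, as noted later in the thread), assumes the output should carry single-backslash RTF control words, and uses a [^_]+ pattern and function names that are illustrative rather than taken from the thread:

```python
import re

# Text between a pair of underscores, with no underscore inside.
ITALIC = re.compile(r"_([^_]+)_")

def to_rtf_italic(match):
    """Choose the replacement from the character right after the match."""
    word = match.group(1)
    # Character following the closing underscore ("" at end of text).
    following = match.string[match.end():match.end() + 1]
    if following == " ":
        # A space follows: keep a space before \i0, as in the second replacement.
        return "\\i " + word + " \\i0"
    # Anything else (period, comma, end of text): no extra space.
    return "\\i " + word + "\\i0"

def convert(text):
    # A function replacement is inserted literally, so no \1 escapes are needed.
    return ITALIC.sub(to_rtf_italic, text)

print(convert("_You_ are here."))
print(convert("It was _genocide_."))
```

The match itself never includes the trailing character, so the space or period after the word is left in place either way; only the inserted text differs.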

ldolse · 01-26-2011, 12:23 AM

Based on the examples provided I'm not sure why _(\w+?)_ and replace with <i>\1</i> won't work for you.

Have you tried enabling 'Italicize common cases' under the Heuristics section of the conversion settings? That feature is supposed to handle this for you automatically. If you have cases where it isn't working you can open a bug with an example file at bugs.calibre-ebook.com.

Archon · 01-26-2011, 07:19 AM

Sorry, I should clarify.

The file I am working on is an rtf file. So, the replacement example you gave me I believe is HTML?

Also as I tried to explain in my original post the words for replacement fall into two categories. Those with spaces after the word and those with other characters. I am trying to write an expression that would work for both categories in one pass.

Would your replacement example work for both categories in HTML in one pass? If that is true I can switch to editing in HTML if it is more forgiving.

From my experiments, your search suggestion works great and can be used for one category or the other in one pass. I know I can just make two passes but I was curious if there might be a better way.

I am doing this with a regex-enabled text editor for the moment to actually learn regex before I have Calibre do it for me, and to better understand what exactly Calibre is doing under the hood so I can tweak it if I have to before cutting Calibre loose on multiple documents.

Thanks for the help.
Archon

Manichean · 01-26-2011, 07:44 AM

Quote:

Originally Posted by Archon

I am doing this with a regex-enabled text editor for the moment to actually learn regex before I have Calibre do it for me, and to better understand what exactly Calibre is doing under the hood so I can tweak it if I have to before cutting Calibre loose on multiple documents.

Be aware, though, that the editor may actually use a different "dialect" for its regular expressions than the Python flavour Calibre uses. For example, I used to test my expressions on Notepad++ and got quite confused about multiline matching, until I found out that the regex lib they used in Notepad++ doesn't support multiline matching.
Also, keep in mind that the search & replace in Calibre takes place on the XHTML interstage Calibre generates during conversion, so you'd obviously need to use HTML then.
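As a small illustration of how flavour-specific line handling can be, here is a sketch of the default behaviour in Python's re module: "." does not match a newline unless re.DOTALL is set. The sample sentence is made up for the example, and other editors may behave differently.

```python
import re

text = "He read _Alice in\nWonderland_ twice."

# By default "." stops at a newline, so the phrase split over two lines is missed.
single_line = re.sub(r"_(.*?)_", r"<i>\1</i>", text)

# With re.DOTALL "." also matches newlines, so the phrase is found.
across_lines = re.sub(r"_(.*?)_", r"<i>\1</i>", text, flags=re.DOTALL)

print(repr(single_line))
print(repr(across_lines))
```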

ldolse · 01-26-2011, 08:34 AM

Generally people on these forums don't do a lot of work with formats that aren't text based. If you convert from rtf to epub using Calibre you can take advantage of Calibre's search and replace feature, which uses the syntax I suggested, as rtf is already converted to basic html at that stage. You can also take advantage of the italicize feature I mentioned and numerous other features designed to make your life easier.

Once it's converted to epub you can do your final edits in Sigil, which is a dedicated ebook editor. It uses a similar regex syntax, and has another sub-forum here on mobileread with users who can help you.

If your final format is intended to be mobi, then you can use Calibre to do one last epub to mobi conversion. This is what a lot of other kindle users are doing.

p.s. - didn't really understand your comment about multiple words - the regex I proposed should work based on the examples you put in your first post - if I'm missing something maybe you need to elaborate. Switching to '\w+?' instead of '.+' which was in your original regex should cause the engine to just match 'words' instead of other characters which I believe may have been tripping you up.

Archon · 01-26-2011, 08:16 PM

Just to report back.

The suggestion you had for html works for those with trailing spaces.

However it does not find those with trailing characters such as:

style="margin-left:;"/>"_Alice in Wonderland_." He had heard so

I tried the heuristics processing and the search and replace function.

No Joy.

Thanks for the input
Archon

ldolse · 01-26-2011, 08:32 PM

You need to make a character class which includes spaces then:

_([\w\s]+?)_

Note that will wrap around lines in some programs, not sure if you want that - alternative is matching on the literal space character.
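A quick sketch of the difference between the two patterns, using Python's re module and the sample line from the previous post (the variable names are only for illustration):

```python
import re

sample = '"_Alice in Wonderland_." He had heard so'

# \w+? cannot cross the spaces inside the phrase, so nothing is replaced.
words_only = re.sub(r"_(\w+?)_", r"<i>\1</i>", sample)

# [\w\s]+? allows word characters and whitespace, so the whole phrase matches.
words_and_spaces = re.sub(r"_([\w\s]+?)_", r"<i>\1</i>", sample)

print(words_only)
print(words_and_spaces)
```

Because \s also matches newlines, the second pattern can span line breaks, which is the wrapping behaviour mentioned above.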

Archon · 01-26-2011, 10:11 PM

Woohoo

Yeah, that last one you gave me worked pretty well; it got a lot of them.

Then I modified it to this:
_([\w\S]+?)_

and got a bunch more.

I have never seen so much italics in a book.

Thanks for the help
I will go back to reading "Mastering Regular Expressions" now and see if I can learn some new tricks.
Archon

ldolse · 01-26-2011, 10:44 PM

I'm pretty sure \S is anything NOT a space - not one I would recommend using, as you'll have the overmatching problem again.

If italicize common cases under heuristics didn't work then open a bug with the source format of the book at bugs.calibre-ebook.com. The feature is new, so it's not been stressed by a lot of real world data just yet.
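To see what each character class lets through, here is a rough comparison in Python's re module, where \S is any non-whitespace character. The two sample strings and the italicize helper are made up for the example, and the [^_]+ pattern is an extra alternative not suggested in the thread:

```python
import re

def italicize(pattern, text):
    """Wrap whatever the pattern captures in <i> tags."""
    return re.sub(pattern, r"<i>\1</i>", text)

phrase = "_Alice in Wonderland_"
contraction = "_don't_"

# [\w\s] allows spaces but not punctuation such as the apostrophe.
print(italicize(r"_([\w\s]+?)_", phrase))
print(italicize(r"_([\w\s]+?)_", contraction))

# [\w\S] allows punctuation but no whitespace, so multi-word phrases are missed.
print(italicize(r"_([\w\S]+?)_", phrase))
print(italicize(r"_([\w\S]+?)_", contraction))

# [^_] allows anything except an underscore, which covers both samples.
print(italicize(r"_([^_]+)_", phrase))
print(italicize(r"_([^_]+)_", contraction))
```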

Archon · 02-02-2011, 09:13 AM

Just an update.

I was able to come up with a regex that catches everything inside the underscores like _foo_ and replaces it with italicize tags.

The search expression:
_(.*?)_

And replace expression for HTML italicize tags:
<i>\1</i>

This S&R is not greedy and it leaves all characters outside the underscores alone.

Thanks kind people for the help.
Archon
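A short sketch of why the non-greedy form matters, comparing the original _(.+)_ from the first post with _(.*?)_ in Python's re module (the sample line combines the two example words from the first post):

```python
import re

text = "_You_ and _genocide_."

# Greedy: .+ runs to the last underscore on the line and swallows both words.
greedy = re.sub(r"_(.+)_", r"<i>\1</i>", text)

# Non-greedy: .*? stops at the first closing underscore, one word per match.
lazy = re.sub(r"_(.*?)_", r"<i>\1</i>", text)

print(greedy)
print(lazy)
```

One thing to watch for: because .*? can match nothing at all, a doubled underscore "__" in the text would become an empty <i></i> pair.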

Archon · 02-05-2011, 09:19 AM

Just a last update to this.

With more experimentation and reading I learned that the _foo_ format is used by Markdown and Textile markup languages to signify italics or emphasis in a text file.

I have a few .txt files with this markup in the document, so I looked in Calibre and found, under "Preferences:Conversions:Input Options:TXT Input", a pulldown box under formatting style to select either Markdown or Textile formatting (instead of auto).

Either of these options converts _foo_ into an italicized foo as the txt file is converted to epub (or presumably any other format).

Hopefully this will save someone else a little time on this but I have learned considerably nonetheless from nattering over this problem.

Archon

user_none · 02-05-2011, 10:13 AM

Hm... Auto and heuristic type should be converting _ to italic text. Looks like it broke at some point. I'll get that fixed.
