Old 01-25-2011, 11:09 PM   #1
Archon
Zealot
Archon , Klaatu Barada Niktu!
Posts: 110
Karma: 5176
Join Date: Dec 2010
Device: Mac OSX, iPad, iPod, & Nook
Regex Question

I am working on a document that has had all the italics formatting removed and replaced with _foo_.

There are two categories of words:
Those with characters after the word and no space.
Those with spaces after the word in the middle of a sentence.

I can find all occurrences with:
_(.+)_

I can replace those with characters after the word with:
\\i \1\\i0
(so there is no space after the word for characters like periods and commas)

I can replace those with spaces after the word with:
\\i \1 \\i0
(note the space between the 1 and first backslash so there is still a space there)

Example words:
_You_
(with a space after) and

_genocide_.
with a period after.

OK
So how can I efficiently find all occurrences, yet replace the ones followed by a space with "\\i \1 \\i0" and the ones followed by any other character with "\\i \1\\i0", in a single pass over the file?

I seem to need something like an "if then else " statement in regex.
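
One way to get "if then else" behaviour in a single pass is to make the replacement a function rather than a fixed string. The sketch below assumes Python's re module; a text editor's replace box may not offer anything equivalent. The helper name to_rtf_italics is made up for illustration, and the RTF codes are simply the ones from the two replacements above.

```python
import re

# _word_ followed by an optional single space, which is captured separately.
UNDERSCORED = re.compile(r"_(.+?)_( ?)")

def to_rtf_italics(text):
    """Replace _foo_ with RTF italic codes in one pass (illustrative helper)."""
    def repl(match):
        body, space = match.group(1), match.group(2)
        # If a space followed the closing underscore, also put one before \i0,
        # as in the second replacement above; otherwise add nothing extra.
        return "\\i " + body + space + "\\i0" + space
    return UNDERSCORED.sub(repl, text)

print(to_rtf_italics("_You_ are here"))  # \i You \i0 are here
print(to_rtf_italics("_genocide_."))     # \i genocide\i0.
```

The captured space is empty when a period or comma follows, so the same function covers both categories.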

Thanks for stopping by.
Free beer tomorrow.
Archon
Old 01-26-2011, 12:23 AM   #2
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Based on the examples provided, I'm not sure why searching for _(\w+?)_ and replacing with <i>\1</i> won't work for you.

Have you tried enabling 'Italicize common cases' under the Heuristics section of the conversion settings? That feature is supposed to handle this for you automatically. If you have cases where it isn't working you can open a bug with an example file at bugs.calibre-ebook.com.
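
As a quick check of that suggestion outside Calibre, here is a minimal sketch using Python's re module. The sample sentence is invented from the two example words in the first post. Because the match ends at the closing underscore, whatever follows (a space or a period) is left alone.

```python
import re

# Sample text built from the two example words in the first post.
sample = "_You_ are here, and _genocide_."

# The match stops at the closing underscore, so the character after it
# (space or punctuation) is never touched by the replacement.
result = re.sub(r"_(\w+?)_", r"<i>\1</i>", sample)
print(result)  # <i>You</i> are here, and <i>genocide</i>.
```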
Old 01-26-2011, 07:19 AM   #3
Archon
Zealot
Sorry, I should clarify.

The file I am working on is an RTF file. I believe the replacement example you gave me is HTML?

Also, as I tried to explain in my original post, the words for replacement fall into two categories: those with a space after the word and those with some other character. I am trying to write an expression that would work for both categories in one pass.

Would your replacement example work for both categories in HTML in one pass? If that is true I can switch to editing in HTML if it is more forgiving.

From my experiments, your search suggestion works great and can be used for one category or the other in one pass. I know I can just make two passes but I was curious if there might be a better way.

I am doing this with a regex-enabled text editor for the moment, to actually learn regex before I have Calibre do it for me, and to better understand what Calibre is doing under the hood so I can tweak it if I have to before cutting Calibre loose on multiple documents.

Thanks for the help.
Archon
Old 01-26-2011, 07:44 AM   #4
Manichean
Wizard
Manichean is the 'tall, dark, handsome stranger' all the fortune-tellers are referring to.
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by Archon View Post
I am doing this with a regex-enabled text editor for the moment, to actually learn regex before I have Calibre do it for me, and to better understand what Calibre is doing under the hood so I can tweak it if I have to before cutting Calibre loose on multiple documents.
Be aware, though, that the editor may actually use a different "dialect" for its regular expressions than the Python flavour Calibre uses. For example, I used to test my expressions on Notepad++ and got quite confused about multiline matching, until I found out that the regex lib they used in Notepad++ doesn't support multiline matching.
Also, keep in mind that the search & replace in Calibre takes place on the XHTML interstage Calibre generates during conversion, so you'd obviously need to use HTML then.
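
To illustrate the kind of dialect difference meant here, this is a small sketch of how the Python flavour treats line breaks. By default "." does not match a newline, so a pattern like _(.+?)_ will not span two lines unless the DOTALL flag is set. Which flags Calibre itself applies is not stated in this thread, so treat this as a demonstration of plain Python only. The sample string is invented.

```python
import re

# An underscore pair that is split across two lines.
text = "_two\nlines_"

# Default: "." stops at the newline, so nothing is found.
default = re.findall(r"_(.+?)_", text)

# With DOTALL, "." also matches "\n" and the pair is found.
dotall = re.findall(r"_(.+?)_", text, flags=re.DOTALL)

print(default)  # []
print(dotall)   # ['two\nlines']
```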
Old 01-26-2011, 08:34 AM   #5
ldolse
Wizard
Generally people on these forums don't do a lot of work with formats that aren't text based. If you convert from rtf to epub using Calibre you can take advantage of Calibre's search and replace feature, which uses the syntax I suggested, as rtf is already converted to basic html at that stage. You can also take advantage of the italicize feature I mentioned and numerous other features designed to make your life easier.

Once it's converted to epub you can do your final edits in Sigil, which is a dedicated ebook editor. It uses a similar regex syntax, and has another sub-forum here on mobileread with users who can help you.

If your final format is intended to be mobi, then you can use Calibre to do one last epub to mobi conversion. This is what a lot of other kindle users are doing.

p.s. - didn't really understand your comment about multiple words - the regex I proposed should work based on the examples you put in your first post - if I'm missing something maybe you need to elaborate. Switching to '\w+?' instead of '.+' which was in your original regex should cause the engine to just match 'words' instead of other characters which I believe may have been tripping you up.
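
The difference between the two patterns can be seen in a short Python sketch. The sample sentence is invented from the example words in the first post. The greedy ".+" runs to the last underscore on the line, while the lazy "\w+?" stops at the first closing underscore and only accepts word characters.

```python
import re

sample = "_You_ are here, and _genocide_."

# Greedy: ".+" grabs as much as it can, so one match swallows both words.
greedy = re.findall(r"_(.+)_", sample)

# Lazy and restricted to word characters: one match per word.
words = re.findall(r"_(\w+?)_", sample)

print(greedy)  # ['You_ are here, and _genocide']
print(words)   # ['You', 'genocide']
```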

Last edited by ldolse; 01-26-2011 at 08:36 AM.
Old 01-26-2011, 08:16 PM   #6
Archon
Zealot
Just to report back.

The suggestion you had for html works for those with trailing spaces.

However it does not find those with trailing characters such as:

style="margin-left:;"/>"_Alice in Wonderland_." He had heard so

I tried the heuristics processing and the search and replace function.

No Joy.

Thanks for the input
Archon

Last edited by Archon; 01-26-2011 at 08:44 PM.
Old 01-26-2011, 08:32 PM   #7
ldolse
Wizard
You need to make a character class which includes spaces then:

_([\w\s]+?)_

Note that will wrap around lines in some programs, not sure if you want that - alternative is matching on the literal space character.
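
A minimal Python sketch of both points, using the "Alice in Wonderland" fragment from the previous post. The two-line string is an invented example. "\w+?" alone fails on a multi-word title, "[\w\s]+?" handles it, and swapping "\s" for a literal space keeps matches from wrapping across lines.

```python
import re

sample = '"_Alice in Wonderland_." He had heard so'

# Word characters only: the space inside the title blocks the match.
words_only = re.sub(r"_(\w+?)_", r"<i>\1</i>", sample)

# Word characters plus whitespace: the whole title is matched.
with_space = re.sub(r"_([\w\s]+?)_", r"<i>\1</i>", sample)

print(words_only)  # "_Alice in Wonderland_." He had heard so
print(with_space)  # "<i>Alice in Wonderland</i>." He had heard so

# "\s" includes newlines, so a pair split across lines still matches;
# a literal space in the class does not cross the line break.
wrapped = "_two\nlines_"
any_whitespace = re.findall(r"_([\w\s]+?)_", wrapped)
space_only = re.findall(r"_([\w ]+?)_", wrapped)

print(any_whitespace)  # ['two\nlines']
print(space_only)      # []
```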
Old 01-26-2011, 10:11 PM   #8
Archon
Zealot
Woohoo

Yeah, that last one you gave me worked pretty well; it got a lot of them.

Then I modified it to this:
_([\w\S]+?)_

and got a bunch more.

I have never seen so much italics in a book.

Thanks for the help
I will go back to reading "Mastering Regular Expressions" now and see if I can learn some new tricks.
Archon
Old 01-26-2011, 10:44 PM   #9
ldolse
Wizard
I'm pretty sure \S is anything NOT a space - not one I would recommend using, as you'll have the overmatching problem again.
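
A short Python sketch of why that class is risky. Since "\w" is already covered by "\S", "[\w\S]" behaves like "\S" alone: it accepts punctuation and markup characters but no spaces. The HTML fragment below is an invented example, not taken from the book in question.

```python
import re

non_space = re.compile(r"_([\w\S]+?)_")   # effectively the same as _(\S+?)_
word_or_space = re.compile(r"_([\w\s]+?)_")

# Invented markup containing underscores that are not italics markers.
html = "<a href='x_y.html'>z_1</a>"
print(non_space.findall(html))      # ["y.html'>z"]  (overmatch into the tag)
print(word_or_space.findall(html))  # []

# A multi-word phrase is missed, because spaces are excluded.
phrase = '"_Alice in Wonderland_."'
print(non_space.findall(phrase))      # []
print(word_or_space.findall(phrase))  # ['Alice in Wonderland']
```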

If italicize common cases under heuristics didn't work then open a bug with the source format of the book at bugs.calibre-ebook.com. The feature is new, so it's not been stressed by a lot of real world data just yet.
Old 02-02-2011, 09:13 AM   #10
Archon
Zealot
Just an update.

I was able to come up with a regex that catches everything inside the underscores like _foo_ and replaces it with italicize tags.

The search expression:
_(.*?)_

And replace expression for HTML italicize tags:
<i>\1</i>

This S&R is not greedy and it leaves all characters outside the underscores alone.
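
For reference, a minimal Python sketch of this search and replace. The function name italicize and the sample strings are made up for illustration. One thing worth knowing: because "*" allows zero characters, two adjacent underscores also match and produce an empty tag pair; using ".+?" instead would leave them untouched.

```python
import re

def italicize(text):
    """Wrap everything between a pair of underscores in <i> tags (sketch)."""
    return re.sub(r"_(.*?)_", r"<i>\1</i>", text)

print(italicize('"_Alice in Wonderland_." _You_ are'))
# "<i>Alice in Wonderland</i>." <i>You</i> are

# Zero characters between underscores still matches with ".*?".
print(italicize("a__b"))                          # a<i></i>b
print(re.sub(r"_(.+?)_", r"<i>\1</i>", "a__b"))   # a__b
```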

Thanks kind people for the help.
Archon
Old 02-05-2011, 09:19 AM   #11
Archon
Zealot
Just a last update to this.

With more experimentation and reading I learned that the _foo_ format is used by Markdown and Textile markup languages to signify italics or emphasis in a text file.

I have a few .txt files with this markup, and I found that Calibre has, under "Preferences:Conversions:Input Options:TXT Input", a pulldown box under formatting style to select either Markdown or Textile formatting (instead of auto).

Either of these options converts _foo_ into italicized foo as the txt file is converted to epub (or presumably any other format).

Hopefully this will save someone else a little time on this but I have learned considerably nonetheless from nattering over this problem.

Archon
Old 02-05-2011, 10:13 AM   #12
user_none
Sigil & calibre developer
user_none ought to be getting tired of karma fortunes by now.
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Hm... Auto and heuristic type should be converting _ to italic text. Looks like it broke at some point. I'll get that fixed.