RegEx Function: Title Case - Page 2

kovidgoyal · 12-30-2014, 08:55 PM

The way those functions work is that they uppercase the contents of any groups in the find expression. You have specified a group that matches H1. You need to specify a group that matches the actual content, like this.

<[Hh][1-6]>(.+?)</[Hh][1-6]>

If you want a case-changing function that ignores text in tag definitions in the matched text, you will need to write one yourself. The builtin functions won't do that because they are for general-purpose use, not specifically for changing text between tags.

phossler · 12-31-2014, 10:29 AM

Thanks for your patience and the explanations

Quote:

You need to specify a group that matches the actual content, like this.

Thanks - I'm a little smarter about RegEx now.

Using your Find works exactly as advertised and it correctly finds and highlights the Hx tags.

Quote:

The built in functions won't do that, because, they are for general purpose use, not specifically for changing text between tags.

Understand, but it still seems (to me at least) that there is a possible side effect of the built in TitleCase function

1. It replaces tag markers ('<' and '>') with what is treated like normal text
2. It does not TitleCase the text that it does find

Quote:

'''Title-case matched text. If the regular expression contains groups,
only the text in the groups will be changed, otherwise the entire text is
changed.'''

So I assume that

<[Hh][1-6]>(.+?)</[Hh][1-6]>

would make the \1 group for the Replace just the red text in the Before below?

Before:

Code:

  <h1>TEST1 TEST1 TEST1 TEST1 TEST1 </h1>
  <p>NOW IS THE TIME and this should remain mixed case</p>
  <h1>TEST2 TEST2 TEST2 <br/><br/>TEST3 TEST3 </h1>
  <p>NOW IS THE TIME and this should remain mixed case</p>
  <h1>TEST4 <i>TEST4 TEST4 TEST4</i> TEST4 </h1>

After:

Code:

  <h1>Test1 Test1 Test1 Test1 Test1 </h1>
  <p>NOW IS THE TIME and this should remain mixed case</p>
  <h1>TEST2 TEST2 TEST2 &lt;br/&gt;&lt;br/&gt;TEST3 TEST3 </h1>
  <p>NOW IS THE TIME and this should remain mixed case</p>
  <h1>TEST4 &lt;i&gt;TEST4 TEST4 TEST4&lt;/i&gt; TEST4 </h1>

1. So the simplest case (first H1) works correctly

2. I don't understand why the same logic isn't applied to the second and third headings, so that all the text between the Hx tags is made title case, or why < and > are replaced with entities that end up being treated like normal text

kovidgoyal · 12-31-2014, 10:45 AM

The logic is simple:

*Everything* that matches the expression inside the brackets is made upper case. Furthermore, the function treats all of that text as plain text, not as a mix of HTML and plain text. Because the output of the function is being put into an HTML file, < and > get replaced by entities.
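A rough illustration of the behaviour described here, written in plain Python and not taken from calibre's own code: the captured group is handled as plain text, so its angle brackets are escaped when it is written back into the HTML file. The helper name `as_plain_text_replacement` and the use of `str.upper` as the case function are assumptions made only for this sketch.

```python
import html
import re

HEADING = re.compile(r'<[Hh][1-6]>(.+?)</[Hh][1-6]>')

def as_plain_text_replacement(match, func=str.upper):
    # Hypothetical helper: change the case of group 1 only, then escape the
    # result as plain text before putting it back between the heading tags.
    whole = match.group(0)
    offset = match.start(0)
    start, end = match.span(1)
    changed = html.escape(func(match.group(1)), quote=False)
    return whole[:start - offset] + changed + whole[end - offset:]

source = '<h1>test4 <i>test4</i> test4</h1>'
result = HEADING.sub(as_plain_text_replacement, source)
print(result)
```

The inner `<i>` tag is part of group 1, so it is case-changed and escaped along with the words around it.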

Or in other words, that function is not designed to be used in the way you are trying to use it.

You need to come up with a function that understands that it could be operating on a mixture of HTML tags and plain text and so restricts itself to only the plain text parts.
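One way such a function might look, as a minimal sketch in plain Python that does not depend on calibre: split the matched text on tags and apply the case change only to the pieces between them. The names `apply_to_text_only` and `simple_title` are hypothetical, and `simple_title` is a deliberately simplified stand-in for a real title-casing routine.

```python
import re

# A capturing group makes re.split keep the tags in its result list.
TAG = re.compile(r'(<[^<>]+>)')

def simple_title(text):
    # Simplified stand-in: capitalize each space-separated word.
    return ' '.join(word.capitalize() for word in text.split(' '))

def apply_to_text_only(markup, func):
    # After splitting, tags sit at the odd indices and the text between
    # them at the even indices; only the text pieces are passed to func.
    parts = TAG.split(markup)
    return ''.join(part if i % 2 else func(part) for i, part in enumerate(parts))

heading = '<h1>TEST4 <i>TEST4 TEST4</i> TEST4 </h1>'
fixed = apply_to_text_only(heading, simple_title)
print(fixed)
```

Because the tags are passed through untouched, nothing needs to be escaped and the markup survives as markup.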

kovidgoyal · 12-31-2014, 10:46 AM

I have created a builtin function for you that does that, in the next release.

https://github.com/kovidgoyal/calibr...151ff7a9946577

jbacelar · 12-31-2014, 11:10 AM

Paul,
What this expression looks for, <[Hh][1-6]>(.+?)</[Hh][1-6]>, is:
<an h (or H) followed by a digit (1 to 6)>anything</another h followed by another digit>

Here:
<h followed by a digit> anything </ br or </i

br or i is not an h followed by a digit.
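The same point can be checked with Python's standard `re` module, used here only as a stand-in for the editor's search: the inner tags do not end the match, so group 1 contains everything between the heading tags, inner markup included.

```python
import re

pattern = re.compile(r'<[Hh][1-6]>(.+?)</[Hh][1-6]>')
line = '<h1>TEST2 TEST2 TEST2 <br/><br/>TEST3 TEST3 </h1>'

m = pattern.search(line)
whole_match = m.group(0)   # the heading tags and everything between them
group_one = m.group(1)     # only the content between the heading tags
print(whole_match)
print(group_one)
```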

I recommend that if you want to use regex, visit this website:
http://www.regular-expressions.info/tutorial.html

phossler · 12-31-2014, 03:19 PM

@kovid -- THANKS!!!! I can see I'll have to learn at least a little python

I was confused by the apparent different treatment of the TitleCase function between the first (simplest) sentence "Where It Worked Just Fine" and the second and third where IT LEFT EVERYTHING IN UPPER CASE

@jbacelar -- The Find Kovid gave me seems to work fine. It would select all this H1 text, including the <h1> and </h1> ...

<h1>TEST2 TEST2 TEST2 <br/><br/>TEST3 TEST3 </h1>

After the Replace

<h1>TEST2 TEST2 TEST2 &lt;br/&gt;&lt;br/&gt;TEST3 TEST3 </h1>

What was confusing me was that the text was not in title case. I understand the replaced entities now

I believe that Kovid's new built-in function is the only way to handle these types of cases

Ted Friesen · 06-26-2020, 02:04 PM

I'm also having trouble with the "Title-case text (ignore tags)" built-in function. I've wrapped all the UPPER case text that I want to convert to Title case in <h2> tags and am using the search parameter "(?s)<h\d>(.+?)</h\d>".
Applying "Replace-all" results in a deletion of all H tags and the intervening text. No conversion just deletion.
Editing the built-in function, this is what I see:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return ''
Shouldn't there be more to it?
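The deletion is consistent with what a function like that does when it is used as the replacement. A small sketch with Python's `re.sub` (which passes only the match object, unlike the editor, which passes the extra arguments shown above): every match is replaced by the function's return value, and an empty string removes the match entirely.

```python
import re

def replace(match, *args, **kwargs):
    # Same body as the code shown above: always return an empty string.
    return ''

text = 'before <h2>SOME HEADING</h2> after'
result = re.sub(r'(?s)<h\d>(.+?)</h\d>', replace, text)
print(repr(result))
```

The heading tags and the text between them disappear, which matches the "no conversion, just deletion" result.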

kovidgoyal · 06-26-2020, 10:57 PM

https://manual.calibre-ebook.com/fun...n-the-document

phossler · 06-27-2020, 08:22 PM

@Ted - Your PM

I actually run two steps: one to upper case headings, and then a second to title case them

These are my saved searches and this is the function listing for 'Title case text - Ignore tags'

Code:

from calibre.utils.titlecase import titlecase
from calibre.ebooks.oeb.polish.utils import apply_func_to_html_text

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    '''Title-case matched text, ignoring the text inside tag definitions.'''
    return apply_func_to_html_text(match, titlecase)

Ted Friesen · 06-29-2020, 03:43 PM

Thanks Paul it worked!!

Question: When I Create/Edit built-in functions is there supposed to be some code there?

phossler · 06-29-2020, 07:23 PM

I have mine as a Saved Search, but you can also do it ad hoc

There is code there that defines the function, but I don't know Python so I never created any

I guess you can create your own function

The Calibre Users' Manual is one of the best I've seen in a long time:

https://manual.calibre-ebook.com/function_mode.html

Ted Friesen · 07-02-2020, 07:57 PM

Thanks for the code Paul.

It sounds like you didn't code the function you sent me, but that it was "built-in". When I choose any of the dozen or so built-in functions the code is always the same

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return ''

Does your installation of Calibre actually display appropriate function code when you choose different built-in functions? If so, any ideas why?

phossler · 07-03-2020, 10:27 AM

Yes, it was one of the built in ones that Calibre supplies

Quote:

Automatically fixing the case of headings in the document
Here, we will leverage one of the builtin functions in the editor to automatically change the case of all text inside heading tags to title case:

Find expression: <([Hh][1-6])[^>]*>.+?</\1>

For the function, simply choose the Title-case text (ignore tags) builtin function. This will change titles that look like: <h1>some TITLE</h1> to <h1>Some Title</h1>. It will work even if there are other HTML tags inside the heading tags.
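A quick check of that find expression with Python's `re` module, as a sketch rather than anything run inside the editor: the `\1` backreference requires the closing tag to repeat whatever heading tag was opened, and `[^>]*` lets the opening tag carry attributes. The sample markup is made up for the illustration.

```python
import re

pattern = re.compile(r'<([Hh][1-6])[^>]*>.+?</\1>')

# The second heading is deliberately mismatched (<h2> closed by </h3>).
text = '<h1 class="title">some <i>TITLE</i></h1><h2>x</h3>'
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```

Only the well-formed heading is matched; the mismatched pair is skipped because `</h3>` does not repeat the captured `h2`.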

Don't know why. If I look at the 'code' for the function, I see the attached

Ted Friesen · 07-03-2020, 01:47 PM

From recent replies, it seems that clicking on create/edit regex-function should reveal the code for built-in functions. My installation (Calibre 4.19 64-bit on Microsoft Windows [Version 10.0.18363.900]) does not.

Any ideas why? Have I turned something off inadvertently? Have I failed to install some module? Should I still be using OS/2?

Having some functions (more than what's in the manual) to play with will help me learn enough to fix my epub book library.

Any help you can give me (samples of code and search strings) will be greatly appreciated.

davidfor · 07-04-2020, 10:52 AM

Quote:

Originally Posted by Ted Friesen

Thanks for the code Paul.

It sounds like you didn't code the function you sent me, but that it was "built-in". When I choose any of the dozen or so built-in functions the code is always the same

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return ''

Does your installation of Calibre actually display appropriate function code when you choose different built-in functions? If so, any ideas why?

Quote:

Originally Posted by Ted Friesen

From recent replies, it seems that clicking on create/edit regex-function should reveal the code for built-in functions. My installation (Calibre 4.19 64-bit on Microsoft Windows [Version 10.0.18363.900]) does not.

Any ideas why? Have I turned something off inadvertently? Have I failed to install some module?

What you have above looks like the default code you get if you open the function editor without a name in the "Function" field in the find box. If you then use the "Function name" drop-down to select another function, the displayed code is not updated. I don't know if that is deliberate or not; I can see it working either way. If you select the function name in the find box and then open the function editor, you get the code for that function.

Quote:

Should I still be using OS/2?

If only we could

Quote:

Having some functions (more than what's in the manual) to play with will help me learn enough to fix my epub book library.

Any help you can give me (samples of code and search strings) will be greatly appreciated.

I don't have any useful examples. I've tried a couple of things, but it's actually the search that ends up being the problem, not the update.
