ID in my ebook - regular expression - Page 2

geniale · 09-15-2015, 02:52 PM

Thanks to all of you, I finally did it!!!
Thom*'s formula is great: <p class="calibre2"><sup class="calibre3">(.*?)</sup>

Replacing the group (.*?) with ([0-9]) for 1-9, ([0-9][0-9]) for 10-99, and ([0-9][0-9][0-9]) for 100-999 does the job.
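The three digit groups above can probably be folded into a single group with a bounded quantifier, so one search covers 1 to 999. This is only a sketch: the surrounding markup is assumed to be `<p class="calibre2"><sup class="calibre3">...</sup>`, and the sample text is made up for illustration.

```python
import re

# Assumed markup shape; [0-9]{1,3} matches one, two or three digits.
PATTERN = re.compile(r'<p class="calibre2"><sup class="calibre3">([0-9]{1,3})</sup>')

# Made-up sample text, not taken from the actual book.
sample = (
    '<p class="calibre2"><sup class="calibre3">7</sup>one'
    '<p class="calibre2"><sup class="calibre3">42</sup>two'
    '<p class="calibre2"><sup class="calibre3">123</sup>three'
)

numbers = PATTERN.findall(sample)
print(numbers)
```

Unlike (.*?), this group cannot accidentally capture non-digit text.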

Wow, it looks nice!!! And it is quick.

Thom* · 09-16-2015, 12:40 PM

I was so inspired by davidfor's suggestion to use Regex-function mode that I delved into it and wrote my first function. It was a challenge that I could not pass up. I do believe that it works just as you requested, so I might as well share it with you.

So:
- Begin by selecting "Regex-Function" in the search/replace panel.
- Leave the "All text files" selected.
- Make sure there are no files listed in the "Function:" box so that when you hit "Create/edit" a new file will be created.
- Click "Create/edit" and name the new file what you like (I named it "FixIt").
- Copy the following code to the body of the new file replacing anything that is there.

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    # Build the new tag: file-name part + zero-padded number + original number.
    newid = ('<p id="id_' + file_name[-10:-5] + ("000" + match.group(1))[-3:] + '" class="calibre1"><sup class="calibre2">' + match.group(1) + '</sup>')

    return newid

- Check to be sure you have 0 spaces before "def replace", 4 spaces before "newid" and 4 spaces before "return".
- Click "OK" and you are ready to rock and roll.
- Use the search string that I provided you previously: <p class="calibre2"><sup class="calibre3">(.*?)</sup>
- Hit replace and you should be in business.
- It should do what you want to all files.
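To try the function outside calibre first, it can be driven with Python's own `re.sub`. This is a rough sketch, not how calibre itself calls the function: the file name `text/v3001.html`, the sample paragraph, and the helper `run` are hypothetical, and the unused arguments are simply passed as `None`.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    newid = ('<p id="id_' + file_name[-10:-5] + ("000" + match.group(1))[-3:]
             + '" class="calibre1"><sup class="calibre2">' + match.group(1) + '</sup>')
    return newid

# Search string assumed from the thread.
SEARCH = r'<p class="calibre2"><sup class="calibre3">(.*?)</sup>'

def run(html, file_name):
    # Hypothetical helper: feed every match to replace(), as a stand-in for calibre.
    return re.sub(SEARCH, lambda m: replace(m, 1, file_name, None, None, None, None), html)

# Hypothetical file name and paragraph, for illustration only.
result = run('<p class="calibre2"><sup class="calibre3">2</sup>Some text</p>', "text/v3001.html")
print(result)
```

Note that the slice `[-10:-5]` only picks out `v3001` when the name ends in a five-character extension such as `.html`; with `.xhtml` the slice would be off by one.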

Please let me know if this works for you or what problems you encounter.

That was great fun, thanks for the challenge.

cybmole · 09-17-2015, 01:55 AM

that is interesting, could you talk us through it please.
I can intuit what most of the code is doing, but what is the significance of needing exactly 4 spaces in two locations?

davidfor · 09-17-2015, 02:26 AM

Quote:

Originally Posted by cybmole

that is interesting, could you talk us through it please.
I can intuit what most of the code is doing, but what is the significance of needing exactly 4 spaces in two locations?

It is Python code. Rather than using delimiters of some sort, Python uses the indent levels to indicate code blocks. The "def" line is defining a function. Any code lines in the function have to be indented under it. The indent is usually four spaces, but a single space or a tab should work.
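A small sketch of that point, with made-up function names: both indent widths define a working function, while a body with no indent at all is rejected before it ever runs.

```python
def four_spaces(x):
    return x + 1

def one_space(x):
 return x + 1

# A function body with no indent does not compile.
try:
    compile("def broken(x):\nreturn x + 1\n", "<demo>", "exec")
    compiled = True
except IndentationError:
    compiled = False

print(four_spaces(1), one_space(1), compiled)
```

What matters is that all lines of one block use the same indent; four spaces is just the usual convention.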

cybmole · 09-17-2015, 03:07 AM

Quote:

Originally Posted by davidfor

It is Python code. Rather than using delimiters of some sort, Python uses the indent levels to indicate code blocks. The "def" line is defining a function. Any code lines in the function have to be indented under it. The indent is usually four spaces, but a single space or a tab should work.

that makes sense thanks - 4 spaces seemed arbitrary. The stuff in square brackets is extracting specific parts of character strings ?

eschwartz · 09-17-2015, 03:12 AM

https://stackoverflow.com/questions/...slice-notation
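As a quick illustration of the two slices used in the function, here is a sketch with a hypothetical file name; the real names in the book may differ.

```python
file_name = "text/v3001.html"  # hypothetical file name

# From the 10th character from the end up to (not including) the 5th from the end.
chapter = file_name[-10:-5]

# Prepend zeros, then keep only the last three characters.
padded = [("000" + n)[-3:] for n in ("7", "42", "123")]

print(chapter, padded)
```

Negative indexes count from the end of the string, which is why the slice does not depend on how long the folder part of the name is.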

Thom* · 09-17-2015, 10:44 AM

Here I have documented the function in detail and I have included the debug (print) statements so you can see the results. I hope it helps.

Code:

# Begin with what I assume is kovidgoyal's Python function "replace", which gives access to file_name and the match data.
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    # Simply key in the desired text to begin the replace string.
    str1 = '<p id="id_'
    print(str1)

    # Take file_name and extract from the 10th character from the end (inclusive) to the 5th from the end (exclusive).
    str2 = file_name[-10:-5]
    print(str2)

    # Take the match, add zeros to the front and keep the last 3 characters.
    str3 = ("000" + match.group(1))[-3:]
    print(str3)

    # Key in the desired text for the middle of the replace string.
    str4 = '" class="calibre1"><sup class="calibre2">'
    print(str4)

    # Take the match (unmodified).
    str5 = match.group(1)
    print(str5)

    # Key in the desired text for the end of the replace string.
    str6 = '</sup>'
    print(str6)

    # Concatenate the pieces.
    newid = str1 + str2 + str3 + str4 + str5 + str6
    print(newid)

    # Return the replace string.
    return newid

geniale · 09-18-2015, 04:00 AM

Quote:

Originally Posted by Thom*

I was so inspired by davidfor's suggestion to use Regex-function mode that I delved into it and wrote my first function. It was a challenge that I could not pass up. I do believe that it works just as you requested, so I might as well share it with you.

So:
- Begin by selecting "Regex-Function" in the search/replace panel.
- Leave the "All text files" selected.
- Make sure there are no files listed in the "Function:" box so that when you hit "Create/edit" a new file will be created.
- Click "Create/edit" and name the new file what you like (I named it "FixIt").
- Copy the following code to the body of the new file replacing anything that is there.

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    newid = ('<p id="id_' + (file_name [-10:-5:]) + ("000" + match.group(1)) [-3:] + '" class="calibre1"><sup class="calibre2">' + match.group(1) + '</sup>')

    return newid

- Check to be sure you have 0 spaces before "def replace", 4 spaces before "newid" and 4 spaces before "return".
- Click "OK" and you are ready to rock and roll.
- Use the search string that I provided you previously: <p class="calibre2"><sup class="calibre3">(.*?)</sup>
- Hit replace and you should be in business.
- It should do what you want to all files.

Please let me know if this works for you or what problems you encounter.

That was great fun, thanks for the challenge.

*****************
Wow, it's amazing, Thom*. The function works and does ALL the job. This is what I was looking for!!!

geniale · 09-18-2015, 04:41 AM

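Thom*, the function works great, it's amazing.

From file v3001 to v3011 it goes VERY well, counting from 1 to 24. From v3012 to v3017 it seems to skip number 1 and start from 2; where number 1 should be, nothing happens. Why is that?

From v3001 to v3011:
<p class="block_1" id="v3001001><span class="block_2">1</span><span class="text_">
<p class="calibre2" id="v3001002"><sup class="calibre3">2</sup>
....
<p class="calibre2" id="v3001017"><sup class="calibre3">17</sup>

The file v3011:
<p class="block_1" id="v3011001"><span class="block_2">11</span><span class="text_">
<p class="calibre2" id="v3011002"><sup class="calibre3">2</sup>
<p class="calibre2" id="v3011003"><sup class="calibre3">3</sup>
....
<p class="calibre2" id="v3011036"><sup class="calibre3">36</sup>

From v3012 to v3017:
<p class="block_1"><span class="block_2">17</span><span class="text_">
<p id="v3017002" class="calibre1"><sup class="calibre2">2</sup>
<p id="v3017003" class="calibre1"><sup class="calibre2">3</sup>
.....
<p id="v3017016" class="calibre1"><sup class="calibre2">16</sup>

Every file begins with the chapter number (1-17), which has a different format. The strange thing is that for chapters 1 to 11 your function works great, but from chapter 12 on it skips ONLY the first paragraph and starts counting from paragraph 2.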

Thom* · 09-18-2015, 09:10 AM

Wow, this looks like a whole different scenario and I can't quite follow your question.

If the search and replace is skipping the first paragraph on some files, probably the search criteria is not met. Test that by simply running the search (skip the replace) and see if the paragraph is found.

It is common in ebooks for the first paragraph of a chapter to have slightly different format.
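The "search only" check can also be sketched in Python. The paragraphs below are hypothetical, modeled on markup where the first paragraph of a chapter uses `block_1`/`block_2` classes instead of the `calibre2`/`calibre3` pair the search string expects.

```python
import re

# Search string assumed from the thread.
SEARCH = re.compile(r'<p class="calibre2"><sup class="calibre3">(.*?)</sup>')

# Hypothetical chapter content; the first paragraph is formatted differently.
paragraphs = [
    '<p class="block_1"><span class="block_2">17</span>First paragraph</p>',
    '<p class="calibre2"><sup class="calibre3">2</sup>Second paragraph</p>',
    '<p class="calibre2"><sup class="calibre3">3</sup>Third paragraph</p>',
]

# Anything listed here is never touched by the replace step.
unmatched = [p for p in paragraphs if not SEARCH.search(p)]
print(unmatched)
```

A paragraph that shows up in `unmatched` needs its own search string, or a manual fix.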

geniale · 09-18-2015, 12:36 PM

Quote:

Originally Posted by Thom*

Wow, this looks like a whole different scenario and I can't quite follow your question.

If the search and replace is skipping the first paragraph on some files, probably the search criteria is not met. Test that by simply running the search (skip the replace) and see if the paragraph is found.

It is common in ebooks for the first paragraph of a chapter to have slightly different format.

******
In a few words: from chapter 1 to 11 the changes are made to the first paragraph. From chapter 12 to 17 the changes are made starting from the second paragraph.

But it's OK, I will do them manually, and if I find a way to do it automatically, I will let you know.

Thanks again for your help!!!

geniale · 09-22-2015, 05:21 AM

It seems that it was a problem with the format and the search criteria. To clean up the code exported from Word a little more, how can I do this replacement?

This is my line:
<span class="text_chapter">2 <span class="text_1">This is my text I want to save.</span></span></p>

How I want it to be:
<span class="text_chapter"><span class="text_1">2</span></span>This is my text I want to save.</p>

In other words, is there a formula to group a line of text? How can I delete all the spaces that come after the number 2 of my chapters?

Thanks again for the help.
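One possible approach, sketched here and not tested against the real book: capture the number and the text in two groups and rebuild the line in the wanted order. The markup is assumed to be `<span class="text_chapter">2 <span class="text_1">...</span></span>`, and the names `PATTERN` and `REPLACEMENT` are only for illustration.

```python
import re

# Group 1: the number; \s+ swallows the spaces after it; group 2: the text.
PATTERN = re.compile(
    r'<span class="text_chapter">(\d+)\s+<span class="text_1">(.*?)</span></span>'
)
REPLACEMENT = r'<span class="text_chapter"><span class="text_1">\1</span></span>\2'

line = ('<span class="text_chapter">2 <span class="text_1">'
        'This is my text I want to save.</span></span></p>')

cleaned = PATTERN.sub(REPLACEMENT, line)
print(cleaned)
```

The same search and replace strings should also be usable in plain Regex mode, since `\1` and `\2` are ordinary back-references.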

P.S. I have a bigger challenge: is there a way to know when a paragraph number is missing?
<sup class="calibre3">2</sup> - this is the format of the paragraph number
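A rough way to spot gaps, assuming every paragraph number sits in a `<sup class="calibre3">` tag and the numbers within one file should run without breaks: collect the numbers and report the ones that never appear. The function name `missing_numbers` and the sample are made up for illustration.

```python
import re

# Assumed format of a paragraph number.
SUP = re.compile(r'<sup class="calibre3">(\d+)</sup>')

def missing_numbers(html):
    """Return the numbers absent between the lowest and highest number found."""
    present = {int(n) for n in SUP.findall(html)}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]

# Made-up sample where 4, 7 and 8 are missing.
sample = ''.join('<sup class="calibre3">%d</sup>' % n for n in (2, 3, 5, 6, 9))
print(missing_numbers(sample))
```

This cannot tell whether the first or last number of a chapter is missing, since it only looks between the lowest and highest number it finds.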
