MobileRead Forums - View Single Post - Regex-function to merge endnotes files in editor

EbookMakers · 11-23-2020, 10:05 PM

A test epub is attached to the lead post of this topic. We can think of two solutions. A solution for well-behaved people like you and even me, and a solution for rascals. They use the same regex:

Code:

<body[^\n]*\n\K\s*(<h[^>]*>[^<]*</h\d>)?\s*<dl[^>]*>\s*<dt[^>]*>\[<a\b(?:(?!</dl).)+</dl>\s*(?=</body>)

The \K switch resets the start of the selection: whatever is matched before it is required but left out of the result, so the expression placed before \K behaves like a positive lookbehind. I use it, for my own reasons, to stay compatible with the PCRE engine, which does not accept variable-length lookbehinds the way the engine used here does.
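To make the \K idea concrete outside the editor, here is a minimal sketch using Python's standard `re` module. That module supports neither \K nor variable-length lookbehind, so the sketch emulates \K by capturing the prefix in a named group and keeping only what follows. The sample HTML and the group names `prefix` and `kept` are made up for the illustration, and the note pattern is simplified.

```python
import re

# Made-up sample: one note file as the conversion might produce it.
html = ('<body class="calibre">\n'
        '<dl><dt>[<a href="#n1">1</a>]</dt><dd>Note</dd></dl>\n'
        '</body>')

# A variable-length lookbehind is the "natural" way to write the prefix,
# but the standard library engine is expected to reject it.
try:
    re.compile(r'(?<=<body[^\n]*\n)<dl')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Emulation of \K: the prefix must match, but only 'kept' is the selection.
emulated = re.compile(r'(?P<prefix><body[^\n]*\n)(?P<kept><dl.*?</dl>)',
                      re.DOTALL)
m = emulated.search(html)
selection = m.group('kept')

print(lookbehind_ok)
print(selection)
```

With an engine that understands \K, the same selection is obtained directly and no extra group is needed.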

On an epub respecting the html syntax resulting from a docx -> epub conversion, the regex selects:

- the note, in files that contain exactly one note in the syntax produced by the conversion; the regex makes sure that the note sits directly between the pair of body tags.
- in optional group 1, the title that precedes the 1st note only (after the conversion).

The regex selects, one after another, the solitary notes that follow the syntax of the conversion, so it also reveals the names of the xhtml files that contain them. Running the regex in count mode tells you whether the epub is concerned by the regex-function at all: merging should only be requested if there are at least two notes. If group 1 exists, the file contains the 1st note.
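The behaviour described above can be checked outside the editor with a small sketch. It uses Python's standard `re` module, which has no \K, so \K is simply dropped here: the match then also includes the body line, but the count and group 1 are unaffected. The file names, the sample HTML and the use of the DOTALL flag are assumptions made for the example.

```python
import re

# The regex from the post with \K removed (stdlib 're' does not support it).
NOTE_RE = re.compile(
    r'<body[^\n]*\n\s*(<h[^>]*>[^<]*</h\d>)?\s*<dl[^>]*>\s*<dt[^>]*>\[<a\b'
    r'(?:(?!</dl).)+</dl>\s*(?=</body>)',
    re.DOTALL,
)

def note(n, heading=''):
    """Build a made-up single-note file body."""
    return (f'<body class="calibre">\n{heading}'
            f'<dl><dt>[<a href="text.xhtml#r{n}">{n}</a>]</dt>'
            f'<dd>Note {n}</dd></dl>\n</body>')

# Hypothetical book content: one text file and three solitary notes.
files = {
    'chapter1.xhtml': '<body class="calibre">\n<p>Some text.</p>\n</body>',
    'note2.xhtml': note(2),
    'note1.xhtml': note(1, '<h2>Notes</h2>\n'),
    'note3.xhtml': note(3),
}

note_files = []
first_note_file = None
for name, text in files.items():
    m = NOTE_RE.search(text)
    if m is None:
        continue                  # not a solitary note file
    note_files.append(name)
    if m.group(1):                # the title is only present before note 1
        first_note_file = name

should_merge = len(note_files) >= 2
print(note_files, first_note_file, should_merge)
```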

We cannot predict on which (active) file the regex will start. We can ask that it browse the files in the “spine” order with the parameter:
replace.file_order = 'spine'

We only know that the occurrence for which group 1 exists is the 1st note. Both solutions rely on this characteristic to obtain a file with the notes starting with the 1st note and then in the correct order. Otherwise, as stated in a previous message, the order of the notes in the result file would depend on the active file when launching the regex.

One argument of the replace function is “data”, a “dict” that persists for the whole execution of the function. Both of our functions store their information in this dict.

It is possible to request that the function be called one last time, after the last occurrence:
replace.call_after_last_match = True

The merge is requested during this last call. The merge updates the note calls in the text and the opf file (since it deletes files). The display in the editor must then be refreshed, as Kovid wrote above:
get_boss().apply_container_update_to_gui()
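Putting the pieces together, here is a rough skeleton of such a function. It is only a sketch and not either of the two solutions. The keys stored in `data` (`note_files`, `first_note_file`, `merge_order`) are names invented for the illustration, rotating the list so that the file with group 1 comes first is my assumption about how to get the right order, and the merge itself is left as a comment because it needs calibre to run. The bottom part simulates the calls with a toy regex so the logic can be tried outside the editor.

```python
import re

def replace(match, number, file_name, metadata, dictionaries, data, functions,
            *args, **kwargs):
    if match is None:
        # Extra call made after the last occurrence (call_after_last_match).
        names = data.get('note_files', [])
        first = data.get('first_note_file')
        if len(names) >= 2 and first in names:
            i = names.index(first)
            # Rotate the visited files so the one holding note 1 comes first.
            data['merge_order'] = names[i:] + names[:i]
            # Inside calibre, the merge and then
            # get_boss().apply_container_update_to_gui() would be requested here.
        return None  # this sketch produces no replacement text on the extra call
    # Normal call: remember the file, and whether it holds the 1st note.
    data.setdefault('note_files', []).append(file_name)
    if match.group(1):
        data['first_note_file'] = file_name
    return match.group()  # leave the matched text unchanged

replace.file_order = 'spine'
replace.call_after_last_match = True

# Simulation outside calibre, with a toy regex and made-up file names.
demo_re = re.compile(r'(<h2>Notes</h2>)?<dl>')
visited = [('note2.xhtml', '<dl>'),
           ('note3.xhtml', '<dl>'),
           ('note1.xhtml', '<h2>Notes</h2><dl>')]
demo_data = {}
for number, (name, text) in enumerate(visited, 1):
    replace(demo_re.search(text), number, name, None, None, demo_data, None)
replace(None, len(visited) + 1, '', None, None, demo_data, None)
print(demo_data['merge_order'])
```

In the simulation the regex starts on note2.xhtml, yet the stored order begins with note1.xhtml, which is the point of relying on group 1.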

A major problem is that the result of the regex-function comes from the “return” of the “replace” function, even though the merge is executed after processing the last occurrence! One would have expected that the result of the regex-function would come from the "merge". The main difference between the two solutions is how to work around this problem.

Both functions are commented.
