MobileRead Forums - View Single Post - A new feature proposal: report of group saved searches

Tex2002ans · 03-20-2017, 08:10 PM

Quote:

Originally Posted by DiapDealer

You still haven't explained how. Unless you already know what the numbers are "supposed" to be, how will seeing them (broken down or lumped together) speed things up?

I could see how roger64's recommendation could be helpful.

Here are a few of the use cases I can think of where this would be helpful.

In almost every EPUB I mass-convert footnotes from <sup>##</sup> form into [##] form. It would be nice to see something like:

Code:

Fix Footnote <sup>##</sup> -> [##]		102
Fix Endnote <sup>##</sup> -> [##]		100

Currently, if I run the entire group, I just get a "Replacements made: 202". If the two counts didn't match, I would know that there is an issue I need to look into. Maybe there was a footnote 99a OR an OCR error somewhere along the line.
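To make the mismatch check concrete, here is a rough Python sketch. It is only an illustration and not how Sigil works: the two patterns and the sample markup are assumptions (the actual saved searches aren't shown here), and `count_matches` is a hypothetical helper.

```python
import re

def count_matches(pattern: str, text: str) -> int:
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))

# Assumed markup: references in the body are <a class="fnref">,
# and the notes themselves are <p class="note">.
BODY_REF = r'<a class="fnref"[^>]*><sup>\d+</sup></a>'
NOTE_NUM = r'<p class="note"><sup>\d+</sup>'

sample = (
    '<p>First claim.<a class="fnref" href="#n1"><sup>1</sup></a></p>\n'
    '<p>Second claim.<a class="fnref" href="#n2"><sup>2</sup></a></p>\n'
    '<p class="note"><sup>1</sup> First note.</p>\n'
)

refs = count_matches(BODY_REF, sample)
notes = count_matches(NOTE_NUM, sample)

# A difference between the two counts is the signal to go look for
# something like a "99a" footnote or an OCR error.
if refs != notes:
    print(f"Mismatch: {refs} references vs {notes} notes")
```

A lumped total would hide this; only the per-search counts show that one side is short.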

I also have a "Finereader Cleanup" group of saved searches to clean up some of the cruft Finereader produces. Here are a few:

Split Double Footnote

Search: ([0-9]+), ([0-9]+)
Replace: \1,\2

Fix Bold Smallcaps

Search: 
Replace: 

Clean Italic &

Search: <i>&amp;</i>
Replace: &amp;

On the last book I worked on, if I run the entire group, it says "Replacements made: 1072". But if I run each Regex individually and use "Count All", I get a helpful breakdown like this:

Code:

Fix Italics 				403
Fix Bold 				6
Fix Bold/Italics 			110
Fix Smallcaps 				25
Fix Bold/Smallcaps			1
Clean Italic &				0
Split Double Footnote			0
Fix Finereader 12 Table Alignment	198
Clean Bold td				0
Clean Italics td			29
Clean td				298
Clean Table Headers			2

This could let me know of a potential issue to look out for in this specific EPUB.
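As a sketch of what such a per-search report could look like, here is a small Python example that runs a group of searches in order and keeps a count for each one. This is not Sigil's implementation. `run_group` is a hypothetical helper, and both patterns are assumptions made for illustration.

```python
import re

def run_group(searches, text):
    """Run each (name, pattern, replacement) in order.

    Returns the final text and a list of (name, count) pairs.
    """
    report = []
    for name, pattern, replacement in searches:
        # re.subn returns the new string and the number of replacements made.
        text, count = re.subn(pattern, replacement, text)
        report.append((name, count))
    return text, report

# Assumed patterns, for illustration only.
searches = [
    ("Split Double Footnote", r"<sup>(\d+), (\d+)</sup>", r"<sup>\1</sup>,<sup>\2</sup>"),
    ("Clean Italic &", r"<i>&amp;</i>", "&amp;"),
]

sample = "<p><i>Hansel</i> <i>&amp;</i> <i>Gretel</i><sup>1, 2</sup></p>"
cleaned, report = run_group(searches, sample)

# Per-search breakdown first, then the lumped total.
for name, count in report:
    print(f"{name:<30}{count}")
print(f"{'Replacements made:':<30}{sum(c for _, c in report)}")
```

The total is still available at the end, so nothing is lost compared to the single "Replacements made" number.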

For example, if there was 1 "Double Footnote", I know that I have to look more closely when creating footnote links back/forth OR it could have been an OCR error.

Or if I get a hit on "Clean Italic &" I know that I have to go looking more closely. 99% of the time an italic ampersand is either NOT italic OR Finereader just didn't like the specific font used OR it was an actual OCR error. In the very rare case though, the ampersand might have been smack dab in the middle of a book title and the italic spaces around it were missed:

Code:

<i>Hansel</i> <i>&amp;</i> <i>Gretel</i>

would accidentally be corrected to this:

Code:

<i>Hansel</i> &amp; <i>Gretel</i>

If I saw 1 hit, I would then know to go searching for it and change it to this:

Code:

<i>Hansel &amp; Gretel</i>
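Once a count says there is a hit, the next step is finding it. A minimal Python sketch of that, assuming the search matches an italicized `&amp;` (the pattern, the sample text, and the `hits_with_context` helper are all hypothetical):

```python
import re

def hits_with_context(pattern, text, width=20):
    """Return each match along with up to `width` characters on either side."""
    snippets = []
    for m in re.finditer(pattern, text):
        start = max(0, m.start() - width)
        end = min(len(text), m.end() + width)
        snippets.append(text[start:end])
    return snippets

sample = "<p>They read <i>Hansel</i> <i>&amp;</i> <i>Gretel</i> aloud.</p>"
hits = hits_with_context(r"<i>&amp;</i>", sample)

# Print each hit so it can be checked by eye before running the replace.
for snippet in hits:
    print(snippet)
```

Seeing the neighbouring italic tags in the snippet is what tells you this one is a title and needs the manual fix instead of the automatic one.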

With one journal I worked on, I came up with a group of 25 Regexes (cleaning up stuff like dropcaps, normalizing code for figures/images/captions, converting the occasional theta image -> Θ. [...]).

Having a breakdown of the number of each fix would have also been helpful way back when:

"I know there are 10 articles and 10 dropcaps? 25 figures and 25 corrections? Good, now I don't have to look at it."