MobileRead Forums - View Single Post

kurochka · 05-20-2009, 05:23 PM

Quote:

Originally Posted by Sunlite

There are two problems with the regex feature in Notepad++:

1) Regular expressions can only search each line separately. There is no multi-line search.
--> This is no problem for the regex from pepak if you remove hard line breaks inside the paragraphs.

O-o-oh! Multiline regex is a genie that is hard to keep in the bottle. I presume you are a pro and it's no problem for you, but for the newbies in this thread I would like to put out a caution. Try to avoid multiline regex (where . or another wildcard can match any character, including a newline); in most cases that is possible. It can bite, and you will notice only when it is too late. I cannot even remember the last time I turned on or needed multiline regex. If you must use it, limit the number of lines searched to the minimum necessary, and audit the text after each such replacement to make sure that there are no undesired consequences. Alternatively, do not use Replace All, but go through the text one match at a time.
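For newbies, here is a minimal sketch of that caution. Python is used only for illustration (the thread itself is about editor regexes), and the sample text, the `<i>` tag, and the `audit` helper are all made up for the example. It compares a wildcard that may cross any number of line breaks with one capped at two lines, and lists every match so it can be checked by eye before anything is replaced.

```python
import re

# Hypothetical sample: one italic span broken across two lines, one on a single line.
text = (
    "<i>first\n"
    "note</i> plain line\n"
    "another plain line\n"
    "<i>second note</i>\n"
)

# Unbounded: with re.DOTALL the dot also matches newlines, so the greedy
# .* runs from the first <i> all the way to the last </i> in the text.
greedy = re.compile(r"<i>.*</i>", re.DOTALL)

# Bounded: the content may cross at most one line break (two lines in total)
# and may not run past another < character.
bounded = re.compile(r"<i>[^<\n]*(?:\n[^<\n]*){0,1}</i>")

def audit(pattern, s):
    """Return (line_number, matched_text) pairs for review before replacing."""
    return [(s.count("\n", 0, m.start()) + 1, m.group(0))
            for m in pattern.finditer(s)]

greedy_hits = audit(greedy, text)    # one match that swallows the plain lines
bounded_hits = audit(bounded, text)  # two separate matches, on lines 1 and 4

for line_no, matched in bounded_hits:
    print(line_no, repr(matched))
```

The greedy pattern finds a single match that swallows the two plain lines in between, which is exactly the kind of bite that is easy to miss with Replace All. Raising the `{0,1}` limit is a deliberate decision rather than a default.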

I was trying to think of some generalized regex that I could share, but all my strings are so specific to the problem at hand that I cannot think of anything with wide application. As an example, I have a document with both English and Ukrainian text and a complex structure. If it's an OCR, I may first save it to Word (which can search for formatting such as color, font size, italics, etc.). There I search for the symbols that are used in tags (e.g., < and >) and replace them with \< and \>, and then I put in formatting tags in Word (italics, bold, fonts, color, if necessary). An alternative would be to save the OCR as HTML, but I have found that the HTML conversion often makes such a mess of the text, with tagging I don't need, that I prefer to do it as described above.
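The escaping step can be sketched in a few lines. This is only an illustration in Python, not what Word does internally, and both function names are hypothetical. It assumes the text does not already contain the sequences \< or \>.

```python
def escape_angle_brackets(s):
    """Mark literal < and > with a backslash so they can be told apart
    from the formatting tags that are added afterwards."""
    return s.replace("<", r"\<").replace(">", r"\>")

def unescape_angle_brackets(s):
    """Undo escape_angle_brackets once the tags have been processed."""
    return s.replace(r"\<", "<").replace(r"\>", ">")

sample = "if a < b and b > c"
escaped = escape_angle_brackets(sample)
print(escaped)
```

Doing this before any tags are inserted means every unescaped < or > found later is known to belong to a tag.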

Then I open the text in a text editor (EmEditor in my case; it's the best out there). Typically, I analyze the text before doing anything else, looking for patterns. I start with some simple replacements that make the pattern more uniform. Even if there are tags, lines, etc. that will not ultimately be needed, I try to keep them for now to see if they reveal something about the pattern that I can later use in my regexes. Given that I work with two languages, English and Ukrainian, there are lots of OCR mistakes that mix Latin and Cyrillic, so I use ranges such as [a-zàâçéèêëîïôûùüÿœæ] and [а-яґєії] to separate the two. At each step, I try not to make an irreversible mistake. For this reason, every once in a while I save a new version of the document and keep the old one as a backup, so that I can revert to it if I do screw something up. It is easy to screw up when you have several hundred thousand lines.
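As a rough sketch of how the two ranges can be used, the Python snippet below flags words that contain letters from both alphabets, which is the typical OCR mix-up. The ranges are the ones quoted above; the function name and the sample sentence are made up, and the Cyrillic letters in the sample are written as \u escapes because look-alike letters cannot be told apart by eye.

```python
import re

# Character ranges from the post. а-я does not cover ґ, є, і, ї,
# so they are listed separately.
LATIN = "a-zàâçéèêëîïôûùüÿœæ"
CYRILLIC = "а-яґєії"

# re.IGNORECASE lets the lowercase ranges match capitals as well.
WORD = re.compile(f"[{LATIN}{CYRILLIC}]+", re.IGNORECASE)
HAS_LATIN = re.compile(f"[{LATIN}]", re.IGNORECASE)
HAS_CYRILLIC = re.compile(f"[{CYRILLIC}]", re.IGNORECASE)

def mixed_script_words(s):
    """Return the words that contain both Latin and Cyrillic letters."""
    return [w for w in WORD.findall(s)
            if HAS_LATIN.search(w) and HAS_CYRILLIC.search(w)]

# "K" + Cyrillic "иїв", and Cyrillic "с" + "olor", are the planted errors;
# the all-Cyrillic "Київ" and the all-Latin "color" are correct.
sample = ("K\u0438\u0457\u0432 and \u0441olor are OCR errors; "
          "\u041a\u0438\u0457\u0432 and color are fine.")

suspects = mixed_script_words(sample)
print(suspects)
```

This only lists suspects. Which alphabet a flagged word should end up in still has to be decided per case, one match at a time.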