MobileRead Forums - View Single Post

DiapDealer · 10-03-2013, 09:33 PM

Quote:

Originally Posted by user_none

Hooking in the regex stuff would be the hardest part but python has very strong regex (search and replace) support. You'd essentially get the data from the editor, run the search, get the offsets and highlight that area of the editor.

Python's standard re module leaves a lot to be desired with regard to non-ASCII text. Matthew Barnett's regex module (available from PyPI) does a much better job (and will hopefully replace re eventually) and is very PCRE-ish. But even then, as much as I know traditional Python fans might not like hearing it, I probably wouldn't start a new cross-platform Python EPUB editor project (one that will be dealing extensively with Unicode) on Python 2.x. I'd begin with Python 3.3 right from the get-go.
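For what it's worth, the "run the search, get the offsets" step user_none describes could look roughly like this on Python 3. This is only a sketch using the standard re module; find_spans and the sample text are made-up names for illustration, and the actual highlighting is left out because it depends entirely on whichever editor widget ends up being used.

```python
import re

def find_spans(text, pattern, flags=0):
    """Return the (start, end) offsets of every non-overlapping match.

    Hypothetical helper: an editor would pass these offsets to its own
    highlighting code.
    """
    return [m.span() for m in re.finditer(pattern, text, flags)]

# Sample text with non-ASCII letters, written with escapes so the
# code points are unambiguous: "naive cafe, naive" with accents.
sample = "na\u00efve caf\u00e9, na\u00efve"

spans = find_spans(sample, r"na\wve")
for start, end in spans:
    # Offsets are code point indexes into the str, so slicing gives
    # back exactly the matched text.
    print(start, end, sample[start:end])
```

On Python 3 the offsets are plain code point indexes into the string, so they can be used for slicing directly. How they map onto an editor's own cursor positions is a separate question.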

From the Python 3.3 release notes:

Quote:

Changes introduced by PEP 393 are the following:

Python now always supports the full range of Unicode codepoints, including non-BMP ones (i.e. from U+0000 to U+10FFFF). The distinction between narrow and wide builds no longer exists and Python now behaves like a wide build, even under Windows.
With the death of narrow builds, the problems specific to narrow builds have also been fixed, for example:
    len() now always returns 1 for non-BMP characters, so len('\U0010FFFF') == 1;
    surrogate pairs are not recombined in string literals, so '\uDBFF\uDFFF' != '\U0010FFFF';
    indexing or slicing non-BMP characters returns the expected value, so '\U0010FFFF'[0] now returns '\U0010FFFF' and not '\uDBFF';
    all other functions in the standard library now correctly handle non-BMP codepoints.
The value of sys.maxunicode is now always 1114111 (0x10FFFF in hexadecimal). The PyUnicode_GetMax() function still returns either 0xFFFF or 0x10FFFF for backward compatibility, and it should not be used with the new Unicode API (see issue 13054).
The ./configure flag --with-wide-unicode has been removed.
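The claims in that list are easy to check for yourself. Here is a minimal sketch, assuming Python 3.3 or later; pep393_checks is just a name I made up for the example.

```python
import sys

def pep393_checks():
    """Evaluate the PEP 393 examples from the release notes.

    Hypothetical helper; returns a dict of check name -> bool.
    """
    astral = "\U0010FFFF"  # highest code point, outside the BMP
    return {
        "len_is_one": len(astral) == 1,
        "index_returns_whole_char": astral[0] == astral,
        # A surrogate pair written out in a literal stays two separate
        # code points and is not recombined.
        "surrogates_not_recombined": "\uDBFF\uDFFF" != astral,
        "maxunicode_is_full_range": sys.maxunicode == 0x10FFFF,
    }

for name, ok in pep393_checks().items():
    print(name, ok)
```

On a narrow Python 2 build, the equivalent len() and indexing checks on a unicode string would come out differently, which is the whole point of the quoted changes.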

Concerning Barnett's regex module:

Quote:

The regex library therefore much more closely follows the (current) recommendations of UTS#18: Unicode Regular Expressions in how it approaches things. It meets or exceeds the UTS#18 Level 1 requirements in most if not all regards, something you normally have to use the ICU regex library or Perl itself for — or if you are especially courageous, the new Java 7 update to its regexes, as that also conforms to the Level One requirements from UTS#18.

Beyond meeting those Level One requirements, which are all absolutely essential for basic Unicode support but which are not met by Python’s current re library, the regex library also meets the Level Two requirements for RL2.5 Named Characters (\N{...}), RL2.2 Extended Grapheme Clusters (\X), and the new RL2.7 on Full Properties from revision 14 of UTS#18.

Matthew’s regex module also does Unicode casefolding so that case insensitive matches work reliably on Unicode, which re does not.
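To make the casefolding point concrete, here is a small sketch that uses only the standard library. It does not use Barnett's regex module at all. It just shows the kind of case-insensitive comparison that re's IGNORECASE misses and that full case folding (str.casefold(), new in Python 3.3) handles. The two helper names are made up for the example.

```python
import re

def ignorecase_match(a, b):
    """True if re, with IGNORECASE, matches literal a against all of b."""
    return re.search("^" + re.escape(a) + "$", b, re.IGNORECASE) is not None

def casefold_equal(a, b):
    """True if a and b are equal after full Unicode case folding."""
    return a.casefold() == b.casefold()

lower = "stra\u00dfe"  # German "strasse" spelled with sharp s (U+00DF)
upper = "STRASSE"

# re compares character by character, so one sharp s never lines up
# with the two-letter "SS".
print(ignorecase_match(lower, upper))

# Full case folding expands sharp s to "ss" before comparing.
print(casefold_equal(lower, upper))
```

As far as I can tell, re compares one character against one character, so a fold that changes the length of the text cannot match. The quote above is saying that the regex module can handle this inside the pattern matching itself.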

But that's just me.