MobileRead Forums - View Single Post - SQLite: Case-insensitive matching of Unicode characters

DaltonST · 07-29-2014, 04:04 PM

According to SQLite Frequently Asked Question #18 at http://sqlite.org/faq.html#q18 :

(18) Case-insensitive matching of Unicode characters does not work.
The default configuration of SQLite only supports case-insensitive comparisons of ASCII characters. The reason for this is that doing full Unicode case-insensitive comparisons and case conversions requires tables and logic that would nearly double the size of the SQLite library. The SQLite developers reason that any application that needs full Unicode case support probably already has the necessary tables and functions and so SQLite should not take up space to duplicate this ability. Instead of providing full Unicode case support by default, SQLite provides the ability to link against external Unicode comparison and conversion routines. The application can overload the built-in NOCASE collating sequence (using sqlite3_create_collation()) and the built-in like(), upper(), and lower() functions (using sqlite3_create_function()). The SQLite source code includes an "ICU" extension that does these overloads.

So, COLLATE NOCASE in a SQLite table definition or in a SELECT is only good for pure ASCII comparisons. Unless, of course, what is described above has been implemented.

Does anyone know if Calibre's SQLite has already been implemented with Unicode UTF-8 case insensitive matching as described above?

For example, Calibre would need this capability when searching for Tags in Unicode UTF-8 that have very non-ASCII characters, such as in the German word sachbüch, the Hindi word NAHĪMṀ, the Spanish word noficción, the Turkish word gerçek, and so forth. Ditto for Authors, Title, and Series.

Thanks in advance.

07-29-2014, 04:04 PM	#1
DaltonST Deviser Posts: 2,265 Karma: 2090983 Join Date: Aug 2013 Location: Texas Device: none	SQLite: Case-insensitive matching of Unicode characters According to SQLite Frequently Asked Question #18 at http://sqlite.org/faq.html#q18 : (18) Case-insensitive matching of Unicode characters does not work. The default configuration of SQLite only supports case-insensitive comparisons of ASCII characters. The reason for this is that doing full Unicode case-insensitive comparisons and case conversions requires tables and logic that would nearly double the size of the SQLite library. The SQLite developers reason that any application that needs full Unicode case support probably already has the necessary tables and functions and so SQLite should not take up space to duplicate this ability. Instead of providing full Unicode case support by default, SQLite provides the ability to link against external Unicode comparison and conversion routines. The application can overload the built-in NOCASE collating sequence (using sqlite3_create_collation()) and the built-in like(), upper(), and lower() functions (using sqlite3_create_function()). The SQLite source code includes an "ICU" extension that does these overloads. So, COLLATE NOCASE in a SQLite table definition or in a SELECT is only good for pure ASCII comparisons. Unless, of course, what is described above has been implemented. Does anyone know if Calibre's SQLite has already been implemented with Unicode UTF-8 case insensitive matching as described above? For example, Calibre would need this capability when searching for Tags in Unicode UTF-8 that have very non-ASCII characters, such as in the German word sachbüch, the Hindi word NAHĪMṀ, the Spanish word noficción, the Turkish word gerçek, and so forth. Ditto for Authors, Title, and Series. Thanks in advance. Last edited by DaltonST; 07-29-2014 at 08:51 PM.