Implicit Collation in Special Cases

By default, SingleStore uses utf8 (alias for utf8mb3) and utf8_general_ci for character_set_server and collation_server respectively. When character_set_server is set to utf8, SingleStore accepts characters that use 4-byte encoding, stores them, and displays them in the output. However, as 4-byte characters are invalid in utf8, the string literals are implicitly assigned binary collation and processed as a sequence of bytes rather than characters. This implicit conversion to binary collation causes string functions and operations to return unexpected results.

For example, the CHARACTER_LENGTH string function returns the following output when used against a string literal with 4-byte encoding.

SHOW VARIABLES like "collation%";
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.001 sec)

select character_length("敥"), character_length("😀");
+-------------------------+--------------------------+
| character_length("敥")  | character_length("😀")    |
+-------------------------+--------------------------+
|                       1 |                        4 |
+-------------------------+--------------------------+
1 row in set (0.092 sec)

In the above example, SingleStore tries to interpret the provided string literal using utf8_general_ci. As the string literal contains invalid characters for the current collation, SingleStore implicitly assigns a binary collation to the string literal and processes it as a sequence of bytes. This implicit use of binary collation causes CHARACTER_LENGTH("😀") to return 4 instead of 1 (as it would be if character_set_server was utf8mb4), where 4 is a size in bytes of the string literal encoding.

To avoid the implicit use of binary collation for string literals in string functions, either use explicit type casting or use database columns defined with the utf8mb4 character set. The following example uses explicit typecasting of the string literal.

SELECT character_length("😀" :> CHAR(20) COLLATE utf8mb4_unicode_ci) as result;
+--------+
| result |
+--------+
|      1 |
+--------+

Alternatively, you can use a database column defined with the utf8mb4 character set.

CREATE TABLE t(a VARCHAR(200) COLLATE utf8mb4_unicode_ci);
Query OK, 0 rows affected (0.329 sec)

INSERT INTO t VALUES("Hi 😀!");
Query OK, 1 row affected (0.084 sec)

SELECT * FROM t;
+--------+
| result |
+--------+
|      1 |
+--------+
1 row in set (0.127 sec)

SELECT a, character_length(a) AS len FROM t;
+----------+------+
| a        | len  |
+----------+------+
| Hi 😀!    |    5 |
+----------+------+