Collations Supported

A collation determines the ordering of characters in a character set and is used for sorting and comparison of columns defined to use that collation. Collations determine if case and accents are used in ordering; for example, if 'Apple' is equal to 'apple'. Collations also determine which characters are considered equal; for example, if 'køhler' is equal to 'kohler'.

To view the available collations and the character sets to which they apply, use the SHOW COLLATION command. Alternatively, query the COLLATIONS information schema view as shown in the following example.

SQL

SELECT *
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4' LIMIT 5;

+----------------------+--------------------+-----+------------+-------------+---------+
| COLLATION_NAME       | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN |
+----------------------+--------------------+-----+------------+-------------+---------+
| utf8mb4_general_ci   | utf8mb4            |  45 | Yes        | Yes         |       1 |
| utf8mb4_bin          | utf8mb4            |  46 |            | Yes         |       1 |
| utf8mb4_unicode_ci   | utf8mb4            | 224 |            | Yes         |       8 |
| utf8mb4_icelandic_ci | utf8mb4            | 225 |            | Yes         |       8 |
| utf8mb4_latvian_ci   | utf8mb4            | 226 |            | Yes         |       8 |
+----------------------+--------------------+-----+------------+-------------+---------+

When a character set is used, the default collation of the character set is automatically assigned unless otherwise specified. In the preceding command output, default collations are indicated by the Yes value in the IS_DEFAULT field.

The default collation for each character set can be retrieved from the CHARACTER_SETS information schema view, or from the results of the SHOW COLLATION or the SHOW CHARACTER SET command. Note that a collation is applicable to only one character set.

By default, SingleStore uses the utf8 character set along with its default collation utf8_general_ci. If the collation and character set are not set explicitly, newly-created text columns use the value of collation_server, which is by default utf8_general_ci collation.

SingleStore recommends using binary collations, such as utf8mb4_bin, for best performance. Refer to Choosing a Collation for more information.

Retrieve the value of the collation in current use by using the @@ in a SELECT for one of the following collation variables: collation_server, collation_connection, and collation_database.

SQL

SELECT @@character_set_server;

+------------------------+
| @@character_set_server |
+------------------------+
| utf8                   |
+------------------------+

SQL

SELECT @@collation_server;

+--------------------+
| @@collation_server |
+--------------------+
| utf8_general_ci    |
+--------------------+

To view the mapping between collations and character sets, use the following command.

SQL

SELECT *
FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY;

Note

When used to specify a collation, binary must be surrounded by backticks (`binary`) or single quotes ('binary') as shown in the following examples.

SQL

CREATE TABLE t(a TEXT COLLATE binary);

ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'binary)' at line 1

SQL

CREATE TABLE t(a TEXT COLLATE 'binary');

Query OK, 0 rows affected (0.03 sec)

Types of Collations and Naming Conventions

Collations differ in whether they are case-sensitive or -insensitive, accent-sensitive or -insensitive, and whether they consider characters (e.g. o and ø) equal or not.

Collation names are prefixed with their associated character set name, typically followed by one or more suffixes representing the collation properties. For example, language-specific collations such as utf8_swedish_ci contain the language as a suffix and then the _ci suffix to indicate case-insensitivity.

SingleStore supports case-insensitive (_ci) collations. Case-sensitive (_cs), accent-insensitive (_ai), and accent-sensitive (_as) collations are not supported. For the collations that SingleStore supports, a collation's accent-sensitivity is determined by its case-sensitivity designation, _ci in the name implies _ai.

The _bin suffix indicates a binary collation for a specific character set, which is different from the binary collation. For example, utf8mb4_bin is a binary collation for the utf8mb4 character set.

The binary collation is the only collation for the binary character set. The binary collation defines rules for character comparisons based on the byte values of the characters.

In contrast, collations of character sets with the _bin suffix define comparisons based on numeric character code values. Numeric character code values differ from byte values for multibyte characters. Additionally, the utf8mb4 character set follows UTF8 rules, while the binary character set stores any binary values.

Choosing a Collation

Collations have an impact on performance. Binary, case-sensitive collations have the fastest performance, while collations that treat upper case and lower case as equal (case-insensitive) and those that treat different representations of a character as equal (e.g. o = ø) have slower performance.

SingleStore recommends using binary collations, such as utf8mb4_bin, for best performance. Comparisons using _bin collations compare using Unicode values, which is faster than comparisons using other types of collations which apply linguistic rules. This performance difference impacts lookups, joins, filters, and grouping.

However, binary collations are case-sensitive, accent-sensitive, and do not account for different representations of certain characters. For example, with a binary collation 'apple' != 'Apple' and 'køhler' != 'kohler'.

Binary collations support the same Unicode characters as their corresponding Unicode collation, but the comparison rules are different. For example, the binary collation utf8mb4_bin supports the same Unicode characters as utf8mb4_unicode_ci.

General case-insensitive collations (*_general_ci), and Unicode collations (*_unicode_ci) provide more sophisticated collation algorithms than binary collations (*_bin) at the cost of performance.

For any Unicode character set, the _general_ci collations are faster than the _unicode_ci collations. The utf8mb4_general_ci collation does not fully inherit Unicode rules for all languages, but is case-insensitive.

For example, comparisons for the utf8mb4_general_ci collation are faster, but less precise, than comparisons for utf8mb4_unicode_ci. The utf8mb4_unicode_ci collation supports mappings such as expansions; that is, when one character is considered equal to combinations of other characters. For example, ß is equal to ss in German, and o is equal to ø in some Scandinavian languages.

Thus for the utf8mb4 character set:

utf8mb4_bin is the fastest, but is case-sensitive and does not support expansions.
utf8mb4_general_ci is slower, and is case-insensitive, but does not support expansions.
utf8mb4_unicode_ci is the slowest, it is case-insensitive and supports expansions.

SingleStore recommends using case-insensitive, accent-insensitive, and Unicode collations only when explicitly needed by the application.

Unicode Collation Behavior

SingleStore's utf8mb4_unicode_ci collation uses the Unicode Collation Algorithm (UCA) version 13.0. Databases that implement the _unicode_ci collations according to prior UCA versions may have slightly different collation behavior. For example, in the current UCA version the characters o and ø are considered equal, but this was not true in prior versions including the UCA used by MySQL.

Collations Supported

On this page

Types of Collations and Naming Conventions

Choosing a Collation

Unicode Collation Behavior

Was this article helpful?

On this page

Was this article helpful?