Collations Supported

Each character set supported in SingleStore can have multiple collations with one default collation per character set.

A collation determines the ordering of characters in a character set and is used for sorting and comparison of columns defined to use that collation. Collations determine if case and accents are used in ordering; for example, if 'Apple' is equal to 'apple'. Collations also determine which characters are considered equal; for example, if 'køhler' is equal to 'kohler'.

To view the available collations and the character sets to which they apply, use the SHOW COLLATION command. Alternatively, query the COLLATIONS information schema view as shown in the following example.

SELECT *
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4' LIMIT 5;
+----------------------+--------------------+-----+------------+-------------+---------+
| COLLATION_NAME       | CHARACTER_SET_NAME | ID  | IS_DEFAULT | IS_COMPILED | SORTLEN |
+----------------------+--------------------+-----+------------+-------------+---------+
| utf8mb4_general_ci   | utf8mb4            |  45 | Yes        | Yes         |       1 |
| utf8mb4_bin          | utf8mb4            |  46 |            | Yes         |       1 |
| utf8mb4_unicode_ci   | utf8mb4            | 224 |            | Yes         |       8 |
| utf8mb4_icelandic_ci | utf8mb4            | 225 |            | Yes         |       8 |
| utf8mb4_latvian_ci   | utf8mb4            | 226 |            | Yes         |       8 |
+----------------------+--------------------+-----+------------+-------------+---------+

When a character set is used, the default collation of the character set is automatically assigned unless otherwise specified. In the preceding command output, default collations are indicated by the Yes value in the IS_DEFAULT field. The default collation for each character set can be retrieved from the CHARACTER_SETS information schema view, or from the results of the SHOW COLLATION or the SHOW CHARACTER SET command. Note that a collation is applicable to only one character set.

By default, SingleStore uses the utf8 character set along with its default collation utf8_general_ci. If the collation and character set are not set explicitly, newly-created text columns use the value of collation_server, which is by default utf8_general_ci collation.

SingleStore recommends using binary collations, such as utf8mb4_bin, for best performance.  Refer to Choosing a Collation for more information.

Retrieve the value of the collation in current use by using the @@ in a SELECT for one of the following collation variables: collation_server, collation_connection, and collation_database.

SELECT @@character_set_server;
+------------------------+
| @@character_set_server |
+------------------------+
| utf8                   |
+------------------------+
SELECT @@collation_server;
+--------------------+
| @@collation_server |
+--------------------+
| utf8_general_ci    |
+--------------------+

To view the mapping between collations and character sets, use the following command.

SELECT *
FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY;

Note: When used to specify a collation, binary must be surrounded by backticks (`binary`) or single quotes ('binary') as shown in the following examples.

CREATE TABLE t(a TEXT COLLATE binary);
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'binary)' at line 1
CREATE TABLE t(a TEXT COLLATE 'binary');
Query OK, 0 rows affected (0.03 sec)

Types of Collations and Naming Conventions

Collations differ in whether they are case-sensitive or -insensitive, accent-sensitive or -insensitive, and whether they consider characters (e.g. o and ø) equal or not.

Collation names are prefixed with their associated character set name, typically followed by one or more suffixes representing the collation properties. For example, language-specific collations such as utf8_swedish_ci contain the language as a suffix and then the _ci suffix to indicate case-insensitivity.

SingleStore supports case-insensitive (_ci) collations. Case-sensitive (_cs), accent-insensitive (_ai), and accent-sensitive (_as) collations are not supported. For the collations SingleStore supports, a collation's accent-sensitivity is determined by its case-sensitivity designation, _ci in the name implies _ai.

The _bin suffix indicates a binary collation for a specific character set, which is different from the binary collation. For example, utf8mb4_bin is a binary collation for the utf8mb4 character set.

The binary collation is the only collation for the binary character set. The binary collation defines rules for character comparisons based on the byte values of the characters.

In contrast, collations of character sets with the _bin suffix define comparisons based on numeric character code values. Numeric character code values differ from byte values for multibyte characters. Additionally, the utf8mb4 charset follows UTF8 rules, while the binary charset stores any binary values.

Choosing a Collation

Collations have an impact on performance. Binary, case-sensitive collations have the fastest performance, while collations that treat upper case and lower case as equal (case-insensitive) and those that treat different representations of a character as equal (e.g. o = ø) have slower performance.

SingleStore recommends using binary collations, such as utf8mb4_bin, for best performance. Comparisons using _bin collations compare using Unicode values, which is faster than comparisons using other types of collations which apply linguistic rules. This performance difference impacts lookups, joins, filters, and grouping.

However, binary collations are case-sensitive, accent-sensitive, and do not account for different representations of certain characters. For example, with a binary collation 'apple' != 'Apple' and 'køhler' != 'kohler'.

Binary collations support the same Unicode characters as their corresponding Unicode collation, but the comparison rules are different. For example, the binary collation utf8mb4_bin supports the same Unicode characters as utf8mb4_unicode_ci.

General case-insensitive collations (*_general_ci), and Unicode collations (*_unicode_ci) provide more sophisticated collation algorithms than binary collations (*_bin) at the cost of performance.

For any Unicode character set, the _general_ci collations are faster than the _unicode_ci collations. The utf8mb4_general_ci collation does not fully inherit Unicode rules for all languages, but is case-insensitive.

For example, comparisons for the utf8mb4_general_ci collation are faster, but less precise, than comparisons for utf8mb4_unicode_ci. The utf8mb4_unicode_ci collation supports mappings such as expansions; that is, when one character is considered equal to combinations of other characters. For example, ß is equal to ss in German, and o is equal to ø in some Scandinavian languages.

Thus for the utf8mb4 character set:

  • utf8mb4_bin is the fastest, but is case-sensitive and does not support expansions.

  • utf8mb4_general_ci is slower, and is case-insensitive, but does not support expansions.

  • utf8mb4_unicode_ci is the slowest, it is case-insensitive and supports expansions.

SingleStore recommends using case-insensitive, accent-insensitive, and Unicode collations only when explicitly needed by the application.

Unicode Collation Behavior

SingleStore's utf8mb4_unicode_ci collation uses the Unicode Collation Algorithm (UCA) version 13.0. Databases that implement the _unicode_ci collations according to prior UCA versions may have slightly different collation behavior. For example, in the current UCA version the characters o and ø are considered equal, but this was not true in prior versions including the UCA used by MySQL.

Last modified:

Was this article helpful?

Verification instructions

Note: You must install cosign to verify the authenticity of the SingleStore file.

Use the following steps to verify the authenticity of singlestoredb-server, singlestoredb-toolbox, singlestoredb-studio, and singlestore-client SingleStore files that have been downloaded.

You may perform the following steps on any computer that can run cosign, such as the main deployment host of the cluster.

  1. (Optional) Run the following command to view the associated signature files.

    curl undefined
  2. Download the signature file from the SingleStore release server.

    • Option 1: Click the Download Signature button next to the SingleStore file.

    • Option 2: Copy and paste the following URL into the address bar of your browser and save the signature file.

    • Option 3: Run the following command to download the signature file.

      curl -O undefined
  3. After the signature file has been downloaded, run the following command to verify the authenticity of the SingleStore file.

    echo -n undefined |
    cosign verify-blob --certificate-oidc-issuer https://oidc.eks.us-east-1.amazonaws.com/id/CCDCDBA1379A5596AB5B2E46DCA385BC \
    --certificate-identity https://kubernetes.io/namespaces/freya-production/serviceaccounts/job-worker \
    --bundle undefined \
    --new-bundle-format -
    Verified OK

Try Out This Notebook to See What’s Possible in SingleStore

Get access to other groundbreaking datasets and engage with our community for expert advice.