Character Sets Supported

SingleStore supports a variety of character sets in the Unicode standard and their associated collations. To view the supported character sets, run the SHOW CHARACTER SET command. It displays each character set along with its description, default collation, and the maximum number of bytes per character.

SHOW CHARACTER SET;
+---------+-----------------------+--------------------+--------+
| Charset | Description           | Default collation  | Maxlen |
+---------+-----------------------+--------------------+--------+
| utf8mb4 | UTF-8 Unicode         | utf8mb4_general_ci |      4 |
| utf8    | UTF-8 Unicode         | utf8_general_ci    |      3 |
| binary  | Binary pseudo charset | binary             |      1 |
+---------+-----------------------+--------------------+--------+

Alternatively, you can query the INFORMATION_SCHEMA.CHARACTER_SETS view with a SELECT statement, optionally filtering the results with a WHERE clause (for example, with = or LIKE).

SELECT * FROM INFORMATION_SCHEMA.CHARACTER_SETS WHERE CHARACTER_SET_NAME = 'utf8mb4';
+--------------------+----------------------+---------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION   | MAXLEN |
+--------------------+----------------------+---------------+--------+
| utf8mb4            | utf8mb4_general_ci   | UTF-8 Unicode |      4 |
+--------------------+----------------------+---------------+--------+

Character Sets Supported by SingleStore Features

binary

A character set used for encoding binary strings. This character set has binary as the default collation.

Important

The binary character set is a universal feature that is supported across most applicable database schema objects and commands.

utf8

An alias for utf8mb3, a Unicode character set that encodes each character using 1 to 3 bytes. It covers the characters in the Basic Multilingual Plane (BMP). Its default collation is utf8_general_ci.

Important

The utf8 character set is a universal feature that is supported across most applicable database schema objects and commands.

utf8mb4

A Unicode character set that encodes each character using 1 to 4 bytes. It covers all characters in the Basic Multilingual Plane (BMP) as well as supplementary characters that lie outside the BMP, such as pictographic symbols (emojis), ancient scripts like Egyptian hieroglyphs, and the supplementary private use areas (PUA). Its default collation is utf8mb4_general_ci.

utf8mb4 is supported for specific database schema objects and commands that are discussed in the following sections.
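To make the 3-byte versus 4-byte distinction concrete, here is a minimal client-side Python sketch. It does not call SingleStore; it only reports how many bytes UTF-8 needs for a few sample characters and whether each one lies in the BMP (and so fits within the 3-byte limit of utf8). The helper names utf8_byte_length and fits_in_utf8mb3 are hypothetical and chosen for illustration.

```python
def utf8_byte_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(ch: str) -> bool:
    """Return True if the character is in the BMP (at most 3 bytes in UTF-8)."""
    return ord(ch) <= 0xFFFF


# Sample characters, written as escapes so the code points are unambiguous:
# Latin letter, accented letter, euro sign, check mark, smiley, hieroglyph.
samples = ["A", "\u00e9", "\u20ac", "\u2705", "\U0001F642", "\U00013000"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}: {utf8_byte_length(ch)} byte(s), "
        f"fits in utf8: {fits_in_utf8mb3(ch)}"
    )
```

Characters above U+FFFF, such as the last two samples, need 4 bytes and therefore require utf8mb4.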

Data Types

The following data types allow you to store utf8mb4 Unicode characters.

  • JSON

  • CHAR

  • VARCHAR

  • LONGTEXT, MEDIUMTEXT, TEXT, TINYTEXT

  • ENUM

  • SET

String Functions

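String functions can be used with strings that use the utf8mb4 character set. For example, the LENGTH string function returns the number of bytes in a string that uses the utf8mb4 character set. In the following query, the 12 ASCII characters take 1 byte each and the emoji takes 4 bytes, for a total of 16 bytes.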

SELECT LENGTH('Hello world!🙂');
+----------------------------+
| LENGTH('Hello world!🙂')   |
+----------------------------+
|                         16 |
+----------------------------+
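The difference between byte length and character count can be reproduced outside the database. The following Python sketch is a client-side illustration only and assumes the string is encoded as UTF-8, as it would be under utf8mb4.

```python
# Same string as in the LENGTH example; the emoji is U+1F642.
text = "Hello world!\U0001F642"

byte_length = len(text.encode("utf-8"))  # bytes, which is what LENGTH counts
char_length = len(text)                  # Unicode code points

print(byte_length)  # 16
print(char_length)  # 13
```

The 3-byte gap between the two numbers comes entirely from the emoji, which is one character but four bytes.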

JSON Functions

JSON Functions can be used with JSON columns and string arguments with the utf8mb4 character set. For example, the JSON_AGG function aggregates a JSON column that supports the utf8mb4 character set.

CREATE TABLE events (name VARCHAR(20), registrations INT, comments JSON COLLATE utf8mb4_general_ci);
INSERT INTO events VALUES ("Swimming",50,'{"Registration closed":"✅"}'), ("Biking",28,'{"Registration is open":"⏸"}'), ("Powerlifting",22,'{"Registration is open":"⏸"}');
SELECT JSON_AGG(comments) FROM events;
+-----------------------------------------------------------------------------------------------+
| JSON_AGG(comments)                                                                            |
+-----------------------------------------------------------------------------------------------+
| [{"Registration is open":"⏸"},{"Registration closed":"✅"},{"Registration is open":"⏸"}]     |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.05 sec)

Procedural Extensions

In procedural extensions such as stored procedures and user-defined functions, you can use parameters and variables with utf8mb4 Unicode characters. In addition, the tables and columns introduced in procedural extensions can store utf8mb4 Unicode characters.

SingleStore Pipelines

SingleStore Pipelines can ingest and process data with the utf8mb4 character set from the supported data sources. The columns that will store the ingested data must be configured to support the utf8mb4 character set.

LOAD DATA

The LOAD DATA statement allows you to import files with any supported character set, including utf8mb4, into SingleStore. The columns that will store the imported data must be configured to support the utf8mb4 character set.
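Before importing, it can help to know whether the source data actually contains characters that need utf8mb4. The following Python sketch is one possible client-side pre-check, not a SingleStore feature: it flags rows that contain any character outside the BMP. The function names and the sample rows are hypothetical.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the BMP (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def rows_needing_utf8mb4(rows):
    """Yield (row_number, row) for each row containing a 4-byte character."""
    for number, row in enumerate(rows, start=1):
        if needs_utf8mb4(row):
            yield number, row


# Hypothetical sample data; the second row ends with U+1F6B4 (a cyclist emoji).
sample_rows = ["Swimming,50", "Biking,28,\U0001F6B4", "Powerlifting,22"]

flagged = list(rows_needing_utf8mb4(sample_rows))
print(flagged)
```

Here only the second row is flagged, so the column receiving its last field would need to support utf8mb4.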

Last modified: January 8, 2024
