Character Sets Supported
SingleStore supports a variety of character sets in the Unicode standard and their associated collations. To view the list of supported character sets, run the SHOW CHARACTER SET command. The output lists each character set along with its default collation and the maximum byte length of a character in that set.
SHOW CHARACTER SET;

+---------+-----------------------+--------------------+--------+
| Charset | Description           | Default collation  | Maxlen |
+---------+-----------------------+--------------------+--------+
| utf8mb4 | UTF-8 Unicode         | utf8mb4_general_ci |      4 |
| utf8    | UTF-8 Unicode         | utf8_general_ci    |      3 |
| binary  | Binary pseudo charset | binary             |      1 |
+---------+-----------------------+--------------------+--------+
Alternatively, you can retrieve the supported character sets from the INFORMATION_SCHEMA.CHARACTER_SETS view by using a SELECT statement with an optional WHERE clause, which can include LIKE pattern matching.
SELECT * FROM INFORMATION_SCHEMA.CHARACTER_SETS WHERE CHARACTER_SET_NAME = 'utf8mb4';

+--------------------+----------------------+---------------+--------+
| CHARACTER_SET_NAME | DEFAULT_COLLATE_NAME | DESCRIPTION   | MAXLEN |
+--------------------+----------------------+---------------+--------+
| utf8mb4            | utf8mb4_general_ci   | UTF-8 Unicode |      4 |
+--------------------+----------------------+---------------+--------+
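As a sketch of the LIKE filtering mentioned above, the following query lists only the UTF-8 variants from the same view. The column names are taken from the output shown above; the 'utf8%' pattern and the ORDER BY clause are illustrative choices, not required forms.

```sql
-- List the UTF-8 character sets with the maximum bytes per character for each.
-- The 'utf8%' pattern is an illustrative filter.
SELECT CHARACTER_SET_NAME, DEFAULT_COLLATE_NAME, MAXLEN
FROM INFORMATION_SCHEMA.CHARACTER_SETS
WHERE CHARACTER_SET_NAME LIKE 'utf8%'
ORDER BY MAXLEN;
```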
Character Sets Supported by SingleStore Features
binary
A character set used for encoding binary strings. Its default collation is binary.
Important
The binary character set is a universal feature that is supported across most applicable database schema objects and commands.
utf8
An alias for utf8mb3, which is a Unicode character set that encodes each character using 1 to 3 bytes. It covers the characters in the Basic Multilingual Plane (BMP). Its default collation is utf8_general_ci.
Important
The utf8 character set is a universal feature that is supported across most applicable database schema objects and commands.
utf8mb4
A Unicode character set that encodes each character using 1 to 4 bytes. It covers all the characters in the BMP as well as supplementary characters that lie outside the BMP, such as pictographic symbols (emojis), ancient scripts such as Egyptian hieroglyphs, and the supplementary private use areas (PUA). Its default collation is utf8mb4_general_ci.
utf8mb4 is supported for the specific database schema objects and commands discussed in the following sections.
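The practical difference between utf8 and utf8mb4 is how many bytes a character may occupy. A minimal sketch, assuming the client connection sends string literals as UTF-8: LENGTH counts the bytes and HEX displays them, so each sample character below reveals its encoded width. The sample characters are illustrative.

```sql
-- Standard UTF-8 byte sequences for four sample characters:
--   'A'  U+0041   -> 41           (1 byte, in the BMP)
--   'é'  U+00E9   -> C3 A9        (2 bytes, in the BMP)
--   '€'  U+20AC   -> E2 82 AC     (3 bytes, in the BMP)
--   '🙂' U+1F642  -> F0 9F 99 82  (4 bytes, outside the BMP)
SELECT LENGTH('A'), HEX('A'),
       LENGTH('é'), HEX('é'),
       LENGTH('€'), HEX('€'),
       LENGTH('🙂'), HEX('🙂');
```

Only the last character needs 4 bytes, which is why it requires utf8mb4 rather than utf8.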
Data Types
The following data types allow you to store utf8mb4 Unicode characters.
JSON
CHAR
VARCHAR
LONGTEXT, MEDIUMTEXT, TEXT, TINYTEXT
ENUM
SET
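A minimal sketch of a table that uses several of these data types with a utf8mb4 collation follows. The table name, column names, and sample values are hypothetical; the column-level COLLATE clause mirrors the form used for JSON columns elsewhere on this page.

```sql
-- Hypothetical table; all names are for illustration only.
CREATE TABLE feedback (
  id INT,
  author VARCHAR(50) COLLATE utf8mb4_general_ci,
  body TEXT COLLATE utf8mb4_general_ci,
  details JSON COLLATE utf8mb4_general_ci
);

-- Store a 4-byte character in each utf8mb4 column.
INSERT INTO feedback VALUES
  (1, 'Ana 🙂', 'Great event 🎉', '{"rating":"👍"}');
```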
String Functions
String functions can be used with strings that use the utf8mb4 character set. For example, the LENGTH string function returns the number of bytes in a string that uses the utf8mb4 character set.
SELECT LENGTH('Hello world!🙂');

+----------------------------+
| LENGTH('Hello world!🙂') |
+----------------------------+
|                         16 |
+----------------------------+
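To count characters rather than bytes, a character-based function can be placed next to LENGTH. The sketch below assumes the session uses a utf8mb4 collation, so that the emoji is treated as a single character; the column aliases are illustrative.

```sql
-- LENGTH counts bytes: 12 single-byte characters + one 4-byte emoji = 16.
-- CHARACTER_LENGTH counts characters instead of bytes.
SELECT LENGTH('Hello world!🙂') AS byte_length,
       CHARACTER_LENGTH('Hello world!🙂') AS char_length;
```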
JSON Functions
JSON functions can be used with JSON columns and string arguments that use the utf8mb4 character set. For example, the JSON_AGG function aggregates a JSON column that supports the utf8mb4 character set.
CREATE TABLE events (name VARCHAR(20), registrations INT, comments JSON COLLATE utf8mb4_general_ci);

Query OK, 0 rows affected (0.06 sec)

INSERT events VALUES
  ("Swimming",50,'{"Registration closed":"✅"}'),
  ("Biking",28,'{"Registration is open":"⏸"}'),
  ("Powerlifting",22,'{"Registration is open":"⏸"}');

Query OK, 3 rows affected (0.19 sec)

SELECT JSON_AGG(comments) FROM events;

+-----------------------------------------------------------------------------------------------+
| JSON_AGG(comments)                                                                            |
+-----------------------------------------------------------------------------------------------+
| [{"Registration is open":"⏸"},{"Registration closed":"✅"},{"Registration is open":"⏸"}] |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.05 sec)
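Other JSON functions can read individual utf8mb4 values back out of the same column. As a hedged sketch against the events table above, the query below uses JSON_EXTRACT_STRING to pull the value stored under one key; the status alias is illustrative, and rows without that key would be expected to return NULL.

```sql
-- Extract the string stored under the "Registration closed" key, if present.
SELECT name,
       JSON_EXTRACT_STRING(comments, 'Registration closed') AS status
FROM events;
```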
Procedural Extensions
In procedural extensions such as stored procedures and user-defined functions, you can use parameters and variables with utf8mb4 Unicode characters. In addition, the tables and columns introduced in procedural extensions can store utf8mb4 Unicode characters.
SingleStore Pipelines
SingleStore Pipelines can ingest and process data with the utf8mb4 character set from the supported data sources. The columns that will store the ingested data must be configured to support the utf8mb4 character set.
LOAD DATA
The LOAD DATA statement allows you to import files with any supported character set, including utf8mb4, into SingleStore. The columns that will store the imported data must be configured to support the utf8mb4 character set.
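A minimal sketch of such an import follows. The file path and the feedback table are hypothetical, the table is assumed to already exist with utf8mb4 columns, and the delimiter options are assumptions about the file's format. The placement of the CHARACTER SET clause after the table name is also an assumption that should be checked against the LOAD DATA reference.

```sql
-- Hypothetical file path and table name; adjust the delimiters to match the file.
LOAD DATA INFILE '/tmp/feedback.csv'
INTO TABLE feedback
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
```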