Character Encoding

Working with Character Sets and Collations

A character set is a collection of symbols and their encodings. A collation defines the rules for comparing and sorting the characters in a character set. SingleStore supports a variety of character sets and each character set can have multiple collations.
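The difference between a character set and a collation can be sketched outside the database. The following Python snippet is only an analogy, not SingleStore behavior: it uses code-point ordering to stand in for a binary collation and `str.casefold` to stand in for a case-insensitive (`_ci`) collation. The sample words and the helper `equal_ci` are hypothetical names introduced for illustration.

```python
# Illustration only: Python string operations standing in for collation rules.
# The same characters (one "character set") can be ordered and compared
# differently depending on the rules applied (the "collation").

words = ["banana", "Cherry", "apple"]

# Binary-style ordering: compare by code point, so uppercase "C" (U+0043)
# sorts before lowercase "a" (U+0061).
binary_order = sorted(words)

# Case-insensitive ordering, loosely analogous to a "_ci" collation.
ci_order = sorted(words, key=str.casefold)


def equal_ci(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings while ignoring case."""
    return a.casefold() == b.casefold()


print(binary_order)               # ['Cherry', 'apple', 'banana']
print(ci_order)                   # ['apple', 'banana', 'Cherry']
print("Apple" == "apple")         # False (binary comparison)
print(equal_ci("Apple", "apple")) # True (case-insensitive comparison)
```

Real collations cover more than case (accents, language-specific ordering, padding), so treat this as a mental model rather than a description of any specific collation.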

By default, the character set and collation are set to utf8 and utf8_general_ci, respectively, across the cluster. Your application might have requirements that the default settings do not support, such as emojis or full-text search. You can override the defaults, as explained on the Specifying Character Set and Collation for Clusters page.

Unicode Support

SingleStore supports the Unicode standard, including the characters in the Basic Multilingual Plane (BMP) and the supplementary characters that lie outside it. The BMP covers the 65,536 code points from U+0000 to U+FFFF, which are encoded in 1 to 3 bytes per character. The supplementary characters, whose code points range from U+10000 to U+10FFFF, are encoded in 4 bytes per character. With the 4-byte character encoding (utf8mb4), SingleStore supports all BMP characters as well as the supplementary characters, which include pictographic symbols (emojis), ancient scripts such as Egyptian hieroglyphs, and the supplementary private use areas (PUA).
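The byte lengths described above can be checked with a short Python sketch. It relies only on Python's built-in UTF-8 encoder and does not connect to SingleStore. The helper names `utf8_length` and `needs_four_byte_encoding` are hypothetical, and the assumption is that any character above U+FFFF is one that the 3-byte utf8 character set cannot store and that utf8mb4 can.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_four_byte_encoding(text: str) -> bool:
    """Return True if any character lies outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Sample characters, written as escapes so the source stays ASCII-only.
samples = [
    "A",           # U+0041, Latin capital letter A
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, grinning face emoji
    "\U00013000",  # U+13000, Egyptian hieroglyph A001
]

for ch in samples:
    plane = "BMP" if ord(ch) <= 0xFFFF else "supplementary"
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), {plane}")
```

The first three samples fall in the BMP and take 1, 2, and 3 bytes; the emoji and the hieroglyph are supplementary characters and take 4 bytes each. A check such as `needs_four_byte_encoding` could be run over application data to estimate whether the 4-byte encoding is required before changing the defaults.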


Last modified: November 18, 2022
