Character Encoding

Working with Character Sets and Collations

A character set is a collection of symbols and their encodings. A collation defines the rules for comparing and sorting the characters in a character set. SingleStore supports a variety of character sets, and each character set can have multiple collations.
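To make the difference between an encoding and a collation concrete, here is a minimal Python sketch. It is an analogy only, not SingleStore's collation implementation: plain `sorted()` stands in for a binary collation that compares raw code points, and `str.casefold` stands in for a case-insensitive collation. The sample words are made up for illustration.

```python
# Hypothetical sample data; the same characters are stored either way.
words = ["banana", "apple", "Cherry"]

# Binary-style ordering: compare raw code points.
# "C" (U+0043) sorts before "a" (U+0061), so "Cherry" comes first.
binary_order = sorted(words)

# Case-insensitive ordering: fold case before comparing,
# roughly what a "_ci" collation does for basic Latin letters.
ci_order = sorted(words, key=str.casefold)

# Equality also depends on the comparison rule.
binary_equal = "Apple" == "apple"
ci_equal = "Apple".casefold() == "apple".casefold()

print(binary_order)  # ['Cherry', 'apple', 'banana']
print(ci_order)      # ['apple', 'banana', 'Cherry']
print(binary_equal, ci_equal)  # False True
```

The stored bytes are identical in both cases; only the comparison rule changes, which is why one character set can have several collations.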

By default, the character set and collation are set to utf8 and utf8_general_ci, respectively, across the cluster. You can override the default values, as explained in Specifying Character Set and Collation for Clusters.

Unicode Support

SingleStore supports the Unicode standard, covering both the characters in the Basic Multilingual Plane (BMP) and the supplementary characters that lie outside it. The 65,536 code points of the BMP, which range from U+0000 to U+FFFF, are encoded in a variable length of 1 to 3 bytes per character. The supplementary characters, whose code points range from U+10000 to U+10FFFF, are encoded in 4 bytes per character. With the 4-byte character encoding (utf8mb4), SingleStore supports all the characters in the BMP as well as the supplementary characters, which include many pictographic symbols (emojis), ancient scripts such as Egyptian hieroglyphs, and the supplementary private use areas (PUA).
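The byte lengths described above can be checked with a short Python sketch. It uses Python's built-in UTF-8 encoder rather than anything from SingleStore, and the helper names `utf8_length` and `needs_utf8mb4` are hypothetical, introduced only for this illustration. It assumes that a character set limited to 3 bytes per character can hold only BMP characters.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode a single code point."""
    return len(chr(code_point).encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Sample code points: three from the BMP, two supplementary.
samples = {
    "LATIN CAPITAL LETTER A": 0x0041,     # BMP, 1 byte
    "LATIN SMALL LETTER E ACUTE": 0x00E9, # BMP, 2 bytes
    "EURO SIGN": 0x20AC,                  # BMP, 3 bytes
    "EGYPTIAN HIEROGLYPH A001": 0x13000,  # supplementary, 4 bytes
    "GRINNING FACE": 0x1F600,             # supplementary, 4 bytes
}

for name, cp in samples.items():
    print(f"U+{cp:04X} {name}: {utf8_length(cp)} byte(s)")

print(needs_utf8mb4("caf\u00e9"))         # False: BMP characters only
print(needs_utf8mb4("hi \U0001F600"))     # True: contains a supplementary character
```

A check like `needs_utf8mb4` can be useful before loading data, to spot values that would not fit a 3-byte character set.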
