Character Encoding

Working with Character Sets and Collations

A character set is a collection of symbols and their encodings. A collation defines the rules for comparing and sorting the characters in a character set. SingleStore supports several character sets, and each character set can have multiple collations.
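To illustrate the difference between a character set and a collation, the following Python sketch sorts the same strings under two different comparison rules. It is only an analogy: the byte-wise and case-folded orderings here are stand-ins for a binary and a case-insensitive collation, not SingleStore's actual collation implementations, and the sample words are hypothetical.

```python
# The same characters (one "character set"), ordered by two different rules.
words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Binary-style rule: compare the raw UTF-8 bytes.
# Uppercase ASCII letters have smaller byte values, so they sort first.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Case-insensitive-style rule: compare case-folded text.
# Strings that differ only in case compare as equal and keep their input order.
ci_order = sorted(words, key=str.casefold)

print(binary_order)  # ['Apple', 'Banana', 'apple', 'banana', 'cherry']
print(ci_order)      # ['Apple', 'apple', 'banana', 'Banana', 'cherry']
```

The stored bytes are identical in both cases. Only the comparison rule changes, which is what choosing a different collation for the same character set does.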

Unicode Support

SingleStore supports the Unicode standard, including both the characters in the Basic Multilingual Plane (BMP) and the supplementary characters that lie outside it. The BMP covers the 65,536 code points from U+0000 to U+FFFF, and its characters are encoded in a variable length of 1 to 3 bytes per character. The supplementary characters, whose code points range from U+10000 to U+10FFFF, are encoded in 4 bytes per character. With the 4-byte character encoding (utf8mb4), SingleStore supports all BMP and supplementary characters. The supplementary range includes pictographic symbols (emojis), ancient scripts such as Egyptian hieroglyphs, and the supplementary private use areas (PUA).
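The byte lengths described above follow from UTF-8 itself and can be checked outside the database. The sketch below uses Python's built-in UTF-8 codec. The helper name `utf8_length` and the sample characters are chosen for illustration, and the code does not query SingleStore.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes the UTF-8 codec uses for a single character."""
    return len(ch.encode("utf-8"))


# Sample characters, written as escapes so the code points are unambiguous.
samples = [
    ("\u0041", "BMP, Latin capital letter A"),
    ("\u00e9", "BMP, Latin small letter e with acute"),
    ("\u20ac", "BMP, euro sign"),
    ("\U0001F600", "supplementary, grinning face emoji"),
    ("\U00013000", "supplementary, Egyptian hieroglyph"),
]

for ch, note in samples:
    # BMP characters need 1 to 3 bytes; supplementary characters need 4.
    print(f"U+{ord(ch):04X} ({note}): {utf8_length(ch)} byte(s)")
```

The three BMP samples report 1, 2, and 3 bytes, while both supplementary samples report 4 bytes, which is why a 4-byte encoding such as utf8mb4 is needed to store them.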