Full Text VERSION 2 Custom Analyzers

SingleStore supports custom analyzers for full-text VERSION 2 search. Users can specify an analyzer to get a customized full-text search experience. An analyzer contains three components: a tokenizer, character filters, and token filters. An analyzer must have exactly one tokenizer; it can have zero or more character filters and zero or more token filters.

A tokenizer takes a stream of characters and breaks that stream into individual tokens, for example split on whitespace characters. A character filter takes the stream of text data and transforms it in a pre-defined way, for example, removing all HTML tags. Token filters receive a stream of tokens and may add, change, or remove tokens, for example, lowercase all tokens or remove stop words.

Users can choose from a list of pre-configured analyzers and use them without any modifications. Users can also create their own analyzers by specifying a tokenizer, character filters, and token filters to obtain a fully customized search experience.

Refer to Working with Full-Text Search for more information on full-text search.

Specify an Analyzer

Specify an analyzer by passing an analyzer configuration in JSON format to INDEX_OPTIONS, which is a JSON string that contains the index configuration. In this JSON, the analyzer key is a string or a nested JSON value.

Specify the name of a built-in analyzer (e.g.: standard, cjk, etc.) as a string.
Specify a customized built-in analyzer or a custom analyzer as a nested JSON value.

The three examples below show a built-in analyzer with no customizations, a built-in analyzer with a customized set of stop words, and a custom analyzer. A full set of examples can be found in Examples. Refer to Analyzers for details on analyzers.

Specify the built-in analyzer for Chinese, Japanese, and Korean characters, called the cjk analyzer, with no customizations.

SQL

CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{ "analyzer": "cjk"}'
);

Specify the built-in cjk analyzer with a customized set of stop words. Built-in analyzers can be customized with custom stop word lists; no other customizations for built-in analyzers are supported.

SQL

CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS 
        '{"analyzer": { 
               "cjk": {
                    "stopset": [
                           "这",
            	           "那"
                     ]
                }
          }
        }'
);

Note

In addition to the cjk analyzer, the Korean nori analyzer is also supported.

Specify a custom analyzer, which uses the whitespace tokenizer, the html_strip character filter, and the lower_case token filter. The analyzer name must be custom. Additional character and token filters can be specified by adding additional char_filters and token_filters key pairs.

SQL

CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{
	     "analyzer": {
	         "custom": {
	              "tokenizer": "whitespace",
	              "char_filters": ["html_strip"],
  	              "token_filters": ["lower_case"],
                 }
	     }
         }'
);

Analyzers

There are two types of analyzers: built-in analyzers and custom analyzers.

Note

The examples in this section show only the INDEX_OPTIONS string (JSON) and omit the rest of the index creation command.

Built-in Analyzers

Built-in analyzers are pre-configured analyzers including the standard analyzer and language-specific analyzers and do not require configuration. Built-in analyzers may be customized with custom stop-word lists.

The default analyzer is the Apache Lucene standard analyzer, which uses the Apache Lucene standard tokenizer, lowercase token filters, and no stop words.

Specify a built-in analyzer, without customizations, by specifying the name of the analyzer as the value of the analyzer key.

The following example specifies the use of the spanish language analyzer.

SQL

INDEX_OPTIONS '{"analyzer" : "spanish"}'

A custom stop word list can be specified for a built-in analyzer by specifying a stopset in the JSON as shown in the following example. A custom stop word list is the only customization supported for built-in analyzers.

The following example specifies a custom stop word list for the standard analyzer.

The value of the analyzer key is a nested JSON value consisting of a key-value pair with key being the name of the analyzer (spanish in this example), and the value being another key-value pair consisting of the key stopset, and the value a JSON array of stop words.

SQL

INDEX_OPTIONS '{
	"analyzer": {
    	     "spanish": {
        	     "stopset": [
            	     "el",
            	     "la"
        	     ]
    	     }
	}
}'

SingleStore recommends using the default language analyzer, without stop word customization, in most cases, e.g. '{"analyzer" : "catalan"}'.

Refer to Supported Language Analyzers for links to the default list of stop words for each analyzer.

Custom Analyzers

Create a custom analyzer by using the analyzer name custom and by specifying a tokenizer and optional token and character filters.

A custom analyzer must specify:

A required tokenizer - A tokenizer breaks up incoming text into tokens. In many cases, an analyzer will use a tokenizer as the first step in the analysis process. However, to modify text prior to tokenization, use char_filters (see below).
An optional array of token_filters: A token_filter modifies tokens that have been created by the tokenizer. Common modifications performed by a token_filter are deletion, stemming, and case folding.
An optional array of char_filters: A char_filter transforms the text before it is tokenized, while providing corrected character offsets to account for these modifications.

The example below shows the use of all three components, tokenizer, char_filters, and token_filters.

SQL

INDEX_OPTIONS '{
     "analyzer" : {
          "custom": {
	       "tokenizer": "whitespace",
	       "char_filters": ["html_strip"],
  	       "token_filters": ["lower_case"],
          }
     }
}'

Each of these three components (tokenizer, char_filters, token_filters) can be specified as a string with the name of the component or as a nested JSON with a configuration for the component.

The example below specifies a custom analyzer that uses the whitespace tokenizer, with a maximum length of 256 characters.

SQL

INDEX_OPTIONS '{
     "analyzer": {
          "custom": {
               "tokenizer": {
            	     "whitespace": {
                      "maxTokenLen": 256
            	     }
        	}
    	  }
     }
}'

Common Tokenizers

Common tokenizers that are supported are listed in the table below. Refer to Supported Tokenizers for a full list of supported tokenizers.

"tokenizer" (case-sensitive)	Parameters	Description
`whitespace`	`rule` (Optional, string). Defaults to `"unicode"`. `maxTokenLen` (Optional, integer). Defaults to `256`. WhitespaceTokenizerFactory	Divides text at whitespace characters as defined by Character.isWhitespace(int). This definition excludes non-breaking spaces from whitespace characters. WhitespaceTokenizer
`standard`	`maxTokenLength` (Optional, integer). Defaults to 255. StandardTokenizerFactory	Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. StandardTokenizer (Lucene 6.6.0 API)
`n_gram`	`minGramSize` (Optional, integer). Defaults to `1`. `maxGramSize` (Optional, integer). Defaults to `2`. NGramTokenizerFactory	Tokenizes the input into n-grams of the specified size(s). NGramTokenizer
`uax_url_email`	`maxTokenLength` (Optional, integer). Defaults to `255`. UAX29URLEmailTokenizerFactory	Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. URLs and email addresses are also tokenized. UAX29URLEmailTokenizer

"tokenizer" (case-sensitive)

Parameters

Description

whitespace

rule (Optional, string). Defaults to "unicode".

maxTokenLen (Optional, integer). Defaults to 256.

WhitespaceTokenizerFactory

Divides text at whitespace characters as defined by Character.isWhitespace(int). This definition excludes non-breaking spaces from whitespace characters.

WhitespaceTokenizer

standard

maxTokenLength (Optional, integer). Defaults to 255.

StandardTokenizerFactory

Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29.

StandardTokenizer (Lucene 6.6.0 API)

n_gram

minGramSize (Optional, integer). Defaults to 1.

maxGramSize (Optional, integer). Defaults to 2.

NGramTokenizerFactory

Tokenizes the input into n-grams of the specified size(s).

NGramTokenizer

uax_url_email

maxTokenLength (Optional, integer). Defaults to 255.

UAX29URLEmailTokenizerFactory

Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. URLs and email addresses are also tokenized.

UAX29URLEmailTokenizer

Common Token Filters

Common token filters that are supported are listed in the table below. Refer to Supported Token Filters for a full list of supported token filters.

"token_filters" (Case-Sensitive)	Parameters	Description
`lower_case`	No parameters. LowerCaseFilterFactory	Normalizes token text to lower case. LowerCaseFilter
`snowball_porter`	`protected` (Optional, string). Defaults to `"protectedkeyword.txt"`. `language` (Optional, string). Defaults to `"English"`. SnowballPorterFilterFactory	Stems words using a Snowball-generated stemmer. Available stemmers are listed in `org.tartarus.snowball.ext`. SnowballFilter
`n_gram`	`minGramSize` (Optional, integer). Defaults to `1`. `maxGramSize` (Optional, integer). Defaults to `2`. `preserveOriginal` (Optional, boolean). Defaults to `"true"`. NGramFilterFactory	Tokenizes the input into n-grams of the given size(s). NGramTokenFilter
`stop`	`words` (Optional, array of stop words) `ignoreCase` (Optional, boolean). If true, all words are lower-cased first. Defaults to `false`.	Custom stop words token filter. Removes stop words from a token stream.

"token_filters" (Case-Sensitive)

Parameters

Description

shingle

minShingleSize (Optional, integer). Defaults to 2.

maxShingleSize (Optional, integer). Defaults to 2.

ShingleFilterFactory

Constructs shingles (token n-grams), that is it creates combinations of tokens as a single token.

ShingleFilter

lower_case

No parameters.

LowerCaseFilterFactory

Normalizes token text to lower case.

LowerCaseFilter

snowball_porter

protected (Optional, string). Defaults to "protectedkeyword.txt".

language (Optional, string). Defaults to "English".

SnowballPorterFilterFactory

Stems words using a Snowball-generated stemmer. Available stemmers are listed in org.tartarus.snowball.ext.

SnowballFilter

n_gram

minGramSize (Optional, integer). Defaults to 1.

maxGramSize (Optional, integer). Defaults to 2.

preserveOriginal (Optional, boolean). Defaults to "true".

NGramFilterFactory

Tokenizes the input into n-grams of the given size(s).

NGramTokenFilter

stop

words (Optional, array of stop words)

ignoreCase (Optional, boolean). If true, all words are lower-cased first. Defaults to false.

Custom stop words token filter.

Removes stop words from a token stream.

Custom Stop Words

SingleStore provides a custom token filter named stop which allows a set of custom stop words to be specified. The stop token filter works with any custom analyzer.

The stop token filter has two parameters words and ignoreCase.

words: An optional parameter containing a list of stop words. The list of stop words must be specified as a JSON array.
ignoreCase: An optional boolean parameter indicating if case should be ignored. If set to true, all words are lower-cased before tokenization. Defaults to false.

Sample syntax for this custom token filter is as follows. Refer to Example 11: Standard Tokenizer with Custom Stop Words for additional examples of using the stop token filter.

SQL

  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS 
   '{"analyzer" : 
      {"custom" : 
        {"tokenizer" : "standard", 
          token_filters: [{"stop": 
                            {"ignoreCase": false,
                             "words": ["the"]}}
                         ]}
     }}');

Common Character Filters

Common character filters that are supported are listed in the table below. Refer to Supported Character Filters for a full list of supported character filters.

"char_filters" (case-sensitive)	Parameters (includes Lucene Link)	Description (includes Lucene Link)
`html_strip`	`escapedTags` (Optional, string). Defaults to `"a, title"`. HTMLStripCharFilterFactory	Wraps another Reader and attempts to strip out HTML. HTMLStripCharFilter

"char_filters" (case-sensitive)

Parameters (includes Lucene Link)

Description (includes Lucene Link)

html_strip

escapedTags (Optional, string). Defaults to "a, title".

HTMLStripCharFilterFactory

Wraps another Reader and attempts to strip out HTML.

HTMLStripCharFilter

Custom Column Mappings

Note

This feature is an opt-in preview.

Opt-in previews allow you to evaluate and provide feedback on new and upcoming features prior to their general availability.

Custom column mappings allow columns in a table and fields in a JSON document to be indexed using custom analyzers. This functionality allows you to index columns or fields in different languages with analyzers specific to each language.

For the most meaningful search results, it is important to use an analyzer specific to a language. Language-specific analyzers allow the use of language-aware stop lists, stemming, and word breaking. Applying a generic analyzer to text from multiple languages may result in lower-quality search results.

Use Per-Column Analyzers

Per-column analyzers are defined using INDEX_OPTIONS in the table or index creation command. You can specify a custom analyzer for each column in a table and for keypaths in JSON and BSON columns.

The following command creates a table with a full-text index that indexes the french_content column with the french analyzer, the english_content column with the english analyzer, and all other columns (title) with the standard analyzer.

SQL

CREATE TABLE custom_col_analyzers (
    title VARCHAR(200),
    french_content VARCHAR(200),
    english_content VARCHAR(200),
    FULLTEXT USING VERSION 2 (title, french_content, english_content)
    INDEX_OPTIONS
        '{
            "analyzer": "standard",
            "mappings": {
                "french_content": {
                    "analyzer": "french"
                 },
                 "english_content": {
                     "analyzer": "english"
                 }
             }
         }'
);

JSON and BSON Keypath Analyzers

Analyzers for keypaths in JSON and BSON columns are defined using INDEX_OPTIONS in the table or index creation command.

The following command specifies that the json_column$english_content field be indexed using the english analyzer, and the json_column$french_content field be indexed using the french analyzer. The mappings object, highlighted in bold, provides this specification.

SQL

CREATE TABLE json_keypath_analyzers (
    json_column JSON,
    FULLTEXT USING VERSION 2 KEY(json_column)
    INDEX_OPTIONS '{
        "mappings": {
            "json_column$english_content": {
                "analyzer": "english"
             },
            "json_column$french_content": {
                 "analyzer": "french"
             }
        }
    }'
);

Example of Using Parameters

Specify a default uax_url_email tokenizer:

SQL

INDEX_OPTIONS '{
     "analyzer": {
           "custom": {
        	"tokenizer": "uax_url_email"
    	   }
     }
}'

Specify a uax_url_email tokenizer with custom parameters:

SQL

INDEX_OPTIONS '{
     "analyzer": {
    	  "custom": {
               "tokenizer": {
                    "uax_url_email" : {
	                 "maxTokenLength": 300
                    }
                }
    	  }
     }
}'

Stemming

Stemming transforms words to their root, often by removing suffixes and prefixes. In English the words "dressing" and "dressed", can be stemmed to "dress". Thus, searches for one form of a verb (e.g. "dressing") can return documents containing other forms of the verb (e.g. "dressed" or "dress"). Stemming is language specific. Stemming is handled in JLucene using built-in language analyzers or can be customized using token filters.

NGrams

NGram tokenizers split words into small pieces and are good for fast "fuzzy-style" matching using a full-text index. The minimum and maximum gram length is customizable. Refer to Example 7: Custom Analyzer and N Gram Tokenizer for an example of using a ngram tokenizer.

Examples

Example 1: Custom Analyzer with Whitespace tokenizer

Use a custom analyzer and a whitespace tokenizer to search for text with a hyphen in queries.

Create a table, insert data, and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE medium_articles (
   title VARCHAR(200),
   summary TEXT,
   FULLTEXT USING VERSION 2 (summary) INDEX_OPTIONS
   '{
       "analyzer": {
           "custom": {
               "tokenizer": "whitespace"
           }
       }
   }'
);

INSERT INTO medium_articles (title, summary) VALUES
('Build Real-Time Multimodal RAG Applications Using SingleStore!','This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses.'),
('Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case','This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency.'),
('Scaling RAG from POC to Production','This guide explains Retrieval-Augmented Generation (RAG) for building reliable, context-aware applications using large language models (LLMs) and emphasizes the importance of scaling from proof of concept to production.'),
('Tech Stack For Production-Ready LLM Applications In 2024','This guide reviews preferred tools for the entire LLM app development lifecycle, emphasizing simplicity and ease of use in building scalable AI applications.'),
('LangGraph + Gemini Pro + Custom Tool + Streamlit = Multi-Agent Application Development','This guide teaches you to create a chatbot using LangGraph and Streamlit, leveraging LangChain for building stateful multi-actor applications that respond to user support requests.');

OPTIMIZE TABLE medium_articles FLUSH;

Observe the difference between the results of the two search queries below.

SQL

SELECT * 
FROM medium_articles 
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multimodal");

+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                          | summary                                                                                                                                                                        |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Build Real-Time Multimodal RAG Applications Using SingleStore! | This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses. |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

SQL

SELECT * 
FROM medium_articles 
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multi-modal");

+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                                    | summary                                                                                                                                                                     |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case | This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency. |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Example 2: Custom Analyzer with Whitespace tokenizer, html_strip as char_filters, and lower_case as token_filters

Use a custom analyzer, a whitespace tokenizer, html_strip as a character filter, and lower_case as a token filter to search for HTML entities in queries.

A character filter receives the original text data and converts it into a predefined format. A token filter receives a stream of tokens and can add, change, or remove tokens as needed.

In this example, html_strip as a character filter removes HTML tags and lower_case as a token filter lowercases the tokens.

Create a table, insert data, and optimize the table to ensure all data is included in results.

Search for HTML entities in queries.

SQL

CREATE TABLE html_table (
   title VARCHAR(200),
   content VARCHAR(200),
   FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
  '{
     "analyzer": {
         "custom": {"char_filters": ["html_strip"],
                    "tokenizer": "whitespace",
                    "token_filters":["lower_case"]
                    }
     }
  }'
);

INSERT INTO html_table (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');

OPTIMIZE TABLE html_table FLUSH;

Search for HTML entity, and observe the result of the search query.

SQL

SELECT * 
FROM html_table 
WHERE match(TABLE html_table) AGAINST("content:we're");

+---------------+----------------------------------------------------------------+
| title         | content                                                        |
+---------------+----------------------------------------------------------------+
| Success Story | Our team has achieved great things &amp; we&apos;re proud!</p> |
| Exciting News | We&apos;re thrilled to announce our new project!</p>           |
| Future Goals  | We&apos;re looking forward to achieving even more!</p>         |
+---------------+----------------------------------------------------------------+

Example 3: Custom Analyzer with standard tokenizer and cjk_width as token_filter

Use a custom analyzer, a standard tokenizer, and cjk_width as a token filter to search for a Japanese text in queries.

In this example, cjk_width as a token filter normalizes the width differences in CJK (Chinese, Japanese, and Korean) characters.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE japanese_novels (
 title VARCHAR(200),
 content VARCHAR(200),
 FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
'{
   "analyzer": {
       "custom": {"tokenizer": "standard",
                  "token_filters":["cjk_width"]
                  }
   }
}'
);

INSERT INTO japanese_novels (title, content) VALUES
('ノルウェイの森', '村上春樹の代表作で、愛と喪失をテーマにしています。'),
('吾輩は猫である', '夏目漱石による作品で、猫の視点から人間社会を描いています。'),
('雪国', '川端康成の作品で、美しい雪景色と切ない恋を描いています。'),
('千と千尋の神隠し', '宮崎駿の作品で、少女が異世界で成長する物語です。'),
('コンビニ人間', '村田沙耶香の作品で、現代社会の孤独と適応を描いています。');

OPTIMIZE TABLE japanese_novels FLUSH;

Observe the result of the search query for the Japanese text below.

SQL

SELECT * 
FROM japanese_novels 
WHERE MATCH(TABLE japanese_novels) AGAINST("content: 夏");

+-----------------------+-----------------------------------------------------------------------------------------+
| title                 | content                                                                                 |
+-----------------------+-----------------------------------------------------------------------------------------+
| 吾輩は猫である        | 夏目漱石による作品で、猫の視点から人間社会を描いています。                              |
+-----------------------+-----------------------------------------------------------------------------------------+

Example 4: Korean (nori) Analyzer

Use the korean analyzer to search for Korean text in queries. This analyzer is also known as the nori analyzer.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE k_drama (
 genre VARCHAR(200),
 movie_name TEXT,
 cast TEXT,
 FULLTEXT USING VERSION 2 (genre) INDEX_OPTIONS
 '{
     "analyzer": "korean"
 }'
);

INSERT INTO k_drama (genre, movie_name, cast)
VALUES
 ('로맨스', '사랑의 불시착', '현빈, 손예진'),
 ('액션, 스릴러', '빈센조', '송중기, 전여빈'),
 ('드라마, 로맨스', '도깨비', '공유, 김고은'),
 ('사극, 드라마', '미스터 션샤인', '이병헌, 김태리'),
 ('코미디, 로맨스', '김비서가 왜 그럴까', '박서준, 박민영');

OPTIMIZE TABLE k_drama FLUSH;

Observe the result of the search query for the Korean text below.

SQL

SELECT * FROM k_drama WHERE match(TABLE k_drama) AGAINST("genre:로맨스");

+----------------------+----------------------------+----------------------+
| genre                | movie_name                 | cast                 |
+----------------------+----------------------------+----------------------+
| 드라마, 로맨스          | 도깨비                       | 공유, 김고은            |
| 코미디, 로맨스          | 김비서가 왜 그럴까              | 박서준, 박민영          |
| 로맨스                | 사랑의 불시착                  | 현빈, 손예진            |
+----------------------+----------------------------+----------------------+

Example 5: Korean (nori) Analyzer with User Dictionary

Use the korean analyzer (also known as the nori analyzer) with and without a user dictionary to search for a Korean text in queries. This example demonstrates how a user dictionary can be used to add words, specifically compound words, to the dictionary used by the korean analyzer. Note that the user dictionary is augmentative, meaning that it adds the specified words to the existing dictionary, it does not replace the existing dictionary.

Create two tables with a full text index with the korean analyzer, one with and one without a user dictionary. The Korean compound word 수영장, which translates to "swimming pool" in English is inserted in the user dictionary.

SQL

CREATE TABLE korean_user_dict
    (id INT, 
     phrase VARCHAR(400), 
     FULLTEXT USING VERSION 2 (phrase) INDEX_OPTIONS 
       '{"analyzer" : 
         {"custom": 
          {"tokenizer": 
           {"korean": {"userDictionary": ["수영장"]}
           }
         }
        } 
}');

SQL

CREATE TABLE korean
     (id INT,
      phrase varchar(400), 
      FULLTEXT USING VERSION 2 (phrase) INDEX_OPTIONS
        '{
           "analyzer": "korean"
         }'
);

Insert data into the tables and optimize them to ensure all data is included in results.

SQL

INSERT INTO korean_user_dict 
    VALUES (1, "수영장"), (2, "수영"), (3, "장");
OPTIMIZE TABLE korean_user_dict FLUSH;

SQL

INSERT INTO korean 
    VALUES (1, "수영장"), (2, "수영"), (3, "장");
OPTIMIZE TABLE korean FLUSH;

When 수영장 is inserted into the korean_user_dict table, because 수영장 is in the user dictionary, 수영장 is tokenized as a single, atomic token and will not be further tokenized.

In contrast, when 수영장 is inserted into the korean table, 수영장 is tokenized into 수영 and 장. By default, compound words are decomposed and the original form is discarded (decompoundMode is discard by default).

Example 5a - Search for 수영장 with and without User Dictionary

The following queries search for the compound word 수영장 in both the korean_user_dict and korean tables. This and the following example demonstrate searching with and without a user dictionary.

The ORDER BY clause is included to ensure consistent ordering of results.

SQL

SELECT id, phrase, BM25(korean_user_dict, 'phrase:(수영장)') AS score 
FROM korean_user_dict 
ORDER BY id;

+------+-----------------+--------------------+
| id   | text            | score              |
+------+-----------------+--------------------+
|    1 | 수영장            | 0.4458314776420593 |
|    2 | 수영              |                  0 |
|    3 | 장               |                  0 |
+------+-----------------+--------------------+

SQL

SELECT id, phrase, BM25(korean, 'phrase:(수영장)') AS SCORE
FROM korean 
ORDER BY id;

+------+-----------------+---------------------+
| id   | text            | score               |
+------+-----------------+---------------------+
|    1 | 수영장            | 0.35471969842910767 |
|    2 | 수영             | 0.23797652125358582 |
|    3 | 장               | 0.23797652125358582 |
+------+-----------------+---------------------+

When searching the korean_user_dict table, 수영장 matches only 수영장 and not 수영 or 장 because 수영장 is tokenized as a single, atomic token.

In contrast, when searching the korean table, 수영장 matches 수영장, 수영, and 장 because 수영장 is tokenized as two tokens: 수영, and 장, hence 수영장 partially matches all three rows in the table.

Finally, the score for 수영장 is higher when searching the table korean_user_dict than when searching the table korean because 수영장 matches only a single row in korean_user_dict.

Example 5b - Search for 수영 with and without User Dictionary

The following queries search for the word 수영 in the tables with and without user dictionary. The ORDER BY clause is included to ensure consistent ordering of results.

SQL

SELECT id, phrase, BM25(korean_user_dict, 'phrase:(수영)') AS SCORE 
FROM korean_user_dict ORDER BY id;

+------+-----------------+--------------------+
| id   | text            | SCORE              |
+------+-----------------+--------------------+
|    1 | 수영장            |                  0 |
|    2 | 수영             | 0.4458314776420593 |
|    3 | 장               |                  0 |
+------+-----------------+--------------------+

SQL

SELECT id, phrase, BM25(korean, 'phrase:(수영)') AS SCORE
FROM korean ORDER BY id;

+------+-----------------+---------------------+
| id   | text            | SCORE               |
+------+-----------------+---------------------+
|    1 | 수영장            | 0.17735984921455383 |
|    2 | 수영             |  0.23797652125358582 |
|    3 | 장               |                   0 |
+------+-----------------+---------------------+

When searching the korean_user_dict table, 수영 matches only itself, because, as described earlier, 수영장 has been inserted in the user dictionary and 수영장 is tokenized as a single token, so it does not match 수영.

In contrast, when searching the korean table, 수영 matches 수영장 and 수영 because 수영장 is tokenized as two tokens: 수영, and 장.

In addition, the score for 수영 is higher when searching the table korean_user_dict than when searching the table korean because 수영 matches only a single row in korean_user_dict.

Example 6: Custom Analyzer, standard tokenizer, Italian language, snowball_porter stemmer, elision filter

Use a custom analyzer, a standard tokenizer, elision and snowball_porter as token filters for the Italian language to search for an Italian text.

In this example, elision as a token filter removes specific elisions from the input token. Using snowball_porter as a token filter stems the words using the Lucene Snowball stemmer tokenization. The snowball_porter token filter requires a language parameter to control the stemmer.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE italian_architecture (
   architecture VARCHAR(400),
   description VARCHAR(400),
   SORT KEY (architecture),
   FULLTEXT USING VERSION 2 KEY(description)
   INDEX_OPTIONS '{"analyzer" : 
         {"custom" : {"tokenizer" : "standard", 
                      "token_filters": ["elision", 
                                       {"snowball_porter" : {"language": "Italian"}}]}}}}'
);

INSERT INTO italian_architecture (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');

OPTIMIZE TABLE italian_architecture FLUSH;

Observe the result of the search query for the Italian text below.

SQL

SELECT * 
FROM italian_architecture 
WHERE MATCH(TABLE italian_architecture) AGAINST("description:l’architettura");

+-----------------+--------------------------------------------------------------------------------------------+
| architecture    | description                                                                                |
+-----------------+--------------------------------------------------------------------------------------------+
| Duomo di Milano | L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.   |
| Palazzo Ducale  | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.        |
+-----------------+--------------------------------------------------------------------------------------------+

Use a custom analyzer, a standard tokenizer, snowball_porter as token filters for the Italian language without elision token filter to search for an Italian text in queries.

Create a second table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE italian_architecture_2 (
   architecture VARCHAR(400),
   description VARCHAR(400),
   SORT KEY (architecture),
   FULLTEXT USING VERSION 2 KEY(description)
   INDEX_OPTIONS '{"analyzer" : 
         {"custom" : {"tokenizer" : "standard", 
                      "token_filters": {"snowball_porter" : {"language": "Italian"}}}}}}'
);

INSERT INTO italian_architecture_2 (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');

OPTIMIZE TABLE italian_architecture_2 FLUSH;

Observe the result of the search query for the Italian text without elision token filter below.

SQL

SELECT * 
FROM italian_architecture_2 
WHERE MATCH(TABLE italian_architecture_2) AGAINST("description:l’architettura");

+----------------+---------------------------------------------------------------------------------------+
| architecture   | description                                                                           |
+----------------+---------------------------------------------------------------------------------------+
| Palazzo Ducale | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.   |
+----------------+---------------------------------------------------------------------------------------+

Example 7: Custom Analyzer and N Gram Tokenizer

Use a custom analyzer and a n_gram tokenizer to search for misspelled text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE university
   (name VARCHAR(400),
   admission_page VARCHAR(400),
   SORT KEY (name),
   FULLTEXT USING VERSION 2 KEY(admission_page)
   INDEX_OPTIONS '{"analyzer" : {"custom" : {"tokenizer" : "n_gram"}}}}');

INSERT INTO university (name, admission_page) VALUES
('Harvard University', 'college.harvard.edu/admissions'),
('Stanford University', 'stanford.edu/admission/'),
('Massachusetts Institute of Technology (MIT)', 'mitadmissions.org/'),
('California Institute of Technology (Caltech)', 'admissions.caltech.edu/'),
('University of Chicago', 'uchicago.edu/en/admissions');

OPTIMIZE TABLE university FLUSH;

Observe the result of the search query for the misspelled text and compare the search result with the score below.

SQL

SELECT name,admission_page, MATCH(TABLE university) AGAINST("admission_page:cattec") AS score
FROM university
WHERE score
ORDER BY score DESC;

+----------------------------------------------+--------------------------------+---------------------+
| name                                         | admission_page                 | score               |
+----------------------------------------------+--------------------------------+---------------------+
| California Institute of Technology (Caltech) | admissions.caltech.edu/        |  2.4422175884246826 |
| University of Chicago                        | uchicago.edu/en/admissions     |  0.8550153970718384 |
| Harvard University                           | college.harvard.edu/admissions |  0.6825864911079407 |
| Stanford University                          | stanford.edu/admission/        |  0.5768249034881592 |
| Massachusetts Institute of Technology (MIT)  | mitadmissions.org/             | 0.26201900839805603 |
+----------------------------------------------+--------------------------------+---------------------+

Example 8: Custom Analyzer with N-Gram Tokenizer, html_strip as char_filters, and lower_case as token_filters

Use a custom analyzer, n_gram tokenizer, html_strip as character filter, and lower_case as token filter to search for HTML entities in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE html_table_n_gram (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
 '{
    "analyzer": {
        "custom": {"char_filters": ["html_strip"],
                   "tokenizer": "n_gram",
                   "token_filters":["lower_case"]
                   }
    }
 }'
);

INSERT INTO html_table_n_gram (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');

OPTIMIZE TABLE html_table_n_gram FLUSH;

Observe the result of the search query for the misspelled HTML entity and compare the search result with the score below.

SQL

SELECT title,content, MATCH(TABLE html_table_n_gram) AGAINST("content:I',") AS score
FROM html_table_n_gram
WHERE score
ORDER BY score DESC;

+------------------+--------------------------------------------------------------------+---------------------+
| title            | content                                                            | score               |
+------------------+--------------------------------------------------------------------+---------------------+
| Learning Journey | Learning is a never-ending journey &amp; I&apos;m excited!</p>     |  0.5430432558059692 |
| Success Story    | Our team has achieved great things &amp; we&apos;re proud!</p>     | 0.31375283002853394 |
| Exciting News    | We&apos;re thrilled to announce our new project!</p>               | 0.26527124643325806 |
| Future Goals     | We&apos;re looking forward to achieving even more!</p>             |  0.2177681028842926 |
| Grateful Heart   | Thank you for being a part of our journey &amp; supporting us!</p> |  0.1819886565208435 |
+------------------+--------------------------------------------------------------------+---------------------+

Example 9: Portuguese Analyzer with score

Use a portuguese analyzer to search for a Portuguese text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE portuguese_news (
  headline VARCHAR(200),
  content TEXT,
  FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
  '{
      "analyzer": "portuguese"
  }'
);

INSERT INTO portuguese_news (headline, content) VALUES
('Cenário Econômico Brasileiro', 'O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.'),
('Mercado de Ações em Alta', 'As ações brasileiras registraram ganhos significativos, impulsionadas por resultados financeiros positivos de grandes empresas.'),
('Nova Política Monetária do Banco Central', 'O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.'),
('Investimentos Estrangeiros no Brasil', 'O país atraiu um aumento de investimentos estrangeiros diretos, especialmente em setores de tecnologia e energia renovável.'),
('Tendências do Mercado Imobiliário', 'O mercado imobiliário brasileiro mostra sinais de recuperação, com aumento nas vendas de imóveis e novos lançamentos.');

OPTIMIZE TABLE portuguese_news FLUSH;

Observe the result of the search query for the Portuguese text and compare the search result with the score below.

SQL

SELECT content, MATCH(TABLE portuguese_news) AGAINST ("content:Brasil") AS score
FROM portuguese_news
WHERE score
ORDER BY score DESC;

+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| content                                                                                                                             | score               |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.                                  | 0.22189012169837952 |
| O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.       |  0.2059776782989502 |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+

Example 10: Spanish Analyzer with custom stop words

Use a spanish analyzer with custom stop words to search for a Spanish text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE spanish_news (
 headline VARCHAR(200),
 content TEXT,
 FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
 '{
     "analyzer": {"spanish": {"stopset": ["descubrimiento", "tratamiento", "nuevo"]}}
 }'
);

INSERT INTO spanish_news (headline, content) VALUES
('Descubrimiento de un nuevo tratamiento para la diabetes', 'Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.'),
('Avances en la detección temprana del cáncer', 'Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.'),
('Nuevo enfoque para tratar enfermedades cardíacas', 'Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.'),
('Investigación sobre un gen relacionado con el Alzheimer', 'Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.'),
('Desarrollo de una vacuna contra COVID-19', 'Un equipo de investigadores ha anunciado resultados prometedores en la efectividad de una nueva vacuna contra COVID-19.');

OPTIMIZE TABLE spanish_news FLUSH;

Observe the results of two search queries below: one for the defined Spanish stop word and another for the actual Spanish stop word. The defined stop words above overwrite the actual stop words.

SQL

SELECT * 
FROM spanish_news 
WHERE MATCH(TABLE spanish_news) AGAINST("content:nuevo");

Empty set

SQL

SELECT * 
FROM spanish_news 
WHERE MATCH(TABLE spanish_news) AGAINST("content:el");

+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| headline                                                 | content                                                                                                                                              |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Investigación sobre un gen relacionado con el Alzheimer  | Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.   |
| Avances en la detección temprana del cáncer              | Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.                             |
| Descubrimiento de un nuevo tratamiento para la diabetes  | Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.                         |
| Nuevo enfoque para tratar enfermedades cardíacas         | Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.                                         |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+

Example 11: Standard Tokenizer with Custom Stop Words

Use a standard tokenizer with custom stop words.

Create a procedure to insert data into a table named t. This procedure will be used to generate data for the examples in this section.

SQL

DELIMITER //

CREATE OR REPLACE PROCEDURE insert_flush()
AS
BEGIN
   INSERT INTO t VALUES( 1, "On April 23, 2013, SingleStore launched its first generally available version of the database to the public as MemSQL."),

   (2, "Early versions only supported row-oriented tables, and were highly optimized for cases where all data can fit within main memory."),

   (3, "This design was based on the idea that the cost of RAM would continue to decrease exponentially over time, in a trend similar to Moore's law."),

   (4, "This would eventually allow most use cases for database systems to store their data exclusively in memory not on disk."),

   (5, "Shortly after launch, MemSQL added general support for an on-disk column-based storage format to work alongside the in-memory rowstore."),

   (6, "The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads."),

   (7, "Thus, over time, MemSQL's columnstore became a major focus and a crucial feature for customers."),

   (8, "On October 27, 2020, MemSQL rebranded to SingleStore to reflect a shift in focus away from exclusively in-memory workloads."),

   (9, "The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases."),

   (10, "In its current product release, v.7.5, SingleStore became the first and only database to combine separation of storage and compute plus system of record into a single platform."),

   (11, "Headquartered in San Francisco, California, in June 2021 singlestore.com opened an office in Raleigh, North Carolina. As part of the office opening, SingleStore launched Launch Pad, a center for innovation to incubate and prototype solutions."),

   (12, "Its other offices include Sunnyvale, California, seattle@singlestore.com, Washington, and Lisbon, Portugal."),

   (13, "seattle@singlestore.com");

   OPTIMIZE TABLE t FLUSH;

END;//

DELIMITER ;

Example 11a: No Stop Words

Create a table without stop words and insert data into that table.

SQL

CREATE TABLE t ( 
  id INT,text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY(text) 
    INDEX_OPTIONS 
      '{"analyzer" : 
         {"custom" : {"tokenizer" : "standard"}}
       }');

CALL insert_flush();

The two queries below query the table for the words The and the, respectively. Since there are no stop words, results will be returned for both queries.

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

*** 1. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.5822426080703735
*** 2. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.4953887462615967
2 rows in set (0.09 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

*** 1. row ***
 text: Shortly after launch, MemSQL added general support for an on-disk column-based storage format to work alongside the in-memory rowstore.
SCORE: 0.5361358523368835
*** 2. row ***
 text: On April 23, 2013, SingleStore launched its first generally available version of the database to the public as MemSQL.
SCORE: 0.216837078332901
*** 3. row ***
 text: This design was based on the idea that the cost of RAM would continue to decrease exponentially over time, in a trend similar to Moore's law.
SCORE: 0.19964565336704254
*** 4. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.1568010002374649
*** 5. row ***
 text: In its current product release, v.7.5, SingleStore became the first and only database to combine separation of storage and compute plus system of record into a single platform.
SCORE: 0.13726447522640228
*** 6. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.1351594626903534
*** 7. row ***
 text: Headquartered in San Francisco, California, in June 2021 singlestore.com opened an office in Raleigh, North Carolina. As part of the office opening, SingleStore launched Launch Pad, a center for innovation to incubate and prototype solutions.
SCORE: 0.12553386390209198

Drop the table so that the tablename t can be used with the subsequent examples.

SQL

DROP TABLE t;

Example 11b: Default English Stop Words

Create a table using default English stop words and query that table for the words The and the. The words The and the are included in default English stop words. Thus, neither query returns results.

SQL

CREATE TABLE t (
  id INT, text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS 
   '{"analyzer" : 
      {"custom" : 
        {"tokenizer" : "standard", token_filters: ["stop"]}
     }}');

CALL insert_flush();

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

Empty set (0.01 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

Empty set (0.01 sec)

SQL

DROP TABLE t;

Example 11c: Custom English Stop Words

Create a table using custom English stop words and insert the word the as a stop word. And query the table for the words The and the. By default, when using the custom stop words token filter, case is ignored, so no results are returned for either query.

SQL

CREATE TABLE t (
  id INT, text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY (text) index_options 
   '{"analyzer" : 
      {"custom" : 
        {"tokenizer" : "standard", token_filters: [{"stop": {"words": ["the"]}}]}
     }}');

CALL insert_flush();

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

Empty set (0.06 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

Empty set (0.05 sec)

SQL

DROP TABLE t;

Example 11d: Custom English Stop Words - Do Not Ignore Case

In the following, the query is modified so case is not ignored. In this example, results are returned for the first query, but not the second.

SQL

CREATE TABLE t (
  id INT, text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS 
   '{"analyzer" : 
      {"custom" : 
        {"tokenizer" : "standard", 
          token_filters: [{"stop": {"ignoreCase": false,"words": ["the"]}}]}
     }}');
  
CALL insert_flush();

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

*** 1. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.7002022862434387
*** 2. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.6494264602661133
2 rows in set (0.06 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

Empty set (0.03 sec)

SQL

DROP TABLE t;

Example 12: Custom Column Mappings

Note

This feature is an opt-in preview.

Opt-in previews allow you to evaluate and provide feedback on new and upcoming features prior to their general availability.

Example 12a: Per-Column Analyzers

Create a table that uses the french analyzer for the french_content column, the english analyzer for the english_content column, and the standard analyzer for all other columns (title).

SQL

CREATE TABLE custom_col_analyzers (
    title VARCHAR(200),
    french_content VARCHAR(200),
    english_content VARCHAR(200),
    FULLTEXT USING VERSION 2 (title, french_content, english_content)
    INDEX_OPTIONS
        '{
            "analyzer": "standard",
            "mappings": {
                "french_content": {
                    "analyzer": "french"
                 },
                 "english_content": {
                     "analyzer": "english"
                 }
             }
         }'
);

Insert data into the table and optimize the table to ensure all the data is indexed.

SQL

INSERT INTO custom_col_analyzers 
VALUES 
    ("fast", "Nous sommes la base de données la plus rapide.", 
                        "We are the fastest database."),
    ("slow", "Cette base de données est un peu lente.", 
                       "This database is kind of slow."),
    ("slowest", "Cette base de données est la plus lente.", 
                       "This database is the slowest.");

OPTIMIZE TABLE custom_col_analyzers FLUSH;

The following query searches the columns using the per-column analyzers defined earlier and returns any row in which the french_content column contains rapide or the english_content column contains fastest.

SQL

SELECT title,
MATCH(TABLE custom_col_analyzers) 
AGAINST ('french_content:(rapide) OR english_content:(fastest)') AS score 
FROM custom_col_analyzers 
WHERE score > 0
ORDER BY score DESC;

+-------+--------------------+
| title | score              |
+-------+--------------------+
| fast  | 0.6301337480545044 |
+-------+--------------------+

In this query, the french_content column is searched for rapide using the index built with the french analyzer while the english_content column is searched for fast using the index built with the english analyzer.

The following query returns any row in which the french_content column contains rapide or the english_content column contains slowest.

SQL

SELECT title,
MATCH(TABLE custom_col_analyzers) 
AGAINST ('french_content:(rapide) OR english_content:(slowest)') AS score 
FROM custom_col_analyzers 
WHERE score > 0
ORDER BY score DESC;

+---------+---------------------+
| title   | score               |
+---------+---------------------+
| slowest | 0.49662238359451294 |
| fast    |  0.4458314776420593 |
+---------+---------------------+

Every MATCH query requires a prefix that specifies which column to search, thus each column is searched using the appropriate analyzer.

Example 12b: JSON and BSON Keypath Analyzers

Create a table that indexes the json_column$english_content field with the english analyzer and indexes the json_column$french_content field with the french analyzer.

SQL

CREATE TABLE json_keypath_analyzers (
    json_column JSON,
    FULLTEXT USING VERSION 2 KEY(json_column)
    INDEX_OPTIONS '{
        "mappings": {
            "json_column$english_content": {
                "analyzer": "english"
             },
            "json_column$french_content": {
                 "analyzer": "french"
             }
        }
    }'
);

Insert data into the table and optimize the table to ensure all the data is indexed.

SQL

INSERT INTO json_keypath_analyzers
VALUES 
('{"english_content": "We are the fastest database."}'),
('{"french_content": "Nous sommes la base de données la plus rapide."}'),
('{"english_content": "This database is kind of slow."}'),
('{ "french_content": "Cette base de données est un peu lente."}'),
('{"english_content": "his database is the slowest."}'),
('{"french_content": "Cette base de données est la plus lente."}');

OPTIMIZE TABLE json_keypath_analyzers FLUSH;

The following query searches the english_content field using the english analyzer, and the french_content field using the french analyzer. The query returns all the rows with fastest in the english_content field and rapide in the french_content field.

SQL

SELECT json_column,
(MATCH (TABLE json_keypath_analyzers) 
 AGAINST ('json_column$english_content:fastest
          OR json_column$french_content:rapide')
) AS score
FROM json_keypath_analyzers
WHERE score > 0
ORDER BY score desc;

+--------------------------------------------------------------------+--------------------+
| json_column                                                        | score              |
+--------------------------------------------------------------------+--------------------+
| {"english_content":"We are the fastest database."}                 | 0.3150668740272522 |
| {"french_content":"Nous sommes la base de donnes la plus rapide."} | 0.3150668740272522 |
+--------------------------------------------------------------------+--------------------+

The following query searches only for the word fastest in the english_content field.

SQL

SELECT json_column,
(MATCH(TABLE json_keypath_analyzers) 
 AGAINST ('json_column$english_content:fastest')
) AS score
FROM json_keypath_analyzers
WHERE score > 0
ORDER BY score desc;

+----------------------------------------------------+--------------------+
| json_column                                        | score              |
+----------------------------------------------------+--------------------+
| {"english_content":"We are the fastest database."} | 0.3150668740272522 |
+----------------------------------------------------+--------------------+

Supported Language Analyzers

The following table lists the supported language analyzers.

Language	Default Stop Word List Link
`arabic`	Apache Lucene Arabic Stop Words
`bulgarian`	Apache Lucene Bulgarian Stop Words
`bengali`	Apache Lucene Bengali Stop Words
`brazilian_portuguese`	Apache Lucene Brazilian, Portuguese Stop Words
`catalan`	Apache Lucene Catalan Stop Words
`cjk`	Apache Lucene CJK Stop Words
`sorani_kurdish`	Apache Lucene Sorani, Kurdish Stop Words
`czech`	Apache Lucene Czech Stop Words
`danish`	Apache Lucene Danish Stop Words
`german`	Apache Lucene German Stop Words
`greek`	Apache Lucene Greek Stop Words
`english`	"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
`spanish`	Apache Lucene Spanish Stop Words
`estonian`	Apache Lucene Estonian Stop Words
`basque`	Apache Lucene Basque Stop Words
`persian`	Apache Lucene Persian Stop Words
`finnish`	Apache Lucene Finnish Stop Words
`french`	Apache Lucene French Stop Words
`irish`	Apache Lucene Irish Stop Words
`galician`	Apache Lucene Galician Stop Words
`hindi`	Apache Lucene Hindi Stop Words
`hungarian`	Apache Lucene Hungarian Stop Words
`armenian`	Apache Lucene Armenian Stop Words
`indonesian`	Apache Lucene Indonesian Stop Words
`italian`	Apache Lucene Italian Stop Words
`korean`	This is Apache Lucene's Korean (Nori) Analyzer. Filters tokens based on part-of-speech tags: EF, EC, ETN, ETM, IC, JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC, MAG, MAJ, MM, SP, SSC, SSO, SC, SE, XPN, XSA, XSN, XSV, UNA, NA, VSV. Part of speech tags. Custom stop word lists are not supported with the `korean` analyzer Lucene nori API. Lucene Analyzer for Korean.
`lithuanian`	Apache Lucene Lithuanian Stop Words
`latvian`	Apache Lucene Latvian Stop Words
`nepali`	Apache Lucene Nepali Stop Words
`dutch`	Apache Lucene Dutch Stop Words
`norwegian`	Apache Lucene Norwegian Stop Words
`portuguese`	Apache Lucene Portuguese Stop Words
`romanian`	Apache Lucene Romanian Stop Words
`russian`	Apache Lucene Russian Stop Words
`serbian`	Apache Lucene Serbian Stop Words
`swedish`	Apache Lucene Swedish Stop Words
`tamil`	Apache Lucene Tamil Stop Words
`telegu`	Apache Lucene Telegu Stop Words
`thai`	Apache Lucene Thai Stop Words
`turkish`	Apache Lucene Turkish Stop Words

Supported Tokenizers

The table below lists supported tokenizers. These tokenizers may have custom parameters, which can be obtained and used as described below.

Get Parameters

The parameters and description of each of these tokenizers can be obtained from the links included in the table.

Example: Get parameters for the `uax_url_email` tokenizer

To obtain the parameters for the uax_url_email tokenizer, follow the tokenizer factory link for the uax_url_email tokenizer, which can be found in the middle column of the table below.

The following is the tokenizer factory from the uax_url_email tokenizer, which has been obtained from the tokenizer factory link. This tokenizer has one parameter maxTokenLength, which defaults to 255.

SQL

<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
   </analyzer>
 </fieldType>

The INDEX_OPTIONS string to create a full-text index with the uax_url_email tokenizer specifying a maxTokenLength of 300 is shown below.

SQL

INDEX_OPTIONS '{
    "analyzer": {
   	  "custom": {
              "tokenizer": {
                   "uax_url_email" : {
	                 "maxTokenLength": 300
                   }
               }
   	  }
    }
}'

List of Supported Tokenizers

"tokenizer" (Case-Sensitive)	Tokenizer Factory Link (Includes Parameters)	Tokenizer Class Link (Includes Description)
`uax_url_email`	UAX29URLEmailTokenizerFactory	UAX29URLEmailTokenizer
`whitespace`	WhitespaceTokenizerFactory	WhitespaceTokenizer
`classic`	ClassicTokenizerFactory	ClassicTokenizer
`simple_pattern`	SimplePatternTokenizerFactory	SimplePatternTokenizer
`standard`	StandardTokenizerFactory	StandardTokenizer
`keyword`	KeywordTokenizerFactory	KeywordTokenizer
`letter`	LetterTokenizerFactory	LetterTokenizer
`simple_pattern_split`	SimplePatternSplitTokenizerFactory	SimplePatternSplitTokenizer
`pattern`	PatternTokenizerFactory	PatternTokenizer
`thai`	ThaiTokenizerFactory	ThaiTokenizer
`edge_n_gram`	EdgeNGramTokenizerFactory	EdgeNGramTokenizer
`n_gram`	NGramTokenizerFactory	NGramTokenizer
`wikipedia`	WikipediaTokenizerFactory	WikipediaTokenizer
`path_hierarchy`	PathHierarchyTokenizerFactory	PathHierarchyTokenizer
`korean`	Description: Tokenizer for Korean that uses morphological analysis. Supports the following attributes: userDictionary (JSON array of strings): A JSON array of strings; each string is a term in the dictionary. decompoundMode (JSON string): determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT, and POS.Type.PREANALYSIS tokens. Values can be 'none', 'discard', 'mixed', the default is 'discard'. outputUnknownUnigrams (JSON boolean value): If "true" outputs unigrams for unknown words. discardPunctuation (JSON boolean value): If "true", punctuation tokens are dropped from the output.

Supported Token Filters

This table lists the supported token filters, the filter name and a link for the token filter factory documentation which provides parameters and description for the token filter.

"token_filters" (Case-Sensitive)	Lucene Link for Parameters and Description
`russian_light_stem`	RussianLightStemFilterFactory
`scandinavian_normalization`	ScandinavialnNormalizationFilterFactory
`decimal_digit`	DecimalDigitFilterFactory
`ascii_folding`	ASCIIFoldingFilterFactory
`german_stem`	GermanStemFilterFactory
`bulgarian_stem`	BulgarianStemFilterFactory
`codepoint_count`	CodepointCountFilterFactory
`pattern_replace`	PatternReplaceFilterFactory
`persian_normalization`	PersianNormalizationFilterFactory
`limit_token_position`	LimitTokenPositionFilterFactory
`porter_stem`	PorterStemFilterFactory
`greek_stem`	GreekStemFilterFactory
`finnish_light_stem`	FinnishLightStemFilterFactory
`fingerprint`	FingerprintFilterFactory
`cjk_width`	CJKWidthFilterFactory
`reverse_string`	ReverseStringFilterFactory
`common_grams`	CommonGramsFilterFactory
`delimited_boost_token`	DelimitedBoostTokenFilterFactory
`scandinavian_folding`	ScandinavianFoldingFilterFactory
`hindi_stem`	HindiStemFilterFactory
`spanish_plural_stem`	SpanishPluralStemFilterFactory
`indonesian_stem`	IndonesianStemFilterFactory
`trim`	TrimFilterFactory
`french_light_stem`	FrenchLightStemFilterFactory
`classic`	ClassicFilterFactory
`fixed_shingle`	FixedShingleFilterFactory
`english_possessive`	EnglishPossessiveFilterFactory
`german_normalization`	GermanNormalizationFilterFactory
`keyword_repeat`	KeywordRepeatFilterFactory
`min_hash`	MinHashFilterFactory
`remove_duplicates_token`	RemoveDuplicatesTokenFilterFactory
`snowball_porter`	SnowballPorterFilterFactory
`german_minimal_stem`	GermanMinimalStemFilterFactory
`norwegian_light_stem`	NorwegianLightStemFilterFactory
`english_minimal_stem`	EnglishMinimalStemFilterFactory
`norwegian_minimal_stem`	NorwegianMinimalStemFilterFactory
`czech_stem`	CzechStemFilterFactory
`sorani_stem`	SoraniStemFilterFactory
`limit_token_offset`	LimitTokenOffsetFilterFactory
`persian_stem`	PersianStemFilterFactory
`common_grams_query`	CommonGramsQueryFilterFactory
`sorani_normalization`	SoraniNormalizationFilterFactory
`swedish_light_stem`	SwedishLightStemFilterFactory
`k_stem`	KStemFilterFactory
`french_minimal_stem`	FrenchMinimalStemFilterFactory
`hyphenated_words`	HyphenatedWordsFilterFactory
`capitalization`	CapitalizationFilterFactory
`lower_case`	LowerCaseFilterFactory
`hungarian_light_stem`	HungarianLightStemFilterFactory
`telugu_stem`	SynonymGraphFilterFactory
`italian_light_stem`	ItalianLightStemFilterFactory
`limit_token_count`	LimitTokenCountFilterFactory
`swedish_minimal_stem`	SwedishLightStemFilterFactory
`galician_minimal_stem`	GalicianMinimalStemFilterFactory
`portuguese_minimal_stem`	PortugueseMinimalStemFilterFactory
`bengali_normalization`	BengaliNormalizationFilterFactory
`galician_stem`	GalicianStemFilterFactory
`turkish_lower_case`	TurkishLowerCaseFilterFactory
`bengali_stem`	BengaliStemFilterFactory
`indic_normalization`	IndicNormalizationFilterFactory
`keep_word`	KeepWordFilterFactory
`drop_if_flagged`	DictionaryCompoundWordTokenFilterFactory
`latvian_stem`	LatvianStemFilterFactory
`portuguese_light_stem`	PortugueseLightStemFilterFactory
`apostrophe`	ApostropheFilterFactory
`arabic_stem`	ArabicStemFilterFactory
`delimited_term_frequency_token`	DelimitedTermFrequencyTokenFilterFactory
`irish_lower_case`	IrishLowerCaseFilterFactory
`edge_n_gram`	EdgeNGramFilterFactory
`german_light_stem`	GermanLightStemFilterFactory
`pattern_capture_group`	PatternCaptureGroupFilterFactory
`spanish_light_stem`	SpanishLightStemFilterFactory
`hindi_normalization`	HindiNormalizationFilterFactory
`norwegian_normalization`	NorwegianNormalizationFilterFactory
`shingle`	ShingleFilterFactory
`telugu_normalization`	SynonymGraphFilterFactory
`date_recognizer`	DateRecognizerFilterFactory
`n_gram`	NGramFilterFactory
`upper_case`	UpperCaseFilterFactory
`brazilian_stem`	BrazilianStemFilterFactory
`cjk_bigram`	CJKBigramFilterFactory
`truncate_token`	TruncateTokenFilterFactory
`greek_lower_case`	GreekLowerCaseFilterFactory
`length`	LengthFilterFactory
`arabic_normalization`	ArabicNormalizationFilterFactory
`portuguese_stem`	PortugueseStemFilterFactory
`elision`	ElisionFilterFactory
`korean_part_of_speech`	KoreanPartOfSpeechStopFilterFactory A token filter that removes tokens that match a set of part-of-speech tags
`korean_reading_form`	KoreanReadingFormFilterFactory A token filter that rewrites tokens written in Hanja to their Hangul form.
`korean_number`	KoreanNumberFilterFactory A token filter that normalizes Korean numbers to Arabic decimal numbers in half-width characters.
`stop`	A custom token filter that removes stop words from a token stream. Custom Stop Words

Supported Character Filters

This table lists the supported character filters, the name, and a link for the parameters.

"char_filters" (case-sensitive)	Lucene Link for Parameters
`persian`	PersianCharFilterFactory
`cjk_width`	CJKWidthCharFilterFactory
`html_strip`	HTMLStripCharFilterFactory
`pattern_replace`	PatternReplaceCharFilterFactory

Full Text VERSION 2 Custom Analyzers

On this page

Specify an Analyzer

Analyzers

Built-in Analyzers

Custom Analyzers

Common Tokenizers

Common Token Filters

Custom Stop Words

Common Character Filters

Custom Column Mappings

Use Per-Column Analyzers

JSON and BSON Keypath Analyzers

Example of Using Parameters

Stemming

NGrams

Examples

Example 1: Custom Analyzer with Whitespace tokenizer

Example 2: Custom Analyzer with Whitespace tokenizer, html_strip as char_filters, and lower_case as token_filters

Example 3: Custom Analyzer with standard tokenizer and cjk_width as token_filter

Example 4: Korean (nori) Analyzer

Example 5: Korean (nori) Analyzer with User Dictionary

Example 5a - Search for 수영장 with and without User Dictionary

Example 5b - Search for 수영 with and without User Dictionary

Example 6: Custom Analyzer, standard tokenizer, Italian language, snowball_porter stemmer, elision filter

Example 7: Custom Analyzer and N Gram Tokenizer

Example 8: Custom Analyzer with N-Gram Tokenizer, html_strip as char_filters, and lower_case as token_filters

Example 9: Portuguese Analyzer with score

Example 10: Spanish Analyzer with custom stop words

Example 11: Standard Tokenizer with Custom Stop Words

Example 11a: No Stop Words

Example 11b: Default English Stop Words

Example 11c: Custom English Stop Words

Example 11d: Custom English Stop Words - Do Not Ignore Case

Example 12: Custom Column Mappings

Example 12a: Per-Column Analyzers

Example 12b: JSON and BSON Keypath Analyzers

Supported Language Analyzers

Supported Tokenizers

Get Parameters

Example: Get parameters for the uax_url_email tokenizer

List of Supported Tokenizers

Supported Token Filters

Supported Character Filters

Was this article helpful?

On this page

Was this article helpful?

Example: Get parameters for the `uax_url_email` tokenizer