Full Text VERSION 2 Custom Analyzers

A tokenizer takes a stream of characters and breaks that stream into individual tokens, for example split on whitespace characters. A character filter takes the stream of text data and transforms it in a pre-defined way, for example, removing all HTML tags. Token filters receive a stream of tokens and may add, change, or remove tokens, for example, lowercase all tokens or remove stop words.

Users can choose from a list of pre-configured analyzers and use them without any modifications. Users can also create their own analyzers by specifying a tokenizer, character filters, and token filters to obtain a fully customized search experience.

Refer to Working with Full-Text Search for more information on full-text search.

Specify an Analyzer

Specify an analyzer by passing an analyzer configuration in JSON format to INDEX_OPTIONS, which is a JSON string that contains the index configuration. In this JSON, the analyzer key is a string or a nested JSON value.

Specify the name of a built-in analyzer (e.g.: standard, cjk, etc.) as a string.
Specify a customized built-in analyzer or a custom analyzer as a nested JSON value.

The three examples below show a built-in analyzer with no customizations, a built-in analyzer with a customized set of stop words, and a custom analyzer. A full set of examples can be found in Examples. Refer to Analyzers for details on analyzers.

Specify the built-in analyzer for Chinese, Japanese, and Korean characters, called the cjk analyzer, with no customizations.

SQL

CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{ "analyzer": "cjk"}'
);

Specify the built-in cjk analyzer with a customized set of stop words. Built-in analyzers can be customized with custom stop word lists; no other customizations for built-in analyzers are supported.

SQL

CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{"analyzer": {
               "cjk": {
                    "stopset": [
                           "这",
            	           "那"
                     ]
                }
          }
        }'
);

Note

In addition to the cjk analyzer, the Korean nori analyzer is also supported.

Specify a custom analyzer, which uses the whitespace tokenizer, the html_strip character filter, and the lower_case token filter. The analyzer name must be custom. Additional character and token filters can be specified by adding additional char_filters and token_filters key pairs.

SQL

CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{
	     "analyzer": {
	         "custom": {
	              "tokenizer": "whitespace",
	              "char_filters": ["html_strip"],
  	              "token_filters": ["lower_case"],
                 }
	     }
         }'
);

Analyzers

There are two types of analyzers: built-in analyzers and custom analyzers.

Note

The examples in this section show only the INDEX_OPTIONS string (JSON) and omit the rest of the index creation command.

Built-in Analyzers

Built-in analyzers are pre-configured analyzers including the standard analyzer and language-specific analyzers and do not require configuration. Built-in analyzers may be customized with custom stop-word lists.

The default analyzer is the Apache Lucene standard analyzer, which uses the Apache Lucene standard tokenizer, lowercase token filters, and no stop words.

Specify a built-in analyzer, without customizations, by specifying the name of the analyzer as the value of the analyzer key.

The following example specifies the use of the spanish language analyzer.

SQL

INDEX_OPTIONS '{"analyzer" : "spanish"}'

A custom stop word list can be specified for a built-in analyzer by specifying a stopset in the JSON as shown in the following example. A custom stop word list is the only customization supported for built-in analyzers.

The following example specifies a custom stop word list for the standard analyzer.

The value of the analyzer key is a nested JSON value consisting of a key-value pair with key being the name of the analyzer (spanish in this example), and the value being another key-value pair consisting of the key stopset, and the value a JSON array of stop words.

SQL

INDEX_OPTIONS '{
	"analyzer": {
    	     "spanish": {
        	     "stopset": [
            	     "el",
            	     "la"
        	     ]
    	     }
	}
}'

SingleStore recommends using the default language analyzer, without stop word customization, in most cases, e.g. '{"analyzer" : "catalan"}'.

Refer to Supported Language Analyzers for links to the default list of stop words for each analyzer.

Custom Analyzers

Create a custom analyzer by using the analyzer name custom and by specifying a tokenizer and optional token and character filters.

A custom analyzer must specify:

A required tokenizer - A tokenizer breaks up incoming text into tokens. In many cases, an analyzer will use a tokenizer as the first step in the analysis process. However, to modify text prior to tokenization, use char_filters (see below).
An optional array of token_filters: A token_filter modifies tokens that have been created by the tokenizer. Common modifications performed by a token_filter are deletion, stemming, and case folding.
An optional array of char_filters: A char_filter transforms the text before it is tokenized, while providing corrected character offsets to account for these modifications.

The example below shows the use of all three components, tokenizer, char_filters, and token_filters.

SQL

INDEX_OPTIONS '{
     "analyzer" : {
          "custom": {
	       "tokenizer": "whitespace",
	       "char_filters": ["html_strip"],
  	       "token_filters": ["lower_case"],
          }
     }
}'

Each of these three components (tokenizer, char_filters, token_filters) can be specified as a string with the name of the component or as a nested JSON with a configuration for the component.

The example below specifies a custom analyzer that uses the whitespace tokenizer, with a maximum length of 256 characters.

SQL

INDEX_OPTIONS '{
     "analyzer": {
          "custom": {
               "tokenizer": {
            	     "whitespace": {
                      "maxTokenLen": 256
            	     }
        	}
    	  }
     }
}'

Common Tokenizers

Common tokenizers that are supported are listed in the table below. Refer to Supported Tokenizers for a full list of supported tokenizers.

"tokenizer" (case-sensitive)	Parameters	Description
`whitespace`	`rule` (Optional, string). Defaults to `"unicode"`. `maxTokenLen` (Optional, integer). Defaults to `256`. WhitespaceTokenizerFactory	Divides text at whitespace characters as defined by Character.isWhitespace(int). This definition excludes non-breaking spaces from whitespace characters. WhitespaceTokenizer
`standard`	`maxTokenLength` (Optional, integer). Defaults to 255. StandardTokenizerFactory	Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. StandardTokenizer (Lucene 6.6.0 API)
`n_gram`	`minGramSize` (Optional, integer). Defaults to `1`. `maxGramSize` (Optional, integer). Defaults to `2`. NGramTokenizerFactory	Tokenizes the input into n-grams of the specified size(s). NGramTokenizer
`uax_url_email`	`maxTokenLength` (Optional, integer). Defaults to `255`. UAX29URLEmailTokenizerFactory	Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. URLs and email addresses are also tokenized. UAX29URLEmailTokenizer

"tokenizer" (case-sensitive)

Parameters

Description

whitespace

rule (Optional, string). Defaults to "unicode".

maxTokenLen (Optional, integer). Defaults to 256.

WhitespaceTokenizerFactory

Divides text at whitespace characters as defined by Character.isWhitespace(int). This definition excludes non-breaking spaces from whitespace characters.

WhitespaceTokenizer

standard

maxTokenLength (Optional, integer). Defaults to 255.

StandardTokenizerFactory

Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29.

StandardTokenizer (Lucene 6.6.0 API)

n_gram

minGramSize (Optional, integer). Defaults to 1.

maxGramSize (Optional, integer). Defaults to 2.

NGramTokenizerFactory

Tokenizes the input into n-grams of the specified size(s).

NGramTokenizer

uax_url_email

maxTokenLength (Optional, integer). Defaults to 255.

UAX29URLEmailTokenizerFactory

Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. URLs and email addresses are also tokenized.

UAX29URLEmailTokenizer

Common Token Filters

Common token filters that are supported are listed in the table below. Refer to Supported Token Filters for a full list of supported token filters.

"token_filters" (Case-Sensitive)	Parameters	Description
`lower_case`	No parameters. LowerCaseFilterFactory	Normalizes token text to lower case. LowerCaseFilter
`snowball_porter`	`protected` (Optional, string). Defaults to `"protectedkeyword.txt"`. `language` (Optional, string). Defaults to `"English"`. SnowballPorterFilterFactory	Stems words using a Snowball-generated stemmer. Available stemmers are listed in `org.tartarus.snowball.ext`. SnowballFilter
`n_gram`	`minGramSize` (Optional, integer). Defaults to `1`. `maxGramSize` (Optional, integer). Defaults to `2`. `preserveOriginal` (Optional, boolean). Defaults to `"true"`. NGramFilterFactory	Tokenizes the input into n-grams of the given size(s). NGramTokenFilter
`stop`	`words` (Optional, array of stop words) `ignoreCase` (Optional, boolean). If true, all words are lower-cased first. Defaults to `false`.	Custom stop words token filter. Removes stop words from a token stream.

"token_filters" (Case-Sensitive)

Parameters

Description

shingle

minShingleSize (Optional, integer). Defaults to 2.

maxShingleSize (Optional, integer). Defaults to 2.

ShingleFilterFactory

Constructs shingles (token n-grams), that is it creates combinations of tokens as a single token.

ShingleFilter

lower_case

No parameters.

LowerCaseFilterFactory

Normalizes token text to lower case.

LowerCaseFilter

snowball_porter

protected (Optional, string). Defaults to "protectedkeyword.txt".

language (Optional, string). Defaults to "English".

SnowballPorterFilterFactory

Stems words using a Snowball-generated stemmer. Available stemmers are listed in org.tartarus.snowball.ext.

SnowballFilter

n_gram

minGramSize (Optional, integer). Defaults to 1.

maxGramSize (Optional, integer). Defaults to 2.

preserveOriginal (Optional, boolean). Defaults to "true".

NGramFilterFactory

Tokenizes the input into n-grams of the given size(s).

NGramTokenFilter

stop

words (Optional, array of stop words)

ignoreCase (Optional, boolean). If true, all words are lower-cased first. Defaults to false.

Custom stop words token filter.

Removes stop words from a token stream.

Custom Stop Words

SingleStore provides a custom token filter named stop which allows a set of custom stop words to be specified. The stop token filter works with any custom analyzer.

The stop token filter has two parameters words and ignoreCase.

words: An optional parameter containing a list of stop words. The list of stop words must be specified as a JSON array.
ignoreCase: An optional boolean parameter indicating if case should be ignored. If set to true, all words are lower-cased before tokenization. Defaults to false.

Sample syntax for this custom token filter is as follows. Refer to Example 11: Standard Tokenizer with Custom Stop Words for additional examples of using the stop token filter.

SQL

  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS
   '{"analyzer" :
      {"custom" :
        {"tokenizer" : "standard",
          token_filters: [{"stop":
                            {"ignoreCase": false,
                             "words": ["the"]}}
                         ]}
     }}');

Common Character Filters

Common character filters that are supported are listed in the table below. Refer to Supported Character Filters for a full list of supported character filters.

"char_filters" (case-sensitive)	Parameters (includes Lucene Link)	Description (includes Lucene Link)
`html_strip`	`escapedTags` (Optional, string). Defaults to `"a, title"`. HTMLStripCharFilterFactory	Wraps another Reader and attempts to strip out HTML. HTMLStripCharFilter

"char_filters" (case-sensitive)

Parameters (includes Lucene Link)

Description (includes Lucene Link)

html_strip

escapedTags (Optional, string). Defaults to "a, title".

HTMLStripCharFilterFactory

Wraps another Reader and attempts to strip out HTML.

HTMLStripCharFilter

Example of Using Parameters

Specify a default uax_url_email tokenizer:

SQL

INDEX_OPTIONS '{
     "analyzer": {
           "custom": {
        	"tokenizer": "uax_url_email"
    	   }
     }
}'

Specify a uax_url_email tokenizer with custom parameters:

SQL

INDEX_OPTIONS '{
     "analyzer": {
    	  "custom": {
               "tokenizer": {
                    "uax_url_email" : {
	                 "maxTokenLength": 300
                    }
                }
    	  }
     }
}'

Stemming

Stemming transforms words to their root form, often by removing suffixes and prefixes. In English, for example, the words "dressing" and "dressed", can be stemmed to "dress". This allows a search for one form of a verb (e.g. "dressing") to return documents containing other forms of the verb (e.g. "dressed" or "dress"). Stemming is language specific.

In SingleStore, stemming can be handled in two ways:

Use a built-in analyzer available from JLucene that incorporates stemming.
- Many of the language-specific analyzers from JLucene do stem; however, the standard analyzer from JLucene does not stem.
Use a custom analyzer with a token filter such as elision or snowball_porter to customize stemming.

The following CREATE TABLE statement creates a full-text index using the spanish analyzer, which stems for the Spanish language.

SQL

CREATE TABLE spanish_lang (
	text VARCHAR(200)
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{ "analyzer": "spanish"}'
);

The following CREATE TABLE statement uses a custom analyzer with custom token filters to stem for Italian text. Refer to Example 6 for the full example.

SQL

CREATE TABLE italian_architecture (
  architecture VARCHAR(400),
  description VARCHAR(400),
  SORT KEY (architecture),
  FULLTEXT USING VERSION 2 KEY(description)
  INDEX_OPTIONS '{"analyzer" :
        {"custom" : {"tokenizer" : "standard",
                     "token_filters": ["elision",
                                      {"snowball_porter" : {"language": "Italian"}}]}}}}'
);

NGrams

NGram tokenizers split words into small pieces and are good for fast "fuzzy-style" matching using a full-text index. The minimum and maximum gram length is customizable. Refer to Example 7, Example 8, and Example 12 for examples of using a ngram tokenizer.

Examples

Example 1: Custom Analyzer with Whitespace tokenizer

Use a custom analyzer and a whitespace tokenizer to search for text with a hyphen in queries.

Create a table, insert data, and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE medium_articles (
   title VARCHAR(200),
   summary TEXT,
   FULLTEXT USING VERSION 2 (summary) INDEX_OPTIONS
   '{
       "analyzer": {
           "custom": {
               "tokenizer": "whitespace"
           }
       }
   }'
);
INSERT INTO medium_articles (title, summary) VALUES
('Build Real-Time Multimodal RAG Applications Using SingleStore!','This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses.'),
('Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case','This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency.'),
('Scaling RAG from POC to Production','This guide explains Retrieval-Augmented Generation (RAG) for building reliable, context-aware applications using large language models (LLMs) and emphasizes the importance of scaling from proof of concept to production.'),
('Tech Stack For Production-Ready LLM Applications In 2024','This guide reviews preferred tools for the entire LLM app development lifecycle, emphasizing simplicity and ease of use in building scalable AI applications.'),
('LangGraph + Gemini Pro + Custom Tool + Streamlit = Multi-Agent Application Development','This guide teaches you to create a chatbot using LangGraph and Streamlit, leveraging LangChain for building stateful multi-actor applications that respond to user support requests.');
OPTIMIZE TABLE medium_articles FLUSH;

Observe the difference between the results of the two search queries below.

SQL

SELECT *
FROM medium_articles
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multimodal");

+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                          | summary                                                                                                                                                                        |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Build Real-Time Multimodal RAG Applications Using SingleStore! | This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses. |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

SQL

SELECT *
FROM medium_articles
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multi-modal");

+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                                    | summary                                                                                                                                                                     |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case | This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency. |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Example 2: Custom Tokenizer, Character Filter, and Token Filter to Search for HTML Entities

Use a custom analyzer, a whitespace tokenizer, html_strip as a character filter, and lower_case as a token filter to search for HTML entities in queries.

A character filter receives the original text data and converts it into a predefined format. A token filter receives a stream of tokens and can add, change, or remove tokens as needed.

In this example, html_strip as a character filter removes HTML tags and lower_case as a token filter lowercases the tokens.

Create a table, insert data, and optimize the table to ensure all data is included in results.

Search for HTML entities in queries.

SQL

CREATE TABLE html_table (
   title VARCHAR(200),
   content VARCHAR(200),
   FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
  '{
     "analyzer": {
         "custom": {"char_filters": ["html_strip"],
                    "tokenizer": "whitespace",
                    "token_filters":["lower_case"]
                    }
     }
  }'
);
INSERT INTO html_table (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');
OPTIMIZE TABLE html_table FLUSH;

Search for HTML entity, and observe the result of the search query.

SQL

SELECT *
FROM html_table
WHERE MATCH(TABLE html_table) AGAINST("content:we're");

+---------------+----------------------------------------------------------------+
| title         | content                                                        |
+---------------+----------------------------------------------------------------+
| Success Story | Our team has achieved great things &amp; we&apos;re proud!</p> |
| Exciting News | We&apos;re thrilled to announce our new project!</p>           |
| Future Goals  | We&apos;re looking forward to achieving even more!</p>         |
+---------------+----------------------------------------------------------------+

Example 3: Custom Analyzer, standard Tokenizer, and Custom Token Filter (`cjk_width`) to Search Japanese Text

Use a custom analyzer, a standard tokenizer, and cjk_width as a token filter to search for a Japanese text in queries.

In this example, cjk_width as a token filter normalizes the width differences in CJK (Chinese, Japanese, and Korean) characters.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE japanese_novels (
 title VARCHAR(200),
 content VARCHAR(200),
 FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
'{
   "analyzer": {
       "custom": {"tokenizer": "standard",
                  "token_filters":["cjk_width"]
                  }
   }
}'
);
INSERT INTO japanese_novels (title, content) VALUES
('ノルウェイの森', '村上春樹の代表作で、愛と喪失をテーマにしています。'),
('吾輩は猫である', '夏目漱石による作品で、猫の視点から人間社会を描いています。'),
('雪国', '川端康成の作品で、美しい雪景色と切ない恋を描いています。'),
('千と千尋の神隠し', '宮崎駿の作品で、少女が異世界で成長する物語です。'),
('コンビニ人間', '村田沙耶香の作品で、現代社会の孤独と適応を描いています。');
OPTIMIZE TABLE japanese_novels FLUSH;

Observe the result of the search query for the Japanese text below.

SQL

SELECT *
FROM japanese_novels
WHERE MATCH(TABLE japanese_novels) AGAINST("content: 夏");

+-----------------------+-----------------------------------------------------------------------------------------+
| title                 | content                                                                                 |
+-----------------------+-----------------------------------------------------------------------------------------+
| 吾輩は猫である        | 夏目漱石による作品で、猫の視点から人間社会を描いています。                              |
+-----------------------+-----------------------------------------------------------------------------------------+

Example 4: Korean (nori) Analyzer

Use the korean analyzer to search for Korean text in queries. This analyzer is also known as the nori analyzer.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE k_drama (
 genre VARCHAR(200),
 movie_name TEXT,
 cast TEXT,
 FULLTEXT USING VERSION 2 (genre) INDEX_OPTIONS
 '{
     "analyzer": "korean"
 }'
);
INSERT INTO k_drama (genre, movie_name, cast)
VALUES
 ('로맨스', '사랑의 불시착', '현빈, 손예진'),
 ('액션, 스릴러', '빈센조', '송중기, 전여빈'),
 ('드라마, 로맨스', '도깨비', '공유, 김고은'),
 ('사극, 드라마', '미스터 션샤인', '이병헌, 김태리'),
 ('코미디, 로맨스', '김비서가 왜 그럴까', '박서준, 박민영');
OPTIMIZE TABLE k_drama FLUSH;

Observe the result of the search query for the Korean text below.

SQL

SELECT * FROM k_drama WHERE MATCH(TABLE k_drama) AGAINST("genre:로맨스");

+----------------------+----------------------------+----------------------+
| genre                | movie_name                 | cast                 |
+----------------------+----------------------------+----------------------+
| 드라마, 로맨스          | 도깨비                       | 공유, 김고은            |
| 코미디, 로맨스          | 김비서가 왜 그럴까              | 박서준, 박민영          |
| 로맨스                | 사랑의 불시착                  | 현빈, 손예진            |
+----------------------+----------------------------+----------------------+

Example 5: Korean (nori) Analyzer with User Dictionary

Use the korean analyzer (also known as the nori analyzer) with and without a user dictionary to search for a Korean text in queries. This example demonstrates how a user dictionary can be used to add words, specifically compound words, to the dictionary used by the korean analyzer. Note that the user dictionary is augmentative, meaning that it adds the specified words to the existing dictionary, it does not replace the existing dictionary.

Create two tables with a full text index with the korean analyzer, one with and one without a user dictionary. The Korean compound word 수영장, which translates to "swimming pool" in English is inserted in the user dictionary.

SQL

CREATE TABLE korean_user_dict
    (id INT,
     phrase VARCHAR(400),
     FULLTEXT USING VERSION 2 (phrase) INDEX_OPTIONS
       '{"analyzer" :
         {"custom":
          {"tokenizer":
           {"korean": {"userDictionary": ["수영장"]}
           }
         }
        }
}');

SQL

CREATE TABLE korean
     (id INT,
      phrase varchar(400),
      FULLTEXT USING VERSION 2 (phrase) INDEX_OPTIONS
        '{
           "analyzer": "korean"
         }'
);

Insert data into the tables and optimize them to ensure all data is included in results.

SQL

INSERT INTO korean_user_dict
    VALUES (1, "수영장"), (2, "수영"), (3, "장");
OPTIMIZE TABLE korean_user_dict FLUSH;

SQL

INSERT INTO korean
    VALUES (1, "수영장"), (2, "수영"), (3, "장");
OPTIMIZE TABLE korean FLUSH;

When 수영장 is inserted into the korean_user_dict table, because 수영장 is in the user dictionary, 수영장 is tokenized as a single, atomic token and will not be further tokenized.

In contrast, when 수영장 is inserted into the korean table, 수영장 is tokenized into 수영 and 장. By default, compound words are decomposed, and the original form is discarded (decompoundMode is discard by default).

Example 5a - Search for 수영장 with and without User Dictionary

The following queries search for the compound word 수영장 in both the korean_user_dict and korean tables. This and the following example demonstrate searching with and without a user dictionary.

The ORDER BY clause is included to ensure consistent ordering of results.

SQL

SELECT id, phrase, BM25(korean_user_dict, 'phrase:(수영장)') AS score
FROM korean_user_dict
ORDER BY id;

+------+-----------------+--------------------+
| id   | text            | score              |
+------+-----------------+--------------------+
|    1 | 수영장            | 0.4458314776420593 |
|    2 | 수영              |                  0 |
|    3 | 장               |                  0 |
+------+-----------------+--------------------+

SQL

SELECT id, phrase, BM25(korean, 'phrase:(수영장)') AS SCORE
FROM korean
ORDER BY id;

+------+-----------------+---------------------+
| id   | text            | score               |
+------+-----------------+---------------------+
|    1 | 수영장            | 0.35471969842910767 |
|    2 | 수영             | 0.23797652125358582 |
|    3 | 장               | 0.23797652125358582 |
+------+-----------------+---------------------+

When searching the korean_user_dict table, 수영장 matches only 수영장 and not 수영 or 장 because 수영장 is tokenized as a single, atomic token.

In contrast, when searching the korean table, 수영장 matches 수영장, 수영, and 장 because 수영장 is tokenized as two tokens: 수영, and 장, hence 수영장 partially matches all three rows in the table.

Finally, the score for 수영장 is higher when searching the table korean_user_dict than when searching the table korean because 수영장 matches only a single row in korean_user_dict.

Example 5b - Search for 수영 with and without User Dictionary

The following queries search for the word 수영 in the tables with and without user dictionary. The ORDER BY clause is included to ensure consistent ordering of results.

SQL

SELECT id, phrase, BM25(korean_user_dict, 'phrase:(수영)') AS SCORE
FROM korean_user_dict ORDER BY id;

+------+-----------------+--------------------+
| id   | text            | SCORE              |
+------+-----------------+--------------------+
|    1 | 수영장            |                  0 |
|    2 | 수영             | 0.4458314776420593 |
|    3 | 장               |                  0 |
+------+-----------------+--------------------+

SQL

SELECT id, phrase, BM25(korean, 'phrase:(수영)') AS SCORE
FROM korean ORDER BY id;

+------+-----------------+---------------------+
| id   | text            | SCORE               |
+------+-----------------+---------------------+
|    1 | 수영장            | 0.17735984921455383 |
|    2 | 수영             |  0.23797652125358582 |
|    3 | 장               |                   0 |
+------+-----------------+---------------------+

When searching the korean_user_dict table, 수영 matches only itself, because, as described earlier, 수영장 has been inserted in the user dictionary and 수영장 is tokenized as a single token, so it does not match 수영.

In contrast, when searching the korean table, 수영 matches 수영장 and 수영 because 수영장 is tokenized as two tokens: 수영, and 장.

In addition, the score for 수영 is higher when searching the table korean_user_dict than when searching the table korean because 수영 matches only a single row in korean_user_dict.

Example 6: Custom Analyzer, standard Tokenizer, Custom Token Filters (`elision`, `snowball_porter`) to Search Italian Text

Use a custom analyzer, a standard tokenizer, elision and snowball_porter as token filters for the Italian language to search for an Italian text.

In this example, elision as a token filter removes specific elisions from the input token. Using snowball_porter as a token filter stems the words using the Lucene Snowball stemmer tokenization. The snowball_porter token filter requires a language parameter to control the stemmer.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE italian_architecture (
   architecture VARCHAR(400),
   description VARCHAR(400),
   SORT KEY (architecture),
   FULLTEXT USING VERSION 2 KEY(description)
   INDEX_OPTIONS '{"analyzer" :
         {"custom" : {"tokenizer" : "standard",
                      "token_filters": ["elision",
                                       {"snowball_porter" : {"language": "Italian"}}]}}}}'
);
INSERT INTO italian_architecture (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');
OPTIMIZE TABLE italian_architecture FLUSH;

Observe the result of the search query for the Italian text below.

SQL

SELECT *
FROM italian_architecture
WHERE MATCH(TABLE italian_architecture) AGAINST("description:l’architettura");

+-----------------+--------------------------------------------------------------------------------------------+
| architecture    | description                                                                                |
+-----------------+--------------------------------------------------------------------------------------------+
| Duomo di Milano | L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.   |
| Palazzo Ducale  | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.        |
+-----------------+--------------------------------------------------------------------------------------------+

Use a custom analyzer, a standard tokenizer, snowball_porter as token filters for the Italian language without elision token filter to search for an Italian text in queries.

Create a second table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE italian_architecture_2 (
   architecture VARCHAR(400),
   description VARCHAR(400),
   SORT KEY (architecture),
   FULLTEXT USING VERSION 2 KEY(description)
   INDEX_OPTIONS '{"analyzer" :
         {"custom" : {"tokenizer" : "standard",
                      "token_filters": {"snowball_porter" : {"language": "Italian"}}}}}}'
);
INSERT INTO italian_architecture_2 (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');
OPTIMIZE TABLE italian_architecture_2 FLUSH;

Observe the result of the search query for the Italian text without elision token filter below.

SQL

SELECT *
FROM italian_architecture_2
WHERE MATCH(TABLE italian_architecture_2) AGAINST("description:l’architettura");

+----------------+---------------------------------------------------------------------------------------+
| architecture   | description                                                                           |
+----------------+---------------------------------------------------------------------------------------+
| Palazzo Ducale | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.   |
+----------------+---------------------------------------------------------------------------------------+

Example 7: Custom Analyzer and N Gram Tokenizer

Use a custom analyzer and a n_gram tokenizer to search for misspelled text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE university
   (name VARCHAR(400),
   admission_page VARCHAR(400),
   SORT KEY (name),
   FULLTEXT USING VERSION 2 KEY(admission_page)
   INDEX_OPTIONS '{"analyzer" : {"custom" : {"tokenizer" : "n_gram"}}}}');
INSERT INTO university (name, admission_page) VALUES
('Harvard University', 'college.harvard.edu/admissions'),
('Stanford University', 'stanford.edu/admission/'),
('Massachusetts Institute of Technology (MIT)', 'mitadmissions.org/'),
('California Institute of Technology (Caltech)', 'admissions.caltech.edu/'),
('University of Chicago', 'uchicago.edu/en/admissions');
OPTIMIZE TABLE university FLUSH;

Observe the result of the search query for the misspelled text and compare the search result with the score below.

SQL

SELECT name,admission_page, MATCH(TABLE university) AGAINST("admission_page:cattec") AS score
FROM university
WHERE score
ORDER BY score DESC;

+----------------------------------------------+--------------------------------+---------------------+
| name                                         | admission_page                 | score               |
+----------------------------------------------+--------------------------------+---------------------+
| California Institute of Technology (Caltech) | admissions.caltech.edu/        |  2.4422175884246826 |
| University of Chicago                        | uchicago.edu/en/admissions     |  0.8550153970718384 |
| Harvard University                           | college.harvard.edu/admissions |  0.6825864911079407 |
| Stanford University                          | stanford.edu/admission/        |  0.5768249034881592 |
| Massachusetts Institute of Technology (MIT)  | mitadmissions.org/             | 0.26201900839805603 |
+----------------------------------------------+--------------------------------+---------------------+

Example 8: Custom Analyzer, n_gram Tokenizer, Custom Character Filter (`html_strip`), and Custom Token Filter (`lower_case`) to Search for HTML entities

Use a custom analyzer, n_gram tokenizer, html_strip as character filter, and lower_case as token filter to search for HTML entities in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE html_table_n_gram (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
 '{
    "analyzer": {
        "custom": {"char_filters": ["html_strip"],
                   "tokenizer": "n_gram",
                   "token_filters":["lower_case"]
                   }
    }
 }'
);
INSERT INTO html_table_n_gram (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');
OPTIMIZE TABLE html_table_n_gram FLUSH;

Observe the result of the search query for the misspelled HTML entity and compare the search result with the score below.

SQL

SELECT title,content, MATCH(TABLE html_table_n_gram) AGAINST("content:I',") AS score
FROM html_table_n_gram
WHERE score
ORDER BY score DESC;

+------------------+--------------------------------------------------------------------+---------------------+
| title            | content                                                            | score               |
+------------------+--------------------------------------------------------------------+---------------------+
| Learning Journey | Learning is a never-ending journey &amp; I&apos;m excited!</p>     |  0.5430432558059692 |
| Success Story    | Our team has achieved great things &amp; we&apos;re proud!</p>     | 0.31375283002853394 |
| Exciting News    | We&apos;re thrilled to announce our new project!</p>               | 0.26527124643325806 |
| Future Goals     | We&apos;re looking forward to achieving even more!</p>             |  0.2177681028842926 |
| Grateful Heart   | Thank you for being a part of our journey &amp; supporting us!</p> |  0.1819886565208435 |
+------------------+--------------------------------------------------------------------+---------------------+

Example 9: Portuguese Analyzer with score

Use a portuguese analyzer to search for a Portuguese text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE portuguese_news (
  headline VARCHAR(200),
  content TEXT,
  FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
  '{
      "analyzer": "portuguese"
  }'
);
INSERT INTO portuguese_news (headline, content) VALUES
('Cenário Econômico Brasileiro', 'O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.'),
('Mercado de Ações em Alta', 'As ações brasileiras registraram ganhos significativos, impulsionadas por resultados financeiros positivos de grandes empresas.'),
('Nova Política Monetária do Banco Central', 'O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.'),
('Investimentos Estrangeiros no Brasil', 'O país atraiu um aumento de investimentos estrangeiros diretos, especialmente em setores de tecnologia e energia renovável.'),
('Tendências do Mercado Imobiliário', 'O mercado imobiliário brasileiro mostra sinais de recuperação, com aumento nas vendas de imóveis e novos lançamentos.');
OPTIMIZE TABLE portuguese_news FLUSH;

Observe the result of the search query for the Portuguese text and compare the search result with the score below.

SQL

SELECT content, MATCH(TABLE portuguese_news) AGAINST ("content:Brasil") AS score
FROM portuguese_news
WHERE score
ORDER BY score DESC;

+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| content                                                                                                                             | score               |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.                                  | 0.22189012169837952 |
| O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.       |  0.2059776782989502 |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+

Example 10: Spanish Analyzer with custom stop words

Use a spanish analyzer with custom stop words to search for a Spanish text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

SQL

CREATE TABLE spanish_news (
 headline VARCHAR(200),
 content TEXT,
 FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
 '{
     "analyzer": {"spanish": {"stopset": ["descubrimiento", "tratamiento", "nuevo"]}}
 }'
);
INSERT INTO spanish_news (headline, content) VALUES
('Descubrimiento de un nuevo tratamiento para la diabetes', 'Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.'),
('Avances en la detección temprana del cáncer', 'Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.'),
('Nuevo enfoque para tratar enfermedades cardíacas', 'Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.'),
('Investigación sobre un gen relacionado con el Alzheimer', 'Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.'),
('Desarrollo de una vacuna contra COVID-19', 'Un equipo de investigadores ha anunciado resultados prometedores en la efectividad de una nueva vacuna contra COVID-19.');
OPTIMIZE TABLE spanish_news FLUSH;

Observe the results of two search queries below: one for the defined Spanish stop word and another for the actual Spanish stop word. The defined stop words above overwrite the actual stop words.

SQL

SELECT *
FROM spanish_news
WHERE MATCH(TABLE spanish_news) AGAINST("content:nuevo");

Empty set

SQL

SELECT *
FROM spanish_news
WHERE MATCH(TABLE spanish_news) AGAINST("content:el");

+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| headline                                                 | content                                                                                                                                              |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Investigación sobre un gen relacionado con el Alzheimer  | Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.   |
| Avances en la detección temprana del cáncer              | Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.                             |
| Descubrimiento de un nuevo tratamiento para la diabetes  | Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.                         |
| Nuevo enfoque para tratar enfermedades cardíacas         | Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.                                         |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+

Example 11: Standard Tokenizer with Custom Stop Words

Use a standard tokenizer with custom stop words.

Create a procedure to insert data into a table named t. This procedure will be used to generate data for the examples in this section.

SQL

DELIMITER //
CREATE OR REPLACE PROCEDURE insert_flush()
AS
BEGIN
   INSERT INTO t VALUES( 1, "On April 23, 2013, SingleStore launched its first generally available version of the database to the public as MemSQL."),
   (2, "Early versions only supported row-oriented tables, and were highly optimized for cases where all data can fit within main memory."),
   (3, "This design was based on the idea that the cost of RAM would continue to decrease exponentially over time, in a trend similar to Moore's law."),
   (4, "This would eventually allow most use cases for database systems to store their data exclusively in memory not on disk."),
   (5, "Shortly after launch, MemSQL added general support for an on-disk column-based storage format to work alongside the in-memory rowstore."),
   (6, "The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads."),
   (7, "Thus, over time, MemSQL's columnstore became a major focus and a crucial feature for customers."),
   (8, "On October 27, 2020, MemSQL rebranded to SingleStore to reflect a shift in focus away from exclusively in-memory workloads."),
   (9, "The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases."),
   (10, "In its current product release, v.7.5, SingleStore became the first and only database to combine separation of storage and compute plus system of record into a single platform."),
   (11, "Headquartered in San Francisco, California, in June 2021 singlestore.com opened an office in Raleigh, North Carolina. As part of the office opening, SingleStore launched Launch Pad, a center for innovation to incubate and prototype solutions."),
   (12, "Its other offices include Sunnyvale, California, seattle@singlestore.com, Washington, and Lisbon, Portugal."),
   (13, "seattle@singlestore.com");
   OPTIMIZE TABLE t FLUSH;
END;//
DELIMITER ;

Example 11a: No Stop Words

Create a table without stop words and insert data into that table.

SQL

CREATE TABLE t (
  id INT,text VARCHAR(400),
  SORT KEY (id),
  FULLTEXT USING VERSION 2 KEY(text)
    INDEX_OPTIONS
      '{"analyzer" :
         {"custom" : {"tokenizer" : "standard"}}
       }');
CALL insert_flush();

The two queries below query the table for the words The and the, respectively. Since there are no stop words, results will be returned for both queries.

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

*** 1. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.5822426080703735
*** 2. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.4953887462615967
2 rows in set (0.09 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

*** 1. row ***
 text: Shortly after launch, MemSQL added general support for an on-disk column-based storage format to work alongside the in-memory rowstore.
SCORE: 0.5361358523368835
*** 2. row ***
 text: On April 23, 2013, SingleStore launched its first generally available version of the database to the public as MemSQL.
SCORE: 0.216837078332901
*** 3. row ***
 text: This design was based on the idea that the cost of RAM would continue to decrease exponentially over time, in a trend similar to Moore's law.
SCORE: 0.19964565336704254
*** 4. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.1568010002374649
*** 5. row ***
 text: In its current product release, v.7.5, SingleStore became the first and only database to combine separation of storage and compute plus system of record into a single platform.
SCORE: 0.13726447522640228
*** 6. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.1351594626903534
*** 7. row ***
 text: Headquartered in San Francisco, California, in June 2021 singlestore.com opened an office in Raleigh, North Carolina. As part of the office opening, SingleStore launched Launch Pad, a center for innovation to incubate and prototype solutions.
SCORE: 0.12553386390209198

Drop the table so that the tablename t can be used with the subsequent examples.

SQL

DROP TABLE t;

Example 11b: Default English Stop Words

Create a table using default English stop words and query that table for the words The and the. The words The and the are included in default English stop words. Thus, neither query returns results.

SQL

CREATE TABLE t (
  id INT, text VARCHAR(400),
  SORT KEY (id),
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS
   '{"analyzer" :
      {"custom" :
        {"tokenizer" : "standard", token_filters: ["stop"]}
     }}');
CALL insert_flush();

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

Empty set (0.01 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

Empty set (0.01 sec)

SQL

DROP TABLE t;

Example 11c: Custom English Stop Words

Create a table using custom English stop words and insert the word the as a stop word. And query the table for the words The and the. By default, when using the custom stop words token filter, case is ignored, so no results are returned for either query.

SQL

CREATE TABLE t (
  id INT, text VARCHAR(400),
  SORT KEY (id),
  FULLTEXT USING VERSION 2 KEY (text) index_options
   '{"analyzer" :
      {"custom" :
        {"tokenizer" : "standard", token_filters: [{"stop": {"words": ["the"]}}]}
     }}');
CALL insert_flush();

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

Empty set (0.06 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

Empty set (0.05 sec)

SQL

DROP TABLE t;

Example 11d: Custom English Stop Words - Do Not Ignore Case

In the following, the query is modified so case is not ignored. In this example, results are returned for the first query, but not the second.

SQL

CREATE TABLE t (
  id INT, text VARCHAR(400),
  SORT KEY (id),
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS
   '{"analyzer" :
      {"custom" :
        {"tokenizer" : "standard",
          token_filters: [{"stop": {"ignoreCase": false,"words": ["the"]}}]}
     }}');
CALL insert_flush();

SQL

SELECT text, MATCH(table t) AGAINST ('text:The') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

*** 1. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.7002022862434387
*** 2. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.6494264602661133
2 rows in set (0.06 sec)

SQL

SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE
FROM t
WHERE score
ORDER BY score DESC;

Empty set (0.03 sec)

SQL

DROP TABLE t;

Example 12 - Use ngrams to Find a String that Contains a Substring

The n_gram tokenizer can be used to search for strings that contain a specific substring.

That is, you can use ngrams to do a search that is equivalent to having wildcards at the beginning and end of a search term, as shown in the following example.

Create a table with a full-text index that uses the n_gram tokenizer and insert data into that table.

SQL

CREATE TABLE university
   (name VARCHAR(400),
   admission_page VARCHAR(400),
   SORT KEY (name),
   FULLTEXT USING VERSION 2 KEY(admission_page)
   INDEX_OPTIONS '{"analyzer" : {"custom" : {"tokenizer" : "n_gram"}}}}');
INSERT INTO university (name, admission_page) VALUES
('Harvard University', 'college.harvard.edu/admissions'),
('Stanford University', 'stanford.edu/admission/'),
('Massachusetts Institute of Technology (MIT)', 'mitadmissions.org/'),
('California Institute of Technology (Caltech)', 'admissions.caltech.edu/'),
('University of Chicago', 'uchicago.edu/en/admissions');
OPTIMIZE TABLE university FLUSH;

The following statements and query perform the equivalent of a match against %arvar%. The substring arvar is matched because the n_gram tokenizer splits words into small pieces. As a result, the name Harvard matches arvar with the highest ranked score.

SQL

SET sql_mode = pipes_as_concat;
SET @q = "admission_page:" || "arvar";
SELECT *, MATCH(TABLE university) AGAINST(@q) as score
FROM university
WHERE MATCH(TABLE university) AGAINST(@q)
ORDER BY score DESC;

+----------------------------------------------+--------------------------------+---------------------+
| name                                         | admission_page                 | score               |
+----------------------------------------------+--------------------------------+---------------------+
| Harvard University                           | college.harvard.edu/admissions |   2.921722650527954 |
| Massachusetts Institute of Technology (MIT)  | mitadmissions.org/             |  0.5870963335037231 |
| Stanford University                          | stanford.edu/admission/        |  0.5820327997207642 |
| University of Chicago                        | uchicago.edu/en/admissions     |  0.2338818907737732 |
| California Institute of Technology (Caltech) | admissions.caltech.edu/        | 0.16432610154151917 |
+----------------------------------------------+--------------------------------+---------------------+

In addition, you can use a common table expression (CTE) and LIKE, to obtain partial string and substring matches with the speed of a full-text index. The query below runs faster than a query with only a LIKE expression.

SQL

WITH matches AS (
    SELECT *, MATCH(TABLE university) AGAINST(@q) AS score
    FROM university
    WHERE MATCH(TABLE university) AGAINST(@q)
    ORDER BY score DESC
    LIMIT 10
)
SELECT *
FROM matches
WHERE admission_page LIKE '%arvar%';

+--------------------+--------------------------------+-------------------+
| name               | admission_page                 | score             |
+--------------------+--------------------------------+-------------------+
| Harvard University | college.harvard.edu/admissions | 2.921722650527954 |
+--------------------+--------------------------------+-------------------+

Supported Language Analyzers

The following table lists the supported language analyzers.

Language	Default Stop Word List Link
`arabic`	Apache Lucene Arabic Stop Words
`bulgarian`	Apache Lucene Bulgarian Stop Words
`bengali`	Apache Lucene Bengali Stop Words
`brazilian_portuguese`	Apache Lucene Brazilian, Portuguese Stop Words
`catalan`	Apache Lucene Catalan Stop Words
`cjk`	Apache Lucene CJK Stop Words
`sorani_kurdish`	Apache Lucene Sorani, Kurdish Stop Words
`czech`	Apache Lucene Czech Stop Words
`danish`	Apache Lucene Danish Stop Words
`german`	Apache Lucene German Stop Words
`greek`	Apache Lucene Greek Stop Words
`english`	"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
`spanish`	Apache Lucene Spanish Stop Words
`estonian`	Apache Lucene Estonian Stop Words
`basque`	Apache Lucene Basque Stop Words
`persian`	Apache Lucene Persian Stop Words
`finnish`	Apache Lucene Finnish Stop Words
`french`	Apache Lucene French Stop Words
`irish`	Apache Lucene Irish Stop Words
`galician`	Apache Lucene Galician Stop Words
`hindi`	Apache Lucene Hindi Stop Words
`hungarian`	Apache Lucene Hungarian Stop Words
`armenian`	Apache Lucene Armenian Stop Words
`indonesian`	Apache Lucene Indonesian Stop Words
`italian`	Apache Lucene Italian Stop Words
`korean`	This is Apache Lucene's Korean (Nori) Analyzer. Filters tokens based on part-of-speech tags: EF, EC, ETN, ETM, IC, JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC, MAG, MAJ, MM, SP, SSC, SSO, SC, SE, XPN, XSA, XSN, XSV, UNA, NA, VSV. Part of speech tags. Custom stop word lists are not supported with the `korean` analyzer Lucene nori API. Lucene Analyzer for Korean.
`lithuanian`	Apache Lucene Lithuanian Stop Words
`latvian`	Apache Lucene Latvian Stop Words
`nepali`	Apache Lucene Nepali Stop Words
`dutch`	Apache Lucene Dutch Stop Words
`norwegian`	Apache Lucene Norwegian Stop Words
`portuguese`	Apache Lucene Portuguese Stop Words
`romanian`	Apache Lucene Romanian Stop Words
`russian`	Apache Lucene Russian Stop Words
`serbian`	Apache Lucene Serbian Stop Words
`swedish`	Apache Lucene Swedish Stop Words
`tamil`	Apache Lucene Tamil Stop Words
`telegu`	Apache Lucene Telegu Stop Words
`thai`	Apache Lucene Thai Stop Words
`turkish`	Apache Lucene Turkish Stop Words

Supported Tokenizers

The table below lists supported tokenizers. These tokenizers may have custom parameters, which can be obtained and used as described below.

Get Parameters

The parameters and description of each of these tokenizers can be obtained from the links included in the table.

Example: Get parameters for the `uax_url_email` tokenizer

To obtain the parameters for the uax_url_email tokenizer, follow the tokenizer factory link for the uax_url_email tokenizer, which can be found in the middle column of the table below.

The following is the tokenizer factory from the uax_url_email tokenizer, which has been obtained from the tokenizer factory link. This tokenizer has one parameter maxTokenLength, which defaults to 255.

SQL

<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
   </analyzer>
 </fieldType>

The INDEX_OPTIONS string to create a full-text index with the uax_url_email tokenizer specifying a maxTokenLength of 300 is shown below.

SQL

INDEX_OPTIONS '{
    "analyzer": {
   	  "custom": {
              "tokenizer": {
                   "uax_url_email" : {
	                 "maxTokenLength": 300
                   }
               }
   	  }
    }
}'

List of Supported Tokenizers

"tokenizer" (Case-Sensitive)	Tokenizer Factory Link (Includes Parameters)	Tokenizer Class Link (Includes Description)
`uax_url_email`	UAX29URLEmailTokenizerFactory	UAX29URLEmailTokenizer
`whitespace`	WhitespaceTokenizerFactory	WhitespaceTokenizer
`classic`	ClassicTokenizerFactory	ClassicTokenizer
`simple_pattern`	SimplePatternTokenizerFactory	SimplePatternTokenizer
`standard`	StandardTokenizerFactory	StandardTokenizer
`keyword`	KeywordTokenizerFactory	KeywordTokenizer
`letter`	LetterTokenizerFactory	LetterTokenizer
`simple_pattern_split`	SimplePatternSplitTokenizerFactory	SimplePatternSplitTokenizer
`pattern`	PatternTokenizerFactory	PatternTokenizer
`thai`	ThaiTokenizerFactory	ThaiTokenizer
`edge_n_gram`	EdgeNGramTokenizerFactory	EdgeNGramTokenizer
`n_gram`	NGramTokenizerFactory	NGramTokenizer
`wikipedia`	WikipediaTokenizerFactory	WikipediaTokenizer
`path_hierarchy`	PathHierarchyTokenizerFactory	PathHierarchyTokenizer
`korean`	Description: Tokenizer for Korean that uses morphological analysis. Supports the following attributes: userDictionary (JSON array of strings): A JSON array of strings; each string is a term in the dictionary. decompoundMode (JSON string): determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT, and POS.Type.PREANALYSIS tokens. Values can be 'none', 'discard', 'mixed', the default is 'discard'. outputUnknownUnigrams (JSON boolean value): If "true" outputs unigrams for unknown words. discardPunctuation (JSON boolean value): If "true", punctuation tokens are dropped from the output.

Supported Token Filters

This table lists the supported token filters, the filter name and a link for the token filter factory documentation which provides parameters and description for the token filter.

"token_filters" (Case-Sensitive)	Lucene Link for Parameters and Description
`russian_light_stem`	RussianLightStemFilterFactory
`scandinavian_normalization`	ScandinavialnNormalizationFilterFactory
`decimal_digit`	DecimalDigitFilterFactory
`ascii_folding`	ASCIIFoldingFilterFactory
`german_stem`	GermanStemFilterFactory
`bulgarian_stem`	BulgarianStemFilterFactory
`codepoint_count`	CodepointCountFilterFactory
`pattern_replace`	PatternReplaceFilterFactory
`persian_normalization`	PersianNormalizationFilterFactory
`limit_token_position`	LimitTokenPositionFilterFactory
`porter_stem`	PorterStemFilterFactory
`greek_stem`	GreekStemFilterFactory
`finnish_light_stem`	FinnishLightStemFilterFactory
`fingerprint`	FingerprintFilterFactory
`cjk_width`	CJKWidthFilterFactory
`reverse_string`	ReverseStringFilterFactory
`common_grams`	CommonGramsFilterFactory
`delimited_boost_token`	DelimitedBoostTokenFilterFactory
`scandinavian_folding`	ScandinavianFoldingFilterFactory
`hindi_stem`	HindiStemFilterFactory
`spanish_plural_stem`	SpanishPluralStemFilterFactory
`indonesian_stem`	IndonesianStemFilterFactory
`trim`	TrimFilterFactory
`french_light_stem`	FrenchLightStemFilterFactory
`classic`	ClassicFilterFactory
`fixed_shingle`	FixedShingleFilterFactory
`english_possessive`	EnglishPossessiveFilterFactory
`german_normalization`	GermanNormalizationFilterFactory
`keyword_repeat`	KeywordRepeatFilterFactory
`min_hash`	MinHashFilterFactory
`remove_duplicates_token`	RemoveDuplicatesTokenFilterFactory
`snowball_porter`	SnowballPorterFilterFactory
`german_minimal_stem`	GermanMinimalStemFilterFactory
`norwegian_light_stem`	NorwegianLightStemFilterFactory
`english_minimal_stem`	EnglishMinimalStemFilterFactory
`norwegian_minimal_stem`	NorwegianMinimalStemFilterFactory
`czech_stem`	CzechStemFilterFactory
`sorani_stem`	SoraniStemFilterFactory
`limit_token_offset`	LimitTokenOffsetFilterFactory
`persian_stem`	PersianStemFilterFactory
`common_grams_query`	CommonGramsQueryFilterFactory
`sorani_normalization`	SoraniNormalizationFilterFactory
`swedish_light_stem`	SwedishLightStemFilterFactory
`k_stem`	KStemFilterFactory
`french_minimal_stem`	FrenchMinimalStemFilterFactory
`hyphenated_words`	HyphenatedWordsFilterFactory
`capitalization`	CapitalizationFilterFactory
`lower_case`	LowerCaseFilterFactory
`hungarian_light_stem`	HungarianLightStemFilterFactory
`telugu_stem`	SynonymGraphFilterFactory
`italian_light_stem`	ItalianLightStemFilterFactory
`limit_token_count`	LimitTokenCountFilterFactory
`swedish_minimal_stem`	SwedishLightStemFilterFactory
`galician_minimal_stem`	GalicianMinimalStemFilterFactory
`portuguese_minimal_stem`	PortugueseMinimalStemFilterFactory
`bengali_normalization`	BengaliNormalizationFilterFactory
`galician_stem`	GalicianStemFilterFactory
`turkish_lower_case`	TurkishLowerCaseFilterFactory
`bengali_stem`	BengaliStemFilterFactory
`indic_normalization`	IndicNormalizationFilterFactory
`keep_word`	KeepWordFilterFactory
`drop_if_flagged`	DictionaryCompoundWordTokenFilterFactory
`latvian_stem`	LatvianStemFilterFactory
`portuguese_light_stem`	PortugueseLightStemFilterFactory
`apostrophe`	ApostropheFilterFactory
`arabic_stem`	ArabicStemFilterFactory
`delimited_term_frequency_token`	DelimitedTermFrequencyTokenFilterFactory
`irish_lower_case`	IrishLowerCaseFilterFactory
`edge_n_gram`	EdgeNGramFilterFactory
`german_light_stem`	GermanLightStemFilterFactory
`pattern_capture_group`	PatternCaptureGroupFilterFactory
`spanish_light_stem`	SpanishLightStemFilterFactory
`hindi_normalization`	HindiNormalizationFilterFactory
`norwegian_normalization`	NorwegianNormalizationFilterFactory
`shingle`	ShingleFilterFactory
`telugu_normalization`	SynonymGraphFilterFactory
`date_recognizer`	DateRecognizerFilterFactory
`n_gram`	NGramFilterFactory
`upper_case`	UpperCaseFilterFactory
`brazilian_stem`	BrazilianStemFilterFactory
`cjk_bigram`	CJKBigramFilterFactory
`truncate_token`	TruncateTokenFilterFactory
`greek_lower_case`	GreekLowerCaseFilterFactory
`length`	LengthFilterFactory
`arabic_normalization`	ArabicNormalizationFilterFactory
`portuguese_stem`	PortugueseStemFilterFactory
`elision`	ElisionFilterFactory
`korean_part_of_speech`	KoreanPartOfSpeechStopFilterFactory A token filter that removes tokens that match a set of part-of-speech tags
`korean_reading_form`	KoreanReadingFormFilterFactory A token filter that rewrites tokens written in Hanja to their Hangul form.
`korean_number`	KoreanNumberFilterFactory A token filter that normalizes Korean numbers to Arabic decimal numbers in half-width characters.
`stop`	A custom token filter that removes stop words from a token stream. Custom Stop Words

Supported Character Filters

This table lists the supported character filters, the name, and a link for the parameters.

"char_filters" (case-sensitive)	Lucene Link for Parameters
`persian`	PersianCharFilterFactory
`cjk_width`	CJKWidthCharFilterFactory
`html_strip`	HTMLStripCharFilterFactory
`pattern_replace`	PatternReplaceCharFilterFactory

Full Text VERSION 2 Custom Analyzers

On this page

Specify an Analyzer

Analyzers

Built-in Analyzers

Custom Analyzers

Common Tokenizers

Common Token Filters

Custom Stop Words

Common Character Filters

Example of Using Parameters

Stemming

NGrams

Examples

Example 1: Custom Analyzer with Whitespace tokenizer

Example 2: Custom Tokenizer, Character Filter, and Token Filter to Search for HTML Entities

Example 3: Custom Analyzer, standard Tokenizer, and Custom Token Filter (cjk_width) to Search Japanese Text

Example 4: Korean (nori) Analyzer

Example 5: Korean (nori) Analyzer with User Dictionary

Example 5a - Search for 수영장 with and without User Dictionary

Example 5b - Search for 수영 with and without User Dictionary

Example 6: Custom Analyzer, standard Tokenizer, Custom Token Filters (elision, snowball_porter) to Search Italian Text

Example 7: Custom Analyzer and N Gram Tokenizer

Example 8: Custom Analyzer, n_gram Tokenizer, Custom Character Filter (html_strip), and Custom Token Filter (lower_case) to Search for HTML entities

Example 9: Portuguese Analyzer with score

Example 10: Spanish Analyzer with custom stop words

Example 11: Standard Tokenizer with Custom Stop Words

Example 11a: No Stop Words

Example 11b: Default English Stop Words

Example 11c: Custom English Stop Words

Example 11d: Custom English Stop Words - Do Not Ignore Case

Example 12 - Use ngrams to Find a String that Contains a Substring

Supported Language Analyzers

Supported Tokenizers

Get Parameters

Example: Get parameters for the uax_url_email tokenizer

List of Supported Tokenizers

Supported Token Filters

Supported Character Filters

Was this article helpful?

On this page

Was this article helpful?

Example 3: Custom Analyzer, standard Tokenizer, and Custom Token Filter (`cjk_width`) to Search Japanese Text

Example 6: Custom Analyzer, standard Tokenizer, Custom Token Filters (`elision`, `snowball_porter`) to Search Italian Text

Example 8: Custom Analyzer, n_gram Tokenizer, Custom Character Filter (`html_strip`), and Custom Token Filter (`lower_case`) to Search for HTML entities

Example: Get parameters for the `uax_url_email` tokenizer