# Full Text VERSION 2 Custom Analyzers

SingleStore supports custom analyzers for full-text VERSION 2 search. Users can customize full-text search by:

* Using built-in analyzers for a variety of languages. The built-in analyzers can be customized with custom stop-word lists.
* Using custom analyzers in which a user can specify a tokenizer, optional token and character filters, and an optional stop-word list.

In general, an analyzer contains three components: a tokenizer, character filters, and token filters. An analyzer must have exactly one tokenizer; it can have zero or more character filters and zero or more token filters.

A tokenizer takes a stream of characters and breaks that stream into individual tokens, for example, by splitting on whitespace characters. A character filter takes the stream of text data and transforms it in a pre-defined way, for example, by removing all HTML tags. Token filters receive a stream of tokens and may add, change, or remove tokens, for example, by lowercasing all tokens or removing stop words.

Users can choose from a list of pre-configured analyzers and use them without any modifications. Users can also create their own analyzers by specifying a tokenizer, character filters, and token filters to obtain a fully customized search experience. 

Refer to [Working with Full-Text Search](https://docs.singlestore.com/db/v9.1/developer-resources/functional-extensions/working-with-full-text-search.md) for more information on full-text search.

## Specify an Analyzer

Specify an analyzer by passing an analyzer configuration in JSON format to `INDEX_OPTIONS`, which is a JSON string that contains the index configuration. In this JSON, the analyzer key is a string or a nested JSON value.

* Specify the name of a built-in analyzer (e.g., `standard` or `cjk`) as a string.
* Specify a customized built-in analyzer or a custom analyzer as a nested JSON value.

The three examples below show a built-in analyzer with no customizations, a built-in analyzer with a customized set of stop words, and a custom analyzer. A full set of examples can be found in [Examples](https://docs.singlestore.com/#UUID-cd288605-d189-0467-47a1-357217158b48.md). Refer to [Analyzers](https://docs.singlestore.com/#section-idm234616810166095.md) for details on analyzers.

Specify the built-in analyzer for Chinese, Japanese, and Korean characters, called the `cjk` analyzer, with no customizations.

```sql
CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{ "analyzer": "cjk"}'
);

```

Specify the built-in `cjk` analyzer with a customized set of stop words. Built-in analyzers can be customized with custom stop word lists; no other customizations for built-in analyzers are supported.

```sql
CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS 
        '{"analyzer": {
               "cjk": {
                    "stopset": [
                         "这",
                         "那"
                    ]
               }
          }
        }'
);
```

> **📝 Note**: In addition to the `cjk` analyzer, the Korean `nori` analyzer is also supported.

Specify a custom analyzer that uses the `whitespace` tokenizer, the `html_strip` character filter, and the `lower_case` token filter. The analyzer name must be `custom`. Additional character and token filters can be specified by adding entries to the `char_filters` and `token_filters` arrays.

```sql
CREATE TABLE t (
	title VARCHAR(200),
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{
             "analyzer": {
                 "custom": {
                      "tokenizer": "whitespace",
                      "char_filters": ["html_strip"],
                      "token_filters": ["lower_case"]
                 }
             }
         }'
);

```

## Analyzers

There are two types of analyzers: built-in analyzers and custom analyzers.

> **📝 Note**: The examples in this section show only the `INDEX_OPTIONS` string (JSON) and omit the rest of the index creation command.

## Built-in Analyzers

Built-in analyzers are pre-configured analyzers, including the standard analyzer and language-specific analyzers, that do not require configuration. Built-in analyzers may be customized with custom stop-word lists.

The default analyzer is the [Apache Lucene standard analyzer](https://lucene.apache.org/core/8_8_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html), which uses the Apache Lucene standard tokenizer, lowercase token filters, and no stop words.

Specify a built-in analyzer, without customizations, by specifying the name of the analyzer as the value of the `analyzer` key.

The following example specifies the use of the `spanish` language analyzer.

```sql
INDEX_OPTIONS '{"analyzer" : "spanish"}'
```

A custom stop word list can be specified for a built-in analyzer by specifying a `stopset` in the JSON as shown in the following example. A custom stop word list is the only customization supported for built-in analyzers.

The following example specifies a custom stop word list for the `spanish` analyzer.

The value of the `analyzer` key is a nested JSON value consisting of a key-value pair with key being the name of the analyzer (`spanish` in this example), and the value being another key-value pair consisting of the key `stopset`, and the value a JSON array of stop words.

```sql
INDEX_OPTIONS '{
     "analyzer": {
          "spanish": {
               "stopset": [
                    "el",
                    "la"
               ]
          }
     }
}'
```

SingleStore recommends using the default language analyzer, without stop word customization, in most cases, e.g. `'{"analyzer" : "catalan"}'`.

Refer to [Supported Language Analyzers](https://docs.singlestore.com/#section-idm234616824800932.md) for links to the default list of stop words for each analyzer.

## Custom Analyzers

Create a custom analyzer by using the analyzer name `custom` and by specifying a tokenizer and optional token and character filters.

A custom analyzer specifies:

* A *required* `tokenizer`: A tokenizer breaks up incoming text into tokens. In many cases, an analyzer will use a tokenizer as the first step in the analysis process. However, to modify text prior to tokenization, use `char_filters` (see below).
* An *optional array* of `token_filters`: A `token_filter` modifies tokens that have been created by the tokenizer. Common modifications performed by a `token_filter` are deletion, stemming, and case folding.
* An *optional array* of `char_filters`: A `char_filter` transforms the text before it is tokenized, while providing corrected character offsets to account for these modifications.

The example below shows the use of all three components, `tokenizer`, `char_filters`, and `token_filters`.

```sql
INDEX_OPTIONS '{
     "analyzer" : {
          "custom": {
               "tokenizer": "whitespace",
               "char_filters": ["html_strip"],
               "token_filters": ["lower_case"]
          }
     }
}'
```

Each of these three components (`tokenizer`, `char_filters`, `token_filters`) can be specified as a string with the name of the component or as a nested JSON with a configuration for the component.

The example below specifies a custom analyzer that uses the `whitespace` tokenizer with a maximum token length of 256 characters.

```sql
INDEX_OPTIONS '{
     "analyzer": {
          "custom": {
               "tokenizer": {
            	     "whitespace": {
                      "maxTokenLen": 256
            	     }
        	}
    	  }
     }
}'
```
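
Filters can be configured in the same way. The following is a minimal sketch, assuming the `shingle` parameters listed under Common Token Filters; it mixes a filter given by name (`lower_case`) with a filter given as nested JSON (`shingle`) in one `token_filters` array. The parameter values are illustrative only.

```sql
INDEX_OPTIONS '{
     "analyzer": {
          "custom": {
               "tokenizer": "standard",
               "token_filters": [
                    "lower_case",
                    {"shingle": {"minShingleSize": 2, "maxShingleSize": 3}}
               ]
          }
     }
}'
```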

## Common Tokenizers

Common tokenizers that are supported are listed in the table below. Refer to [Supported Tokenizers](https://docs.singlestore.com/#section-idm234616825064604.md) for a full list of supported tokenizers.

| **"tokenizer" (case-sensitive)** | **Parameters**                                                                                                                                                                                                                                          | **Description**                                                                                                                                                                                                                                                                                                                                                                                        |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `whitespace` | `rule` (Optional, string). Defaults to `"unicode"`. `maxTokenLen` (Optional, integer). Defaults to `256`. [WhitespaceTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.html) | Divides text at whitespace characters as defined by [Character.isWhitespace(int)](https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html?is-external=true#isWhitespace-int-). This definition excludes non-breaking spaces from whitespace characters. [WhitespaceTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html) |
| `standard` | `maxTokenLength` (Optional, integer). Defaults to `255`. [StandardTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html) | Implements Word Break rules from Unicode Text Segmentation: [Unicode Standard Annex #29](http://unicode.org/reports/tr29/). [StandardTokenizer](https://lucene.apache.org/core/8_3_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html?is-external=true) |
| `n_gram` | `minGramSize` (Optional, integer). Defaults to `1`. `maxGramSize` (Optional, integer). Defaults to `2`. [NGramTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizerFactory.html) | Tokenizes the input into n-grams of the specified size(s). [NGramTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html) |
| `uax_url_email` | `maxTokenLength` (Optional, integer). Defaults to `255`. [UAX29URLEmailTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerFactory.html) | Implements Word Break rules from Unicode Text Segmentation: [Unicode Standard Annex #29](http://unicode.org/reports/tr29/). URLs and email addresses are also tokenized. [UAX29URLEmailTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.html) |

## Common Token Filters

Common token filters that are supported are listed in the table below. Refer to [Supported Token Filters](https://docs.singlestore.com/#section-idm234616825771108.md) for a full list of supported token filters.

| **"token\_filters" (Case-Sensitive)** | **Parameters**                                                                                                                                                                                                                                                                                    | **Description**                                                                                                                                                                                                                             |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `shingle` | `minShingleSize` (Optional, integer). Defaults to `2`. `maxShingleSize` (Optional, integer). Defaults to `2`. [ShingleFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilterFactory.html) | Constructs shingles (token n-grams); that is, it creates combinations of tokens as a single token. [ShingleFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html) |
| `lower_case` | No parameters. [LowerCaseFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilterFactory.html) | Normalizes token text to lower case. [LowerCaseFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilter.html) |
| `snowball_porter` | `protected` (Optional, string). Defaults to `"protectedkeyword.txt"`. `language` (Optional, string). Defaults to `"English"`. [SnowballPorterFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballPorterFilterFactory.html) | Stems words using a Snowball-generated stemmer. Available stemmers are listed in `org.tartarus.snowball.ext`. [SnowballFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html) |
| `n_gram` | `minGramSize` (Optional, integer). Defaults to `1`. `maxGramSize` (Optional, integer). Defaults to `2`. `preserveOriginal` (Optional, boolean). Defaults to `"true"`. [NGramFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html) | Tokenizes the input into n-grams of the given size(s). [NGramTokenFilter](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenFilter.html) |
| `stop` | `words` (Optional, array of stop words). `ignoreCase` (Optional, boolean). If `true`, all words are lower-cased first. Defaults to `false`. | Custom stop words token filter. Removes stop words from a token stream. |

## Custom Stop Words

SingleStore provides a custom token filter named `stop` which allows a set of custom stop words to be specified. The stop token filter works with any custom analyzer.

The `stop` token filter has two parameters: `words` and `ignoreCase`.

* `words`: An optional parameter containing a list of stop words. The list of stop words must be specified as a JSON array.
* `ignoreCase`: An optional boolean parameter indicating if case should be ignored. If set to `true`, all words are lower-cased before they are compared. Defaults to `false`.

Sample syntax for this custom token filter is as follows. Refer to [Example 11: Standard Tokenizer with Custom Stop Words](https://docs.singlestore.com/db/v9.1/developer-resources/functional-extensions/full-text-version-2-custom-analyzers/#section-idm234955757677161.md) for additional examples of using the `stop` token filter.

```sql
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS
   '{"analyzer" :
      {"custom" :
        {"tokenizer" : "standard",
         "token_filters": [{"stop":
                            {"ignoreCase": false,
                             "words": ["the"]}}
                         ]}
     }}'
```
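
For context, the following is a minimal sketch of a complete `CREATE TABLE` statement that uses the `stop` token filter with `ignoreCase` set to `true`. The table name `stop_demo`, the column name `body`, and the stop word list are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table; only the INDEX_OPTIONS structure follows the syntax described above.
CREATE TABLE stop_demo (
    body VARCHAR(200),
    FULLTEXT USING VERSION 2 (body) INDEX_OPTIONS
    '{
        "analyzer": {
            "custom": {
                "tokenizer": "standard",
                "token_filters": [
                    {"stop": {"ignoreCase": true, "words": ["the", "a"]}}
                ]
            }
        }
    }'
);
```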

## Common Character Filters

Common character filters that are supported are listed in the table below. Refer to [Supported Character Filters](https://docs.singlestore.com/#section-idm234616826086158.md) for a full list of supported character filters.

| **"char\_filters" (case-sensitive)** | **Parameters (includes Lucene Link)**                                                                                                                                                                              | **Description (includes Lucene Link)**                                                                                                                                                          |
| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `html_strip` | `escapedTags` (Optional, string). Defaults to `"a, title"`. [HTMLStripCharFilterFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html) | Wraps another Reader and attempts to strip out HTML. [HTMLStripCharFilter](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilter.html) |
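
A character filter can also be given as nested JSON to pass parameters. The sketch below assumes that `escapedTags` accepts a comma-separated string in the same format as the documented default; the value `"b, i"` is a hypothetical choice, not a recommendation.

```sql
INDEX_OPTIONS '{
     "analyzer": {
          "custom": {
               "tokenizer": "whitespace",
               "char_filters": [
                    {"html_strip": {"escapedTags": "b, i"}}
               ]
          }
     }
}'
```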

## Custom Column Mappings

Custom column mappings allow columns in a table and fields in a JSON document to be indexed using custom analyzers. This functionality allows you to index columns or fields in different languages with analyzers specific to each language.

For the most meaningful search results, it is important to use an analyzer specific to a language. Language-specific analyzers allow the use of language-aware stop lists, stemming, and word breaking. Applying a generic analyzer to text from multiple languages may result in lower-quality search results.

## Use Per-Column Analyzers

Per-column analyzers are defined using `INDEX_OPTIONS` in the table or index creation command. You can specify a custom analyzer for each column in a table and for keypaths in JSON and BSON columns.

The following command creates a table with a full-text index that indexes the `french_content` column with the `french` analyzer, the `english_content` column with the `english` analyzer, and all other columns (`title`) with the `standard` analyzer.

```sql
CREATE TABLE custom_col_analyzers (
    title VARCHAR(200),
    french_content VARCHAR(200),
    english_content VARCHAR(200),
    FULLTEXT USING VERSION 2 (title, french_content, english_content)
    INDEX_OPTIONS
        '{
            "analyzer": "standard",
            "mappings": {
                "french_content": {
                    "analyzer": "french"
                 },
                 "english_content": {
                     "analyzer": "english"
                 }
             }
         }'
);

```
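
Once the table exists, each column can be searched by name, and the query term is matched against tokens produced by that column's analyzer. The following sketch reuses the query syntax from the examples on this page; the inserted row and the search terms are made up for illustration, and no particular result set is implied.

```sql
-- Illustrative data only.
INSERT INTO custom_col_analyzers (title, french_content, english_content) VALUES
('Houses', 'Les maisons sont grandes.', 'The houses are large.');

OPTIMIZE TABLE custom_col_analyzers FLUSH;

-- Searches the column indexed with the french analyzer.
SELECT title
FROM custom_col_analyzers
WHERE MATCH(TABLE custom_col_analyzers) AGAINST ("french_content:maisons");

-- Searches the column indexed with the english analyzer.
SELECT title
FROM custom_col_analyzers
WHERE MATCH(TABLE custom_col_analyzers) AGAINST ("english_content:houses");
```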

## JSON and BSON Keypath Analyzers

Analyzers for keypaths in JSON and BSON columns are defined using `INDEX_OPTIONS` in the table or index creation command.

The following command specifies that the `json_column$english_content` field be indexed using the `english` analyzer, and the `json_column$french_content` field be indexed using the `french` analyzer. The `mappings` object provides this specification.

```sql
CREATE TABLE json_keypath_analyzers (
    json_column JSON,
    FULLTEXT USING VERSION 2 KEY(json_column)
    INDEX_OPTIONS '{
        "mappings": {
            "json_column$english_content": {
                "analyzer": "english"
             },
            "json_column$french_content": {
                 "analyzer": "french"
             }
        }
    }'
);

```

## Example of Using Parameters

Specify a default `uax_url_email` tokenizer:

```sql
INDEX_OPTIONS '{
     "analyzer": {
           "custom": {
        	"tokenizer": "uax_url_email"
    	   }
     }
}'
```

Specify a `uax_url_email` tokenizer with custom parameters:

```sql
INDEX_OPTIONS '{
     "analyzer": {
    	  "custom": {
               "tokenizer": {
                    "uax_url_email" : {
	                 "maxTokenLength": 300
                    }
                }
    	  }
     }
}'
```

## Stemming

Stemming transforms words to their root form, often by removing suffixes and prefixes. In English, for example, the words "dressing" and "dressed" can be stemmed to "dress". This allows a search for one form of a verb (e.g. "dressing") to return documents containing other forms of the verb (e.g. "dressed" or "dress"). Stemming is language specific.

In SingleStore, stemming can be handled in two ways:

* Use a [built-in analyzer](https://docs.singlestore.com/#section-idm234616818255209.md) available from JLucene that incorporates stemming.

  * Many of the language-specific analyzers from JLucene do stem; however, the `standard` analyzer from JLucene does not stem.
* Use a [custom analyzer](https://docs.singlestore.com/#section-idm234616823004247.md) with a token filter such as `elision` or `snowball_porter` to customize stemming.

The following `CREATE TABLE` statement creates a full-text index using the `spanish` analyzer, which stems for the Spanish language.

```sql
CREATE TABLE spanish_lang (
	content VARCHAR(200),
	FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{ "analyzer": "spanish"}'
);

```

The following `CREATE TABLE` statement uses a custom analyzer with custom token filters to stem for Italian text. Refer to [Example 6](https://docs.singlestore.com/db/v9.1/developer-resources/functional-extensions/full-text-version-2-custom-analyzers/#section-idm234616974238486.md) for the full example.

```sql
CREATE TABLE italian_architecture (
  architecture VARCHAR(400),
  description VARCHAR(400),
  SORT KEY (architecture),
  FULLTEXT USING VERSION 2 KEY(description)
  INDEX_OPTIONS '{"analyzer" :
        {"custom" : {"tokenizer" : "standard",
                     "token_filters": ["elision",
                                      {"snowball_porter" : {"language": "Italian"}}]}}}'
);

```
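
To see the effect of stemming described at the start of this section, a small experiment along the following lines can help. This is a sketch only: the table name `stemming_demo` and the rows are hypothetical, and it assumes the built-in `english` analyzer stems "dressing" and "dressed" to a common root. If it does, the query would be expected to match rows containing other forms of the word, but the output is not shown here because it has not been verified.

```sql
-- Hypothetical table and data for experimenting with stemming.
CREATE TABLE stemming_demo (
    content VARCHAR(200),
    FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
        '{"analyzer": "english"}'
);

INSERT INTO stemming_demo (content) VALUES
('She is dressing for the party.'),
('He dressed quickly.'),
('The dress is red.');

OPTIMIZE TABLE stemming_demo FLUSH;

-- Search for one form of the word and compare which rows are returned.
SELECT content
FROM stemming_demo
WHERE MATCH(TABLE stemming_demo) AGAINST ("content:dressing");
```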

## NGrams

NGram tokenizers split words into small pieces and are good for fast "fuzzy-style" matching using a full-text index. The minimum and maximum gram lengths are customizable. Refer to [Example 7](https://docs.singlestore.com/db/v9.1/developer-resources/functional-extensions/full-text-version-2-custom-analyzers/#section-idm234616974310271.md), [Example 8](https://docs.singlestore.com/db/v9.1/developer-resources/functional-extensions/full-text-version-2-custom-analyzers/#section-idm234616976167337.md), and [Example 12](https://docs.singlestore.com/db/v9.1/developer-resources/functional-extensions/full-text-version-2-custom-analyzers/#section-idm235055586170482.md) for examples of using an n-gram tokenizer.
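
As a minimal sketch, the following index definition configures the `n_gram` tokenizer with the `minGramSize` and `maxGramSize` parameters listed under Common Tokenizers. The table name `ngram_demo`, the column name `name`, and the gram sizes are hypothetical choices for illustration.

```sql
-- Hypothetical table. With gram sizes 2 to 3, a word such as "data"
-- is split into grams such as "da", "dat", "at", "ata", and "ta".
CREATE TABLE ngram_demo (
    name VARCHAR(200),
    FULLTEXT USING VERSION 2 (name) INDEX_OPTIONS
    '{
        "analyzer": {
            "custom": {
                "tokenizer": {
                    "n_gram": {
                        "minGramSize": 2,
                        "maxGramSize": 3
                    }
                }
            }
        }
    }'
);
```

Smaller gram sizes tend to match more loosely and produce more tokens per word, so the range is a trade-off between recall and index size.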

## Debug Full-Text Index Tokens

The tokens that will be generated for a full-text index can be viewed using the [ANALYZE FULLTEXT](https://docs.singlestore.com/db/v9.1/reference/sql-reference/full-text-search-functions/analyze-fulltext.md) command. Use the specification from the `analyzer` key of `INDEX_OPTIONS` as the `OPTIONS` for the `ANALYZE FULLTEXT` command.

## Examples

## Example 1: Custom Analyzer with Whitespace Tokenizer

Use a `custom` analyzer and a `whitespace` tokenizer to search for text with a hyphen in queries.

Create a table, insert data, and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE medium_articles (
   title VARCHAR(200),
   summary TEXT,
   FULLTEXT USING VERSION 2 (summary) INDEX_OPTIONS
   '{
       "analyzer": {
           "custom": {
               "tokenizer": "whitespace"
           }
       }
   }'
);

INSERT INTO medium_articles (title, summary) VALUES
('Build Real-Time Multimodal RAG Applications Using SingleStore!','This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses.'),
('Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case','This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency.'),
('Scaling RAG from POC to Production','This guide explains Retrieval-Augmented Generation (RAG) for building reliable, context-aware applications using large language models (LLMs) and emphasizes the importance of scaling from proof of concept to production.'),
('Tech Stack For Production-Ready LLM Applications In 2024','This guide reviews preferred tools for the entire LLM app development lifecycle, emphasizing simplicity and ease of use in building scalable AI applications.'),
('LangGraph + Gemini Pro + Custom Tool + Streamlit = Multi-Agent Application Development','This guide teaches you to create a chatbot using LangGraph and Streamlit, leveraging LangChain for building stateful multi-actor applications that respond to user support requests.');

OPTIMIZE TABLE medium_articles FLUSH;
```

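Observe the difference between the results of the two search queries below. Because the `whitespace` tokenizer splits text only at whitespace, `multi-modal` is indexed as a single token that is distinct from `multimodal`.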

```sql
SELECT * 
FROM medium_articles 
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multimodal");


```

```output

+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                          | summary                                                                                                                                                                        |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Build Real-Time Multimodal RAG Applications Using SingleStore! | This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses. |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

```

```sql
SELECT * 
FROM medium_articles 
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multi-modal");


```

```output

+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                                    | summary                                                                                                                                                                     |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case | This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency. |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

```

## Example 2: Custom Tokenizer, Character Filter, and Token Filter to Search for HTML Entities

Use a `custom` analyzer, a `whitespace` tokenizer, `html_strip` as a character filter, and `lower_case` as a token filter to search for HTML entities in queries.

A character filter receives the original text data and converts it into a predefined format. A token filter receives a stream of tokens and can add, change, or remove tokens as needed.

In this example, the `html_strip` character filter removes HTML tags and decodes HTML entities such as `&apos;`, and the `lower_case` token filter lowercases the tokens.

Create a table, insert data, and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE html_table (
   title VARCHAR(200),
   content VARCHAR(200),
   FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
  '{
     "analyzer": {
         "custom": {"char_filters": ["html_strip"],
                    "tokenizer": "whitespace",
                    "token_filters":["lower_case"]
                    }
     }
  }'
);

INSERT INTO html_table (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');

OPTIMIZE TABLE html_table FLUSH;

```

Search for text that contains an HTML entity, and observe the result of the search query.

```sql
SELECT * 
FROM html_table 
WHERE MATCH(TABLE html_table) AGAINST("content:we're");

```

```output

+---------------+----------------------------------------------------------------+
| title         | content                                                        |
+---------------+----------------------------------------------------------------+
| Success Story | Our team has achieved great things &amp; we&apos;re proud!</p> |
| Exciting News | We&apos;re thrilled to announce our new project!</p>           |
| Future Goals  | We&apos;re looking forward to achieving even more!</p>         |
+---------------+----------------------------------------------------------------+

```

## Example 3: Custom Analyzer, standard Tokenizer, and Custom Token Filter (`cjk_width`) to Search Japanese Text

Use a `custom` analyzer, a `standard` tokenizer, and `cjk_width` as a token filter to search for Japanese text in queries.

In this example, `cjk_width` as a token filter normalizes the width differences in CJK (Chinese, Japanese, and Korean) characters.
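As a rough illustration of width normalization, the Python sketch below folds full-width Latin characters to their ASCII forms and half-width Katakana to full-width Katakana. It uses Unicode NFKC normalization as a stand-in; NFKC changes more than width, so treat the hypothetical `fold_width` helper as an approximation of the idea, not as the behavior of `cjk_width` itself.

```python
import unicodedata


def fold_width(token):
    """Approximate width folding with Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", token)


# Full-width Latin letters become plain ASCII letters.
print(fold_width("ＳＱＬ"))
# Half-width Katakana becomes full-width Katakana.
print(fold_width("ｶﾀｶﾅ"))
```

With this kind of folding, text stored in one width and a query typed in another width reduce to the same token.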

Create a table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE japanese_novels (
 title VARCHAR(200),
 content VARCHAR(200),
 FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
'{
   "analyzer": {
       "custom": {"tokenizer": "standard",
                  "token_filters":["cjk_width"]
                  }
   }
}'
);

INSERT INTO japanese_novels (title, content) VALUES
('ノルウェイの森', '村上春樹の代表作で、愛と喪失をテーマにしています。'),
('吾輩は猫である', '夏目漱石による作品で、猫の視点から人間社会を描いています。'),
('雪国', '川端康成の作品で、美しい雪景色と切ない恋を描いています。'),
('千と千尋の神隠し', '宮崎駿の作品で、少女が異世界で成長する物語です。'),
('コンビニ人間', '村田沙耶香の作品で、現代社会の孤独と適応を描いています。');

OPTIMIZE TABLE japanese_novels FLUSH;
```

Observe the result of the search query for the Japanese text below.

```sql
SELECT * 
FROM japanese_novels 
WHERE MATCH(TABLE japanese_novels) AGAINST("content: 夏");


```

```output

+-----------------------+-----------------------------------------------------------------------------------------+
| title                 | content                                                                                 |
+-----------------------+-----------------------------------------------------------------------------------------+
| 吾輩は猫である        | 夏目漱石による作品で、猫の視点から人間社会を描いています。                              |
+-----------------------+-----------------------------------------------------------------------------------------+
```

## Example 4: Korean (nori) Analyzer

Use the `korean` analyzer to search for Korean text in queries. This analyzer is also known as the nori analyzer.

Create a table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE k_drama (
 genre VARCHAR(200),
 movie_name TEXT,
 cast TEXT,
 FULLTEXT USING VERSION 2 (genre) INDEX_OPTIONS
 '{
     "analyzer": "korean"
 }'
);

INSERT INTO k_drama (genre, movie_name, cast)
VALUES
 ('로맨스', '사랑의 불시착', '현빈, 손예진'),
 ('액션, 스릴러', '빈센조', '송중기, 전여빈'),
 ('드라마, 로맨스', '도깨비', '공유, 김고은'),
 ('사극, 드라마', '미스터 션샤인', '이병헌, 김태리'),
 ('코미디, 로맨스', '김비서가 왜 그럴까', '박서준, 박민영');

OPTIMIZE TABLE k_drama FLUSH;

```

Observe the result of the search query for the Korean text below.

```sql
SELECT * FROM k_drama WHERE MATCH(TABLE k_drama) AGAINST("genre:로맨스");


```

```output

+----------------------+----------------------------+----------------------+
| genre                | movie_name                 | cast                 |
+----------------------+----------------------------+----------------------+
| 드라마, 로맨스          | 도깨비                       | 공유, 김고은            |
| 코미디, 로맨스          | 김비서가 왜 그럴까              | 박서준, 박민영          |
| 로맨스                | 사랑의 불시착                  | 현빈, 손예진            |
+----------------------+----------------------------+----------------------+

```

## Example 5: Korean (nori) Analyzer with User Dictionary

Use the `korean` analyzer (also known as the nori analyzer) with and without a user dictionary to search for Korean text in queries. This example demonstrates how a user dictionary can be used to add words, specifically compound words, to the dictionary used by the `korean` analyzer. Note that the user dictionary is augmentative: it adds the specified words to the existing dictionary and does not replace it.

Create two tables with a full-text index that uses the `korean` analyzer, one with and one without a user dictionary. The Korean compound word 수영장, which translates to "swimming pool" in English, is inserted in the user dictionary.

```sql
CREATE TABLE korean_user_dict
    (id INT, 
     phrase VARCHAR(400), 
     FULLTEXT USING VERSION 2 (phrase) INDEX_OPTIONS 
       '{"analyzer" : 
         {"custom": 
          {"tokenizer": 
           {"korean": {"userDictionary": ["수영장"]}
           }
         }
        } 
}');
```

```sql
CREATE TABLE korean
     (id INT,
      phrase VARCHAR(400),
      FULLTEXT USING VERSION 2 (phrase) INDEX_OPTIONS
        '{
           "analyzer": "korean"
         }'
);
```

Insert data into the tables and optimize them to ensure all data is included in results.

```sql
INSERT INTO korean_user_dict 
    VALUES (1, "수영장"), (2, "수영"), (3, "장");
OPTIMIZE TABLE korean_user_dict FLUSH;

```

```sql
INSERT INTO korean 
    VALUES (1, "수영장"), (2, "수영"), (3, "장");
OPTIMIZE TABLE korean FLUSH;

```

Because 수영장 is in the user dictionary, it is tokenized as a single, atomic token when it is inserted into the `korean_user_dict` table, and it is not decomposed further.

In contrast, when 수영장 is inserted into the `korean` table, 수영장 is tokenized into 수영 and 장. By default, compound words are decomposed, and the original form is discarded (`decompoundMode` is `discard` by default).
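The difference between the two tables can be sketched in Python. The `COMPOUNDS` mapping, `tokenize`, and `matching_ids` below are hypothetical stand-ins for the analyzer's built-in dictionary and for index lookup; they model only the single compound word used in this example.

```python
# Hypothetical built-in dictionary: compound word -> its parts.
COMPOUNDS = {"수영장": ["수영", "장"]}


def tokenize(text, user_dictionary=()):
    """Keep user-dictionary words whole; otherwise decompose known compounds."""
    if text in user_dictionary:
        return [text]
    return COMPOUNDS.get(text, [text])


def matching_ids(rows, query, user_dictionary=()):
    """Return ids of rows that share at least one token with the query."""
    query_tokens = set(tokenize(query, user_dictionary))
    return [
        row_id
        for row_id, phrase in rows
        if query_tokens & set(tokenize(phrase, user_dictionary))
    ]


rows = [(1, "수영장"), (2, "수영"), (3, "장")]

# With the user dictionary, the compound is one token and matches only itself.
print(matching_ids(rows, "수영장", user_dictionary=["수영장"]))
# Without it, the compound is split and its parts match every row.
print(matching_ids(rows, "수영장"))
```

In this sketch the first call returns only row 1, and the second returns all three rows, which mirrors the tokenization described above.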

## Example 5a: Search for 수영장 with and without User Dictionary

The following queries search for the compound word 수영장 in both the `korean_user_dict` and `korean` tables. This and the following example demonstrate searching with and without a user dictionary.

The `ORDER BY` clause is included to ensure consistent ordering of results.

```sql
SELECT id, phrase, BM25(korean_user_dict, 'phrase:(수영장)') AS score 
FROM korean_user_dict 
ORDER BY id;

```

```output

+------+-----------------+--------------------+
| id   | phrase          | score              |
+------+-----------------+--------------------+
|    1 | 수영장            | 0.4458314776420593 |
|    2 | 수영              |                  0 |
|    3 | 장               |                  0 |
+------+-----------------+--------------------+

```

```sql
SELECT id, phrase, BM25(korean, 'phrase:(수영장)') AS score
FROM korean 
ORDER BY id;

```

```output

+------+-----------------+---------------------+
| id   | phrase          | score               |
+------+-----------------+---------------------+
|    1 | 수영장            | 0.35471969842910767 |
|    2 | 수영             | 0.23797652125358582 |
|    3 | 장               | 0.23797652125358582 |
+------+-----------------+---------------------+

```

When searching the `korean_user_dict` table, 수영장 matches only 수영장 and not 수영 or 장 because 수영장 is tokenized as a single, atomic token.

In contrast, when searching the `korean` table, 수영장 matches 수영장, 수영, and 장 because 수영장 is tokenized as two tokens: 수영 and 장. As a result, the query matches at least one token in each of the three rows.

Finally, the score for 수영장 is higher when searching the table `korean_user_dict` than when searching the table `korean` because 수영장 matches only a single row in `korean_user_dict`.

## Example 5b: Search for 수영 with and without User Dictionary

The following queries search for the word 수영 in the tables with and without a user dictionary. The `ORDER BY` clause is included to ensure consistent ordering of results.

```sql
SELECT id, phrase, BM25(korean_user_dict, 'phrase:(수영)') AS SCORE 
FROM korean_user_dict ORDER BY id;

```

```output

+------+-----------------+--------------------+
| id   | phrase          | SCORE              |
+------+-----------------+--------------------+
|    1 | 수영장            |                  0 |
|    2 | 수영             | 0.4458314776420593 |
|    3 | 장               |                  0 |
+------+-----------------+--------------------+

```

```sql
SELECT id, phrase, BM25(korean, 'phrase:(수영)') AS SCORE
FROM korean ORDER BY id;

```

```output

+------+-----------------+---------------------+
| id   | phrase          | SCORE               |
+------+-----------------+---------------------+
|    1 | 수영장            | 0.17735984921455383 |
|    2 | 수영             | 0.23797652125358582 |
|    3 | 장               |                   0 |
+------+-----------------+---------------------+


```

When searching the `korean_user_dict` table, 수영 matches only itself. As described earlier, 수영장 is in the user dictionary and is tokenized as a single token, so it does not match 수영.

In contrast, when searching the `korean` table, 수영 matches both 수영장 and 수영 because 수영장 is tokenized as two tokens: 수영 and 장.

In addition, the score for 수영 is higher when searching the table `korean_user_dict` than when searching the table `korean` because 수영 matches only a single row in `korean_user_dict`.
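The scores in Examples 5a and 5b can be reproduced with a short calculation. The Python sketch below assumes a Lucene-style BM25 formula with `k1 = 1.2` and `b = 0.75`; these parameters and the `bm25` helper are assumptions made for illustration and are not stated in this document. Each row is represented by the tokens described above.

```python
import math

K1 = 1.2   # assumed term-frequency saturation parameter
B = 0.75   # assumed length-normalization parameter


def bm25(query_tokens, docs, doc_index):
    """Score one document (a list of tokens) against the query tokens."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    doc = docs[doc_index]
    score = 0.0
    for term in query_tokens:
        tf = doc.count(term)
        if tf == 0:
            continue
        # Number of documents that contain the term.
        df = sum(1 for other in docs if term in other)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        length_norm = 1 - B + B * len(doc) / avg_len
        score += idf * tf / (tf + K1 * length_norm)
    return score


with_dict = [["수영장"], ["수영"], ["장"]]           # korean_user_dict rows
without_dict = [["수영", "장"], ["수영"], ["장"]]    # korean rows

print(bm25(["수영장"], with_dict, 0))            # about 0.4458
print(bm25(["수영", "장"], without_dict, 0))      # about 0.3547
print(bm25(["수영"], without_dict, 1))           # about 0.2380
```

Under these assumptions the computed values agree with the scores shown above. A term that occurs in fewer rows has a larger inverse document frequency, which is why the scores are higher for the table with the user dictionary.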

## Example 6: Custom Analyzer, standard Tokenizer, Custom Token Filters (`elision`, `snowball_porter`) to Search Italian Text

Use a `custom` analyzer, a `standard` tokenizer, and the `elision` and `snowball_porter` token filters configured for the `Italian` language to search for Italian text.

In this example, the `elision` token filter removes specific elisions from the input tokens. The `snowball_porter` token filter stems words using the Lucene Snowball stemmer and requires a language parameter to select the stemmer.
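The effect of elision removal can be sketched in Python. The article list below is a short, hypothetical sample, and the case-insensitive matching is an assumption of this sketch; the `elision` token filter may use a different list and different rules.

```python
# Hypothetical, abbreviated set of articles that can be elided.
ARTICLES = {"l", "un", "dell", "nell", "all", "dall", "sull"}
# Both the ASCII apostrophe and the typographic apostrophe (U+2019).
APOSTROPHES = "'\u2019"


def strip_elision(token):
    """Remove a leading elided article such as the l in l’architettura."""
    for index, char in enumerate(token):
        if char in APOSTROPHES:
            if token[:index].lower() in ARTICLES:
                return token[index + 1:]
            break
    return token


print(strip_elision("l’architettura"))
print(strip_elision("gotica"))
```

In this sketch, `l’architettura` is reduced to `architettura`, so it can match occurrences of the word with or without the elided article.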

Create a table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE italian_architecture (
   architecture VARCHAR(400),
   description VARCHAR(400),
   SORT KEY (architecture),
   FULLTEXT USING VERSION 2 KEY(description)
   INDEX_OPTIONS '{"analyzer" : 
         {"custom" : {"tokenizer" : "standard", 
                      "token_filters": ["elision", 
                                       {"snowball_porter" : {"language": "Italian"}}]}}}'
);

INSERT INTO italian_architecture (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');

OPTIMIZE TABLE italian_architecture FLUSH;
```

Observe the result of the search query for the Italian text below.

```sql
SELECT * 
FROM italian_architecture 
WHERE MATCH(TABLE italian_architecture) AGAINST("description:l’architettura");


```

```output

+-----------------+--------------------------------------------------------------------------------------------+
| architecture    | description                                                                                |
+-----------------+--------------------------------------------------------------------------------------------+
| Duomo di Milano | L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.   |
| Palazzo Ducale  | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.        |
+-----------------+--------------------------------------------------------------------------------------------+

```

For comparison, use a `custom` analyzer with a `standard` tokenizer and the `snowball_porter` token filter for the `Italian` language, but without the `elision` token filter, to search for the same Italian text.

Create a second table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE italian_architecture_2 (
   architecture VARCHAR(400),
   description VARCHAR(400),
   SORT KEY (architecture),
   FULLTEXT USING VERSION 2 KEY(description)
   INDEX_OPTIONS '{"analyzer" : 
         {"custom" : {"tokenizer" : "standard", 
                      "token_filters": [{"snowball_porter" : {"language": "Italian"}}]}}}'
);

INSERT INTO italian_architecture_2 (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');

OPTIMIZE TABLE italian_architecture_2 FLUSH;
```

Observe the result of the search query for the Italian text without the `elision` token filter below.

```sql
SELECT * 
FROM italian_architecture_2 
WHERE MATCH(TABLE italian_architecture_2) AGAINST("description:l’architettura");


```

```output

+----------------+---------------------------------------------------------------------------------------+
| architecture   | description                                                                           |
+----------------+---------------------------------------------------------------------------------------+
| Palazzo Ducale | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.   |
+----------------+---------------------------------------------------------------------------------------+

```

## Example 7: Custom Analyzer and n\_gram Tokenizer

Use a `custom` analyzer and an `n_gram` tokenizer to search for misspelled text in queries.
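The reason n-grams tolerate misspellings can be sketched in Python: a misspelled term still shares some short character sequences with the intended word. The helpers `char_ngrams` and `shared_ngrams` are hypothetical, the gram size of 2 is an assumption, and the count of shared n-grams is only a rough proxy for the relevance score that the database computes.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(query, text, n=2):
    """Count the n-grams that the query and the text have in common."""
    return len(char_ngrams(query, n) & char_ngrams(text, n))


pages = [
    "college.harvard.edu/admissions",
    "stanford.edu/admission/",
    "mitadmissions.org/",
    "admissions.caltech.edu/",
    "uchicago.edu/en/admissions",
]

# Rank pages by how many bigrams they share with the misspelled term.
ranked = sorted(pages, key=lambda page: shared_ngrams("cattec", page), reverse=True)
print(ranked[0])
```

In this sketch the misspelled term `cattec` shares the bigrams `ca`, `te`, and `ec` with `admissions.caltech.edu/`, more than with any other page, so that page ranks first.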

Create a table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE university
   (name VARCHAR(400),
   admission_page VARCHAR(400),
   SORT KEY (name),
   FULLTEXT USING VERSION 2 KEY(admission_page)
   INDEX_OPTIONS '{"analyzer" : {"custom" : {"tokenizer" : "n_gram"}}}');

INSERT INTO university (name, admission_page) VALUES
('Harvard University', 'college.harvard.edu/admissions'),
('Stanford University', 'stanford.edu/admission/'),
('Massachusetts Institute of Technology (MIT)', 'mitadmissions.org/'),
('California Institute of Technology (Caltech)', 'admissions.caltech.edu/'),
('University of Chicago', 'uchicago.edu/en/admissions');

OPTIMIZE TABLE university FLUSH;
```

The query below searches for the misspelled term `cattec`. Observe which rows are returned and compare their scores.

```sql
SELECT name, admission_page, MATCH(TABLE university) AGAINST("admission_page:cattec") AS score
FROM university
WHERE score
ORDER BY score DESC;


```

```output

+----------------------------------------------+--------------------------------+---------------------+
| name                                         | admission_page                 | score               |
+----------------------------------------------+--------------------------------+---------------------+
| California Institute of Technology (Caltech) | admissions.caltech.edu/        |  2.4422175884246826 |
| University of Chicago                        | uchicago.edu/en/admissions     |  0.8550153970718384 |
| Harvard University                           | college.harvard.edu/admissions |  0.6825864911079407 |
| Stanford University                          | stanford.edu/admission/        |  0.5768249034881592 |
| Massachusetts Institute of Technology (MIT)  | mitadmissions.org/             | 0.26201900839805603 |
+----------------------------------------------+--------------------------------+---------------------+

```

## Example 8: Custom Analyzer, n\_gram Tokenizer, Custom Character Filter (`html_strip`), and Custom Token Filter (`lower_case`) to Search for HTML Entities

Use a `custom` analyzer, an `n_gram` tokenizer, `html_strip` as a character filter, and `lower_case` as a token filter to search for HTML entities in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE html_table_n_gram (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
 '{
    "analyzer": {
        "custom": {"char_filters": ["html_strip"],
                   "tokenizer": "n_gram",
                   "token_filters":["lower_case"]
                   }
    }
 }'
);

INSERT INTO html_table_n_gram (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');

OPTIMIZE TABLE html_table_n_gram FLUSH;
```

The query below searches for a misspelled HTML entity. Observe which rows are returned and compare their scores.

```sql
SELECT title, content, MATCH(TABLE html_table_n_gram) AGAINST("content:I',") AS score
FROM html_table_n_gram
WHERE score
ORDER BY score DESC;


```

```output

+------------------+--------------------------------------------------------------------+---------------------+
| title            | content                                                            | score               |
+------------------+--------------------------------------------------------------------+---------------------+
| Learning Journey | Learning is a never-ending journey &amp; I&apos;m excited!</p>     |  0.5430432558059692 |
| Success Story    | Our team has achieved great things &amp; we&apos;re proud!</p>     | 0.31375283002853394 |
| Exciting News    | We&apos;re thrilled to announce our new project!</p>               | 0.26527124643325806 |
| Future Goals     | We&apos;re looking forward to achieving even more!</p>             |  0.2177681028842926 |
| Grateful Heart   | Thank you for being a part of our journey &amp; supporting us!</p> |  0.1819886565208435 |
+------------------+--------------------------------------------------------------------+---------------------+

```

## Example 9: Portuguese Analyzer with Score

Use the `portuguese` analyzer to search for Portuguese text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE portuguese_news (
  headline VARCHAR(200),
  content TEXT,
  FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
  '{
      "analyzer": "portuguese"
  }'
);

INSERT INTO portuguese_news (headline, content) VALUES
('Cenário Econômico Brasileiro', 'O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.'),
('Mercado de Ações em Alta', 'As ações brasileiras registraram ganhos significativos, impulsionadas por resultados financeiros positivos de grandes empresas.'),
('Nova Política Monetária do Banco Central', 'O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.'),
('Investimentos Estrangeiros no Brasil', 'O país atraiu um aumento de investimentos estrangeiros diretos, especialmente em setores de tecnologia e energia renovável.'),
('Tendências do Mercado Imobiliário', 'O mercado imobiliário brasileiro mostra sinais de recuperação, com aumento nas vendas de imóveis e novos lançamentos.');

OPTIMIZE TABLE portuguese_news FLUSH;
```

Observe the result of the search query for the Portuguese text and compare the search result with the score below.

```sql
SELECT content, MATCH(TABLE portuguese_news) AGAINST ("content:Brasil") AS score
FROM portuguese_news
WHERE score
ORDER BY score DESC;


```

```output

+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| content                                                                                                                             | score               |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.                                  | 0.22189012169837952 |
| O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.       |  0.2059776782989502 |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
```

## Example 10: Spanish Analyzer with Custom Stop Words

Use the `spanish` analyzer with custom stop words to search for Spanish text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

```sql
CREATE TABLE spanish_news (
 headline VARCHAR(200),
 content TEXT,
 FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
 '{
     "analyzer": {"spanish": {"stopset": ["descubrimiento", "tratamiento", "nuevo"]}}
 }'
);

INSERT INTO spanish_news (headline, content) VALUES
('Descubrimiento de un nuevo tratamiento para la diabetes', 'Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.'),
('Avances en la detección temprana del cáncer', 'Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.'),
('Nuevo enfoque para tratar enfermedades cardíacas', 'Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.'),
('Investigación sobre un gen relacionado con el Alzheimer', 'Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.'),
('Desarrollo de una vacuna contra COVID-19', 'Un equipo de investigadores ha anunciado resultados prometedores en la efectividad de una nueva vacuna contra COVID-19.');

OPTIMIZE TABLE spanish_news FLUSH;
```

Observe the results of the two search queries below: one for a custom stop word (`nuevo`) and another for a default Spanish stop word (`el`). The custom stop words defined above replace the default stop words, so the query for `nuevo` returns no rows while the query for `el` returns results.

```sql
SELECT * 
FROM spanish_news 
WHERE MATCH(TABLE spanish_news) AGAINST("content:nuevo");


```

```output

Empty set 
```

```sql
SELECT * 
FROM spanish_news 
WHERE MATCH(TABLE spanish_news) AGAINST("content:el");


```

```output

+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| headline                                                 | content                                                                                                                                              |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Investigación sobre un gen relacionado con el Alzheimer  | Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.   |
| Avances en la detección temprana del cáncer              | Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.                             |
| Descubrimiento de un nuevo tratamiento para la diabetes  | Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.                         |
| Nuevo enfoque para tratar enfermedades cardíacas         | Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.                                         |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
```

## Example 11: Standard Tokenizer with Custom Stop Words

Use a `standard` tokenizer with custom stop words.

Create a procedure to insert data into a table named `t`. This procedure will be used to generate data for the examples in this section.

```sql
DELIMITER //

CREATE OR REPLACE PROCEDURE insert_flush()
AS
BEGIN
   INSERT INTO t VALUES( 1, "On April 23, 2013, SingleStore launched its first generally available version of the database to the public as MemSQL."),

   (2, "Early versions only supported row-oriented tables, and were highly optimized for cases where all data can fit within main memory."),

   (3, "This design was based on the idea that the cost of RAM would continue to decrease exponentially over time, in a trend similar to Moore's law."),

   (4, "This would eventually allow most use cases for database systems to store their data exclusively in memory not on disk."),

   (5, "Shortly after launch, MemSQL added general support for an on-disk column-based storage format to work alongside the in-memory rowstore."),

   (6, "The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads."),

   (7, "Thus, over time, MemSQL's columnstore became a major focus and a crucial feature for customers."),

   (8, "On October 27, 2020, MemSQL rebranded to SingleStore to reflect a shift in focus away from exclusively in-memory workloads."),

   (9, "The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases."),

   (10, "In its current product release, v.7.5, SingleStore became the first and only database to combine separation of storage and compute plus system of record into a single platform."),

   (11, "Headquartered in San Francisco, California, in June 2021 singlestore.com opened an office in Raleigh, North Carolina. As part of the office opening, SingleStore launched Launch Pad, a center for innovation to incubate and prototype solutions."),

   (12, "Its other offices include Sunnyvale, California, seattle@singlestore.com, Washington, and Lisbon, Portugal."),

   (13, "seattle@singlestore.com");

   OPTIMIZE TABLE t FLUSH;

END;//

DELIMITER ;
```

## Example 11a: No Stop Words

Create a table without stop words and insert data into that table.

```sql
CREATE TABLE t ( 
  id INT, text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY(text) 
    INDEX_OPTIONS 
      '{"analyzer" : 
         {"custom" : {"tokenizer" : "standard"}}
       }');

CALL insert_flush();
```

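The two queries below search the table for the words `The` and `the`, respectively. Because no stop words are configured, both queries return results. The result sets differ because no `lower_case` token filter is configured, so matching is case-sensitive.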

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

*** 1. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.5822426080703735
*** 2. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.4953887462615967
2 rows in set (0.09 sec)

```

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

*** 1. row ***
 text: Shortly after launch, MemSQL added general support for an on-disk column-based storage format to work alongside the in-memory rowstore.
SCORE: 0.5361358523368835
*** 2. row ***
 text: On April 23, 2013, SingleStore launched its first generally available version of the database to the public as MemSQL.
SCORE: 0.216837078332901
*** 3. row ***
 text: This design was based on the idea that the cost of RAM would continue to decrease exponentially over time, in a trend similar to Moore's law.
SCORE: 0.19964565336704254
*** 4. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.1568010002374649
*** 5. row ***
 text: In its current product release, v.7.5, SingleStore became the first and only database to combine separation of storage and compute plus system of record into a single platform.
SCORE: 0.13726447522640228
*** 6. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.1351594626903534
*** 7. row ***
 text: Headquartered in San Francisco, California, in June 2021 singlestore.com opened an office in Raleigh, North Carolina. As part of the office opening, SingleStore launched Launch Pad, a center for innovation to incubate and prototype solutions.
SCORE: 0.12553386390209198

```

Drop the table so that the table name `t` can be reused in the subsequent examples.

```sql
DROP TABLE t;
```

## Example 11b: Default English Stop Words

Create a table that uses the default English stop words and query that table for the words `The` and `the`. Both `The` and `the` are treated as default English stop words, so neither query returns results.

```sql
CREATE TABLE t (
  id INT, text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS 
   '{"analyzer" : 
      {"custom" : 
        {"tokenizer" : "standard", "token_filters": ["stop"]}
     }}');

CALL insert_flush();
```

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

Empty set (0.01 sec)

```

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

Empty set (0.01 sec)
```

```sql
DROP TABLE t;
```

## Example 11c: Custom English Stop Words

Create a table that uses custom English stop words, with the word `the` defined as a stop word, and query the table for the words `The` and `the`. By default, the custom stop words token filter ignores case, so no results are returned for either query.

```sql
CREATE TABLE t (
  id INT, text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS 
   '{"analyzer" : 
      {"custom" : 
        {"tokenizer" : "standard", "token_filters": [{"stop": {"words": ["the"]}}]}
     }}');

CALL insert_flush();
```

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

Empty set (0.06 sec)

```

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

Empty set (0.05 sec)

```

```sql
DROP TABLE t;
```

## Example 11d: Custom English Stop Words - Do Not Ignore Case

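In the following example, the stop words token filter is configured with `ignoreCase` set to `false`, so case is not ignored. Only the lowercase word `the` is removed as a stop word. As a result, the first query (`The`) returns results, but the second query (`the`) does not.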

```sql
CREATE TABLE t (
  id INT, text VARCHAR(400), 
  SORT KEY (id), 
  FULLTEXT USING VERSION 2 KEY (text) INDEX_OPTIONS 
   '{"analyzer" : 
      {"custom" : 
        {"tokenizer" : "standard", 
          "token_filters": [{"stop": {"ignoreCase": false,"words": ["the"]}}]}
     }}');
  
CALL insert_flush();

```

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:The') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

*** 1. row ***
 text: The new name highlights the goal of achieving a universal storage format capable of supporting both transactional and analytical use cases.
SCORE: 0.7002022862434387
*** 2. row ***
 text: The decreases in cost of memory slowed over time, and the market for purely in-memory database systems largely failed to materialize, with increasing demand for disk-based OLAP workloads.
SCORE: 0.6494264602661133
2 rows in set (0.06 sec)

```

```sql
SELECT text, MATCH(TABLE t) AGAINST ('text:the') AS SCORE 
FROM t 
WHERE score 
ORDER BY score DESC;

```

```output

Empty set (0.03 sec)

```

```sql
DROP TABLE t;
```
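The stop-word behavior in Examples 11c and 11d can be summarized with a small Python sketch. The `stop_filter` function is a hypothetical stand-in for the `stop` token filter, and its `ignore_case` argument mirrors the `ignoreCase` option; it is meant only to illustrate the logic, not to reproduce the filter's implementation.

```python
def stop_filter(tokens, stop_words, ignore_case=True):
    """Drop tokens that appear in stop_words, optionally ignoring case."""
    if ignore_case:
        lowered = {word.lower() for word in stop_words}
        return [token for token in tokens if token.lower() not in lowered]
    blocked = set(stop_words)
    return [token for token in tokens if token not in blocked]


tokens = ["The", "new", "name", "highlights", "the", "goal"]

# Case ignored (as in Example 11c): both "The" and "the" are removed.
print(stop_filter(tokens, ["the"]))
# Case respected (as in Example 11d): only "the" is removed, "The" stays.
print(stop_filter(tokens, ["the"], ignore_case=False))
```

Because `The` survives filtering only when case is respected, a query for `The` can match rows in Example 11d but not in Example 11c.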

## Example 12: Use ngrams to Find a String that Contains a Substring

The `n_gram` tokenizer can be used to search for strings that contain a specific substring.

That is, you can use ngrams to do a search that is equivalent to having wildcards at the beginning and end of a search term, as shown in the following example.

Create a table with a full-text index that uses the `n_gram` tokenizer and insert data into that table.

```sql
CREATE TABLE university
   (name VARCHAR(400),
   admission_page VARCHAR(400),
   SORT KEY (name),
   FULLTEXT USING VERSION 2 KEY(admission_page)
   INDEX_OPTIONS '{"analyzer" : {"custom" : {"tokenizer" : "n_gram"}}}');

INSERT INTO university (name, admission_page) VALUES
('Harvard University', 'college.harvard.edu/admissions'),
('Stanford University', 'stanford.edu/admission/'),
('Massachusetts Institute of Technology (MIT)', 'mitadmissions.org/'),
('California Institute of Technology (Caltech)', 'admissions.caltech.edu/'),
('University of Chicago', 'uchicago.edu/en/admissions');

OPTIMIZE TABLE university FLUSH;
```

The following statements and query perform the equivalent of a match against `%arvar%`. The substring `arvar` is matched because the `n_gram` tokenizer splits text into short overlapping character sequences (n-grams), so a search term can match in the middle of a word. As a result, the row for Harvard, whose `admission_page` contains `arvar`, receives the highest score.

```sql
SET sql_mode = pipes_as_concat;
SET @q = "admission_page:" || "arvar";

SELECT *, MATCH(TABLE university) AGAINST(@q) as score
FROM university
WHERE MATCH(TABLE university) AGAINST(@q)
ORDER BY score DESC;

```

```output

+----------------------------------------------+--------------------------------+---------------------+
| name                                         | admission_page                 | score               |
+----------------------------------------------+--------------------------------+---------------------+
| Harvard University                           | college.harvard.edu/admissions |   2.921722650527954 |
| Massachusetts Institute of Technology (MIT)  | mitadmissions.org/             |  0.5870963335037231 |
| Stanford University                          | stanford.edu/admission/        |  0.5820327997207642 |
| University of Chicago                        | uchicago.edu/en/admissions     |  0.2338818907737732 |
| California Institute of Technology (Caltech) | admissions.caltech.edu/        | 0.16432610154151917 |
+----------------------------------------------+--------------------------------+---------------------+

```
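
To see why every row receives a score while the Harvard row ranks first, it can help to look at the n-grams themselves. The following Python sketch is an illustration only: the gram sizes (1 to 2 characters) are an assumption, the helper names `ngrams` and `shared_fraction` are hypothetical, and the overlap fraction is not the relevance score that SingleStore computes.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Return all substrings of length min_gram..max_gram (sizes are assumed)."""
    return [
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    ]


def shared_fraction(query, document):
    """Fraction of the query's distinct n-grams that also occur in the document."""
    query_grams = set(ngrams(query))
    document_grams = set(ngrams(document))
    return len(query_grams & document_grams) / len(query_grams)


pages = [
    "college.harvard.edu/admissions",
    "stanford.edu/admission/",
    "mitadmissions.org/",
    "admissions.caltech.edu/",
    "uchicago.edu/en/admissions",
]

# Compare the n-grams of the search term with the n-grams of each page.
overlap = {page: shared_fraction("arvar", page) for page in pages}
for page, fraction in sorted(overlap.items(), key=lambda item: -item[1]):
    print(f"{fraction:.2f}  {page}")
```

Under these assumptions, every page shares at least one n-gram with `arvar` (for example, the single character `a`), but only the Harvard page contains all of them.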

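In addition, you can combine a common table expression (CTE) with `LIKE` to obtain partial string and substring matches with the speed of a full-text index: the full-text search selects the highest-scoring candidate rows, and `LIKE` is then applied only to those candidates. The query below runs faster than a query with only a `LIKE` expression.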

```sql
WITH matches AS (
    SELECT *, MATCH(TABLE university) AGAINST(@q) AS score
    FROM university
    WHERE MATCH(TABLE university) AGAINST(@q)
    ORDER BY score DESC
    LIMIT 10
)
SELECT * 
FROM matches
WHERE admission_page LIKE '%arvar%';

```

```output

+--------------------+--------------------------------+-------------------+
| name               | admission_page                 | score             |
+--------------------+--------------------------------+-------------------+
| Harvard University | college.harvard.edu/admissions | 2.921722650527954 |
+--------------------+--------------------------------+-------------------+

```
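
The candidate-then-verify pattern used by the CTE query can be sketched in a few lines of Python. This is a conceptual model under stated assumptions, not how SingleStore executes the query: the inverted index of 2-character grams, the shared-gram ranking, and the names `bigrams`, `index`, and `search` are all hypothetical.

```python
from collections import defaultdict

rows = {
    "Harvard University": "college.harvard.edu/admissions",
    "Stanford University": "stanford.edu/admission/",
    "Massachusetts Institute of Technology (MIT)": "mitadmissions.org/",
    "California Institute of Technology (Caltech)": "admissions.caltech.edu/",
    "University of Chicago": "uchicago.edu/en/admissions",
}


def bigrams(text):
    """Return the set of 2-character substrings of text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Built once, like an index: bigram -> names of rows whose page contains it.
index = defaultdict(set)
for name, page in rows.items():
    for gram in bigrams(page):
        index[gram].add(name)


def search(substring, limit=10):
    # Stage 1 (role of MATCH ... LIMIT): rank rows by shared bigrams,
    # consulting only the index, and keep the top candidates.
    hits = defaultdict(int)
    for gram in bigrams(substring):
        for name in index.get(gram, ()):
            hits[name] += 1
    candidates = sorted(hits, key=hits.get, reverse=True)[:limit]
    # Stage 2 (role of LIKE '%...%'): exact substring check on candidates only.
    return [name for name in candidates if substring in rows[name]]


result = search("arvar")
print(result)  # ['Harvard University']
```

The exact substring check runs only on the rows that survive the first stage, which is the reason the combined query can avoid scanning every row with `LIKE`.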

## Example 13: Custom Column Mappings

## Example 13a: Per-Column Analyzers

Create a table that uses the `french` analyzer for the `french_content` column, the `english` analyzer for the `english_content` column, and the `standard` analyzer for all other columns (`title`).

```sql
CREATE TABLE custom_col_analyzers (
    title VARCHAR(200),
    french_content VARCHAR(200),
    english_content VARCHAR(200),
    FULLTEXT USING VERSION 2 (title, french_content, english_content)
    INDEX_OPTIONS
        '{
            "analyzer": "standard",
            "mappings": {
                "french_content": {
                    "analyzer": "french"
                },
                "english_content": {
                    "analyzer": "english"
                }
            }
        }'
);
```

Insert data into the table and optimize the table to ensure all the data is indexed.

```sql
INSERT INTO custom_col_analyzers 
VALUES 
    ("fast", "Nous sommes la base de données la plus rapide.", 
                        "We are the fastest database."),
    ("slow", "Cette base de données est un peu lente.", 
                       "This database is kind of slow."),
    ("slowest", "Cette base de données est la plus lente.", 
                       "This database is the slowest.");

OPTIMIZE TABLE custom_col_analyzers FLUSH;
```

The following query searches the columns using the per-column analyzers defined earlier and returns any row in which the `french_content` column contains `rapide` or the `english_content` column contains `fastest`.

```sql
SELECT title,
MATCH(TABLE custom_col_analyzers) 
AGAINST ('french_content:(rapide) OR english_content:(fastest)') AS score 
FROM custom_col_analyzers 
WHERE score > 0
ORDER BY score DESC;

```

```output

+-------+--------------------+
| title | score              |
+-------+--------------------+
| fast  | 0.6301337480545044 |
+-------+--------------------+
```

In this query, the `french_content` column is searched for `rapide` using the index built with the `french` analyzer, while the `english_content` column is searched for `fastest` using the index built with the `english` analyzer.

The following query returns any row in which the `french_content` column contains `rapide` or the `english_content` column contains `slowest`.

```sql
SELECT title,
MATCH(TABLE custom_col_analyzers) 
AGAINST ('french_content:(rapide) OR english_content:(slowest)') AS score 
FROM custom_col_analyzers 
WHERE score > 0
ORDER BY score DESC;

```

```output

+---------+---------------------+
| title   | score               |
+---------+---------------------+
| slowest | 0.49662238359451294 |
| fast    |  0.4458314776420593 |
+---------+---------------------+
```

Every `MATCH` query requires a prefix that specifies which column to search. As a result, each column is searched using the analyzer configured for that column.
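
The fallback rule used in this example (a column listed under `mappings` uses its own analyzer, and every other column uses the top-level analyzer) can be modeled as a simple lookup. The Python sketch below only illustrates that rule for the configuration shown above; the helper name `analyzer_for` is hypothetical and is not part of any SingleStore API.

```python
import json

# The same INDEX_OPTIONS document used in the CREATE TABLE statement above.
index_options = json.loads("""
{
    "analyzer": "standard",
    "mappings": {
        "french_content": {"analyzer": "french"},
        "english_content": {"analyzer": "english"}
    }
}
""")


def analyzer_for(column, options):
    """Return the column's mapped analyzer, else the top-level analyzer."""
    mapping = options.get("mappings", {}).get(column, {})
    return mapping.get("analyzer", options["analyzer"])


columns = ("title", "french_content", "english_content")
resolved = {column: analyzer_for(column, index_options) for column in columns}
print(resolved)
# {'title': 'standard', 'french_content': 'french', 'english_content': 'english'}
```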

## Example 13b: JSON and BSON Keypath Analyzers

Create a table that indexes the `json_column$english_content` field with the `english` analyzer and indexes the `json_column$french_content` field with the `french` analyzer.

```sql
CREATE TABLE json_keypath_analyzers (
    json_column JSON,
    FULLTEXT USING VERSION 2 KEY(json_column)
    INDEX_OPTIONS '{
        "mappings": {
            "json_column$english_content": {
                "analyzer": "english"
            },
            "json_column$french_content": {
                "analyzer": "french"
            }
        }
    }'
);
```

Insert data into the table and optimize the table to ensure all the data is indexed.

```sql
INSERT INTO json_keypath_analyzers
VALUES 
('{"english_content": "We are the fastest database."}'),
('{"french_content": "Nous sommes la base de données la plus rapide."}'),
('{"english_content": "This database is kind of slow."}'),
('{"french_content": "Cette base de données est un peu lente."}'),
('{"english_content": "This database is the slowest."}'),
('{"french_content": "Cette base de données est la plus lente."}');

OPTIMIZE TABLE json_keypath_analyzers FLUSH;

```

The following query searches the `english_content` field using the `english` analyzer, and the `french_content` field using the `french` analyzer. The query returns all the rows with `fastest` in the `english_content` field or `rapide` in the `french_content` field.

```sql
SELECT json_column,
(MATCH (TABLE json_keypath_analyzers) 
 AGAINST ('json_column$english_content:fastest
          OR json_column$french_content:rapide')
) AS score
FROM json_keypath_analyzers
WHERE score > 0
ORDER BY score DESC;

```

```output

+---------------------------------------------------------------------+--------------------+
| json_column                                                         | score              |
+---------------------------------------------------------------------+--------------------+
| {"english_content":"We are the fastest database."}                  | 0.3150668740272522 |
| {"french_content":"Nous sommes la base de données la plus rapide."} | 0.3150668740272522 |
+---------------------------------------------------------------------+--------------------+

```

The following query searches only for the word `fastest` in the `english_content` field.

```sql
SELECT json_column,
(MATCH(TABLE json_keypath_analyzers) 
 AGAINST ('json_column$english_content:fastest')
) AS score
FROM json_keypath_analyzers
WHERE score > 0
ORDER BY score DESC;


```

```output

+----------------------------------------------------+--------------------+
| json_column                                        | score              |
+----------------------------------------------------+--------------------+
| {"english_content":"We are the fastest database."} | 0.3150668740272522 |
+----------------------------------------------------+--------------------+

```

## Supported Language Analyzers

The following table lists the supported language analyzers.

| **Language**           | **Default Stop Word List Link**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `arabic`               | [Apache Lucene Arabic Stop Words](https://github.com/apache/lucene/blob/cfdd20f5bc8387ba24653ca2ba15aa5be10d0ae0/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `bulgarian`            | [Apache Lucene Bulgarian Stop Words](https://github.com/apache/lucene/blob/cfdd20f5bc8387ba24653ca2ba15aa5be10d0ae0/lucene/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| `bengali`              | [Apache Lucene Bengali Stop Words](https://github.com/apache/lucene/blob/cfdd20f5bc8387ba24653ca2ba15aa5be10d0ae0/lucene/analysis/common/src/resources/org/apache/lucene/analysis/bn/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `brazilian_portuguese` | [Apache Lucene Brazilian, Portuguese Stop Words](https://github.com/apache/lucene/blob/cfdd20f5bc8387ba24653ca2ba15aa5be10d0ae0/lucene/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                       |
| `catalan`              | [Apache Lucene Catalan Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `cjk`                  | [Apache Lucene CJK Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/cjk/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `sorani_kurdish`       | [Apache Lucene Sorani, Kurdish Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ckb/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                            |
| `czech`                | [Apache Lucene Czech Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| `danish`               | [Apache Lucene Danish Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                              |
| `german`               | [Apache Lucene German Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                              |
| `greek`                | [Apache Lucene Greek Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| `english`              | "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"                                                                                                                                                                                                                                                                                                                                                                                                     |
| `spanish`              | [Apache Lucene Spanish Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                            |
| `estonian`             | [Apache Lucene Estonian Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/et/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `basque`               | [Apache Lucene Basque Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `persian`              | [Apache Lucene Persian Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `finnish`              | [Apache Lucene Finnish Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                            |
| `french`               | [Apache Lucene French Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                              |
| `irish`                | [Apache Lucene Irish Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/irish_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                |
| `galician`             | [Apache Lucene Galician Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `hindi`                | [Apache Lucene Hindi Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| `hungarian`            | [Apache Lucene Hungarian Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `armenian`             | [Apache Lucene Armenian Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `indonesian`           | [Apache Lucene Indonesian Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/indonesian_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `italian`              | [Apache Lucene Italian Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                            |
| `korean`               | Uses Apache Lucene's Korean (Nori) analyzer, which filters tokens based on part-of-speech tags: EF, EC, ETN, ETM, IC, JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC, MAG, MAJ, MM, SP, SSC, SSO, SC, SE, XPN, XSA, XSN, XSV, UNA, NA, VSV. Refer to [Part of speech tags](https://lucene.apache.org/core/10_0_0/analysis/nori/org/apache/lucene/analysis/ko/POS.Tag.html). Custom stop word lists are not supported with the `korean` analyzer. Refer to [Lucene nori API](https://lucene.apache.org/core/10_0_0/analysis/nori/) and [Lucene Analyzer for Korean](https://lucene.apache.org/core/10_0_0/analysis/nori/org/apache/lucene/analysis/ko/package-summary.html). |
| `lithuanian`           | [Apache Lucene Lithuanian Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/lt/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| `latvian`              | [Apache Lucene Latvian Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/lv/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `nepali`               | [Apache Lucene Nepali Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ne/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `dutch`                | [Apache Lucene Dutch Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/dutch_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                |
| `norwegian`            | [Apache Lucene Norwegian Stop Words](https://github.com/apache/lucene/blob/539cf3c9a335bccb50a0bddbf8cabd2738727528/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `portuguese`           | [Apache Lucene Portuguese Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `romanian`             | [Apache Lucene Romanian Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `russian`              | [Apache Lucene Russian Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                            |
| `serbian`              | [Apache Lucene Serbian Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/sr/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `swedish`              | [Apache Lucene Swedish Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt)                                                                                                                                                                                                                                                                                                                                                                                                                            |
| `tamil`                | [Apache Lucene Tamil Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/ta/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| `telugu`               | [Apache Lucene Telugu Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/te/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `thai`                 | [Apache Lucene Thai Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/th/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| `turkish`              | [Apache Lucene Turkish Stop Words](https://github.com/apache/lucene/blob/13285279c2d193fe6ad3f323046dd53bbdc8dd4a/lucene/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt)                                                                                                                                                                                                                                                                                                                                                                                                                                     |

## Supported Tokenizers

The table below lists supported tokenizers. These tokenizers may have custom parameters, which can be obtained and used as described below.

## Get Parameters

The parameters and description of each of these tokenizers can be obtained from the links included in the table.

## Example: Get parameters for the `uax_url_email` tokenizer

To obtain the parameters for the `uax_url_email` tokenizer, follow the tokenizer factory link for the `uax_url_email` tokenizer, which can be found in the middle column of the table below.

The following tokenizer factory example for the `uax_url_email` tokenizer is taken from the tokenizer factory link. This tokenizer has one parameter, `maxTokenLength`, which defaults to 255.

```xml
<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
  </analyzer>
</fieldType>
```
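
The factory documentation expresses parameters as XML attributes on the `<tokenizer>` element. As a minimal sketch (not part of SingleStore; the helper name `tokenizer_params` is hypothetical), the following Python reads such a snippet and returns every attribute other than `class`, which gives the parameter names and the values shown in the snippet.

```python
import xml.etree.ElementTree as ET

# Factory snippet copied from the UAX29URLEmailTokenizerFactory documentation.
FACTORY_SNIPPET = """
<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
  </analyzer>
</fieldType>
"""


def tokenizer_params(snippet: str) -> dict:
    """Return the attributes of the <tokenizer> element, excluding `class`."""
    root = ET.fromstring(snippet.strip())
    tokenizer = root.find("./analyzer/tokenizer")
    if tokenizer is None:
        raise ValueError("snippet has no <tokenizer> element")
    # Everything except the factory class name is a tokenizer parameter.
    return {name: value for name, value in tokenizer.attrib.items() if name != "class"}


print(tokenizer_params(FACTORY_SNIPPET))  # {'maxTokenLength': '255'}
```

Attribute values are returned as strings; convert them to the JSON type the parameter expects (here, an integer) before placing them in `INDEX_OPTIONS`.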

The `INDEX_OPTIONS` string to create a full-text index with the `uax_url_email` tokenizer specifying a `maxTokenLength` of `300` is shown below.

```sql
INDEX_OPTIONS '{
    "analyzer": {
        "custom": {
            "tokenizer": {
                "uax_url_email": {
                    "maxTokenLength": 300
                }
            }
        }
    }
}'
```
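
The same nesting can be generated programmatically for any tokenizer that takes parameters. The following Python is a minimal sketch, not a SingleStore API: the helper name `custom_tokenizer_options` is hypothetical, and it only mirrors the structure of the `INDEX_OPTIONS` string shown above.

```python
import json


def custom_tokenizer_options(tokenizer: str, **params) -> str:
    """Build an INDEX_OPTIONS JSON string for a custom analyzer whose
    tokenizer takes parameters, using the nesting from the example above."""
    options = {"analyzer": {"custom": {"tokenizer": {tokenizer: params}}}}
    return json.dumps(options, indent=4)


# Reproduce the uax_url_email example with maxTokenLength set to 300.
index_options = custom_tokenizer_options("uax_url_email", maxTokenLength=300)
print(f"INDEX_OPTIONS '{index_options}'")
```

Generating the string with `json.dumps` guarantees valid JSON, which helps avoid quoting and bracket mistakes in hand-written options.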

## List of Supported Tokenizers

| **"tokenizer" (Case-Sensitive)** | **Tokenizer Factory Link (Includes Parameters)**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | **Tokenizer Class Link (Includes Description)**                                                                                                          |
| -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `uax_url_email`                  | [UAX29URLEmailTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | [UAX29URLEmailTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/UAX29URLEmailTokenizer.html)          |
| `whitespace`                     | [WhitespaceTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | [WhitespaceTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/core/WhitespaceTokenizer.html)                    |
| `classic`                        | [ClassicTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | [ClassicTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicTokenizer.html)                      |
| `simple_pattern`                 | [SimplePatternTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/pattern/SimplePatternTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | [SimplePatternTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/pattern/SimplePatternTokenizer.html)           |
| `standard`                       | [StandardTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/standard/StandardTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | [StandardTokenizer](https://lucene.apache.org/core/8_3_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html?is-external=true)               |
| `keyword`                        | [KeywordTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/core/KeywordTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | [KeywordTokenizer](https://lucene.apache.org/core/8_3_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html?is-external=true)                |
| `letter`                         | [LetterTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/core/LetterTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | [LetterTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/core/LetterTokenizer.html)                            |
| `simple_pattern_split`           | [SimplePatternSplitTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/pattern/SimplePatternSplitTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | [SimplePatternSplitTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/pattern/SimplePatternSplitTokenizer.html) |
| `pattern`                        | [PatternTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | [PatternTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizer.html)                       |
| `thai`                           | [ThaiTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/th/ThaiTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | [ThaiTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/th/ThaiTokenizer.html)                                  |
| `edge_n_gram`                    | [EdgeNGramTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | [EdgeNGramTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramTokenizer.html)                     |
| `n_gram`                         | [NGramTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | [NGramTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramTokenizer.html)                             |
| `wikipedia`                      | [WikipediaTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/wikipedia/WikipediaTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    | [WikipediaTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/wikipedia/WikipediaTokenizer.html)                 |
| `path_hierarchy`                 | [PathHierarchyTokenizerFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizerFactory.html)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | [PathHierarchyTokenizer](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/path/PathHierarchyTokenizer.html)              |
| `korean`                         | *Description*: Tokenizer for Korean that uses morphological analysis. *Supports the following attributes*:<ul> <li><a href="https://lucene.apache.org/core/10_0_0/analysis/nori/org/apache/lucene/analysis/ko/dict/UserDictionary.html">userDictionary</a> (JSON array of strings): Each string is a term in the dictionary.</li> <li><a href="https://lucene.apache.org/core/10_0_0/analysis/nori/org/apache/lucene/analysis/ko/KoreanTokenizer.DecompoundMode.html">decompoundMode</a> (JSON string): Determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT, and POS.Type.PREANALYSIS tokens. Valid values are 'none', 'discard', and 'mixed'; the default is 'discard'.</li> <li>outputUnknownUnigrams (JSON boolean value): If "true", outputs unigrams for unknown words.</li> <li>discardPunctuation (JSON boolean value): If "true", punctuation tokens are dropped from the output.</li> </ul> |                                                                                                                                                          |
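
As a sketch, the `korean` tokenizer attributes listed in the table can be combined into an `INDEX_OPTIONS` string as follows. This assumes the attributes nest under the tokenizer name in the same way that `maxTokenLength` does for `uax_url_email`; the dictionary terms and attribute values are made-up examples.

```python
import json

# Attribute names come from the table; the values are illustrative only.
korean_tokenizer = {
    "korean": {
        "userDictionary": ["싱글스토어", "데이터베이스"],  # hypothetical dictionary terms
        "decompoundMode": "mixed",  # one of: none, discard, mixed
        "outputUnknownUnigrams": False,
        "discardPunctuation": True,
    }
}

options = {"analyzer": {"custom": {"tokenizer": korean_tokenizer}}}

# ensure_ascii=False keeps the Korean terms readable in the generated string.
index_options = json.dumps(options, ensure_ascii=False, indent=2)
print(f"INDEX_OPTIONS '{index_options}'")
```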

## Supported Token Filters

The table below lists the supported token filters. Each row gives the filter name and a link to the Lucene token filter factory documentation, which describes the filter and its parameters.

| **"token\_filters" (Case-Sensitive)** | **Lucene Link for Parameters and Description**                                                                                                                                                                                                 |
| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `russian_light_stem`                  | [RussianLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ru/RussianLightStemFilterFactory.html)                                                                                        |
| `scandinavian_normalization`          | [ScandinavianNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianNormalizationFilterFactory.html)                                                           |
| `decimal_digit`                       | [DecimalDigitFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/DecimalDigitFilterFactory.html)                                                                                              |
| `ascii_folding`                       | [ASCIIFoldingFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html)                                                                                     |
| `german_stem`                         | [GermanStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/de/GermanStemFilterFactory.html)                                                                                                    |
| `bulgarian_stem`                      | [BulgarianStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/bg/BulgarianStemFilterFactory.html)                                                                                              |
| `codepoint_count`                     | [CodepointCountFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/CodepointCountFilterFactory.html)                                                                                 |
| `pattern_replace`                     | [PatternReplaceFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceFilterFactory.html)                                                                                       |
| `persian_normalization`               | [PersianNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/fa/PersianNormalizationFilterFactory.html)                                                                                |
| `limit_token_position`                | [LimitTokenPositionFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenPositionFilterFactory.html)                                                                         |
| `porter_stem`                         | [PorterStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilterFactory.html)                                                                                                    |
| `greek_stem`                          | [GreekStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/el/GreekStemFilterFactory.html)                                                                                                      |
| `finnish_light_stem`                  | [FinnishLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/fi/FinnishLightStemFilterFactory.html)                                                                                        |
| `fingerprint`                         | [FingerprintFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/FingerprintFilterFactory.html)                                                                                       |
| `cjk_width`                           | [CJKWidthFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthFilterFactory.html)                                                                                                       |
| `reverse_string`                      | [ReverseStringFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/reverse/ReverseStringFilterFactory.html)                                                                                         |
| `common_grams`                        | [CommonGramsFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilterFactory.html)                                                                                         |
| `delimited_boost_token`               | [DelimitedBoostTokenFilterFactory](https://lucene.apache.org/core/8_9_0/analyzers-common/org/apache/lucene/analysis/boost/DelimitedBoostTokenFilterFactory.html)                                                                               |
| `scandinavian_folding`                | [ScandinavianFoldingFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ScandinavianFoldingFilterFactory.html)                                                                       |
| `hindi_stem`                          | [HindiStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/hi/HindiStemFilterFactory.html)                                                                                                      |
| `spanish_plural_stem`                 | [SpanishPluralStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/es/SpanishMinimalStemFilterFactory.html)                                                                                     |
| `indonesian_stem`                     | [IndonesianStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/id/IndonesianStemFilterFactory.html)                                                                                            |
| `trim`                                | [TrimFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TrimFilterFactory.html)                                                                                                     |
| `french_light_stem`                   | [FrenchLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/fr/FrenchLightStemFilterFactory.html)                                                                                          |
| `classic`                             | [ClassicFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/standard/ClassicFilterFactory.html)                                                                                                    |
| `fixed_shingle`                       | [FixedShingleFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/shingle/FixedShingleFilterFactory.html)                                                                                           |
| `english_possessive`                  | [EnglishPossessiveFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/en/EnglishPossessiveFilterFactory.html)                                                                                      |
| `german_normalization`                | [GermanNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilterFactory.html)                                                                                  |
| `keyword_repeat`                      | [KeywordRepeatFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html)                                                                                   |
| `min_hash`                            | [MinHashFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/minhash/MinHashFilterFactory.html)                                                                                                     |
| `remove_duplicates_token`             | [RemoveDuplicatesTokenFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilterFactory.html)                                                                   |
| `snowball_porter`                     | [SnowballPorterFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballPorterFilterFactory.html)                                                                                      |
| `german_minimal_stem`                 | [GermanMinimalStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/de/GermanMinimalStemFilterFactory.html)                                                                                      |
| `norwegian_light_stem`                | [NorwegianLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianLightStemFilterFactory.html)                                                                                    |
| `english_minimal_stem`                | [EnglishMinimalStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/en/EnglishMinimalStemFilterFactory.html)                                                                                    |
| `norwegian_minimal_stem`              | [NorwegianMinimalStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/no/NorwegianMinimalStemFilterFactory.html)                                                                                |
| `czech_stem`                          | [CzechStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/cz/CzechStemFilterFactory.html)                                                                                                      |
| `sorani_stem`                         | [SoraniStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniStemFilterFactory.html)                                                                                                   |
| `limit_token_offset`                  | [LimitTokenOffsetFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenOffsetFilterFactory.html)                                                                             |
| `persian_stem`                        | [PersianStemFilterFactory](https://lucene.apache.org/core/9_9_1/analysis/common/org/apache/lucene/analysis/fa/PersianStemFilterFactory.html)                                                                                                   |
| `common_grams_query`                  | [CommonGramsQueryFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsQueryFilterFactory.html)                                                                               |
| `sorani_normalization`                | [SoraniNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ckb/SoraniNormalizationFilterFactory.html)                                                                                 |
| `swedish_light_stem`                  | [SwedishLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/sv/SwedishLightStemFilterFactory.html)                                                                                        |
| `k_stem`                              | [KStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/en/KStemFilterFactory.html)                                                                                                              |
| `french_minimal_stem`                 | [FrenchMinimalStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/fr/FrenchMinimalStemFilterFactory.html)                                                                                      |
| `hyphenated_words`                    | [HyphenatedWordsFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilterFactory.html)                                                                               |
| `capitalization`                      | [CapitalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/CapitalizationFilterFactory.html)                                                                                 |
| `lower_case`                          | [LowerCaseFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/LowerCaseFilterFactory.html)                                                                                                    |
| `hungarian_light_stem`                | [HungarianLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/hu/HungarianLightStemFilterFactory.html)                                                                                    |
| `telugu_stem`                         | [TeluguStemFilterFactory](https://lucene.apache.org/core/9_9_1/analysis/common/org/apache/lucene/analysis/te/TeluguStemFilterFactory.html)                                                                                                     |
| `italian_light_stem`                  | [ItalianLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/it/ItalianLightStemFilterFactory.html)                                                                                        |
| `limit_token_count`                   | [LimitTokenCountFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LimitTokenCountFilterFactory.html)                                                                               |
| `swedish_minimal_stem`                | [SwedishMinimalStemFilterFactory](https://lucene.apache.org/core/9_9_1/analysis/common/org/apache/lucene/analysis/sv/SwedishMinimalStemFilterFactory.html)                                                                                     |
| `galician_minimal_stem`               | [GalicianMinimalStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/gl/GalicianMinimalStemFilterFactory.html)                                                                                  |
| `portuguese_minimal_stem`             | [PortugueseMinimalStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/pt/PortugueseMinimalStemFilterFactory.html)                                                                              |
| `bengali_normalization`               | [BengaliNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/bn/BengaliNormalizationFilterFactory.html)                                                                                |
| `galician_stem`                       | [GalicianStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/gl/GalicianStemFilterFactory.html)                                                                                                |
| `turkish_lower_case`                  | [TurkishLowerCaseFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/tr/TurkishLowerCaseFilterFactory.html)                                                                                        |
| `bengali_stem`                        | [BengaliStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/bn/BengaliStemFilterFactory.html)                                                                                                  |
| `indic_normalization`                 | [IndicNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/in/IndicNormalizationFilterFactory.html)                                                                                    |
| `keep_word`                           | [KeepWordFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.html)                                                                                             |
| `drop_if_flagged`                     | [DropIfFlaggedFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/DropIfFlaggedFilterFactory.html)                                                                                   |
| `latvian_stem`                        | [LatvianStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/lv/LatvianStemFilterFactory.html)                                                                                                  |
| `portuguese_light_stem`               | [PortugueseLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/pt/PortugueseLightStemFilterFactory.html)                                                                                  |
| `apostrophe`                          | [ApostropheFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/tr/ApostropheFilterFactory.html)                                                                                                    |
| `arabic_stem`                         | [ArabicStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicStemFilterFactory.html)                                                                                                    |
| `delimited_term_frequency_token`      | [DelimitedTermFrequencyTokenFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/DelimitedTermFrequencyTokenFilterFactory.html)                                                       |
| `irish_lower_case`                    | [IrishLowerCaseFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ga/IrishLowerCaseFilterFactory.html)                                                                                            |
| `edge_n_gram`                         | [EdgeNGramFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ngram/EdgeNGramFilterFactory.html)                                                                                                   |
| `german_light_stem`                   | [GermanLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/de/GermanLightStemFilterFactory.html)                                                                                          |
| `pattern_capture_group`               | [PatternCaptureGroupFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternCaptureGroupFilterFactory.html)                                                                             |
| `spanish_light_stem`                  | [SpanishLightStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/es/SpanishLightStemFilterFactory.html)                                                                                        |
| `hindi_normalization`                 | [HindiNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/hi/HindiNormalizationFilterFactory.html)                                                                                    |
| `norwegian_normalization`             | [NorwegianNormalizationFilterFactory](https://lucene.apache.org/core/9_9_1/analysis/common/org/apache/lucene/analysis/no/NorwegianNormalizationFilterFactory.html)                                                                             |
| `shingle`                             | [ShingleFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilterFactory.html)                                                                                                     |
| `telugu_normalization`                | [TeluguNormalizationFilterFactory](https://lucene.apache.org/core/9_9_1/analysis/common/org/apache/lucene/analysis/te/TeluguNormalizationFilterFactory.html)                                                                                   |
| `date_recognizer`                     | [DateRecognizerFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/DateRecognizerFilterFactory.html)                                                                                 |
| `n_gram`                              | [NGramFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ngram/NGramFilterFactory.html)                                                                                                           |
| `upper_case`                          | [UpperCaseFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/core/UpperCaseFilterFactory.html)                                                                                                    |
| `brazilian_stem`                      | [BrazilianStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemFilterFactory.html)                                                                                              |
| `cjk_bigram`                          | [CJKBigramFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html)                                                                                                     |
| `truncate_token`                      | [TruncateTokenFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TruncateTokenFilterFactory.html)                                                                                   |
| `greek_lower_case`                    | [GreekLowerCaseFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/el/GreekLowerCaseFilterFactory.html)                                                                                            |
| `length`                              | [LengthFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilterFactory.html)                                                                                                 |
| `arabic_normalization`                | [ArabicNormalizationFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/ar/ArabicNormalizationFilterFactory.html)                                                                                  |
| `portuguese_stem`                     | [PortugueseStemFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/pt/PortugueseStemFilterFactory.html)                                                                                            |
| `elision`                             | [ElisionFilterFactory](https://lucene.apache.org/core/8_4_0/analyzers-common/org/apache/lucene/analysis/util/ElisionFilterFactory.html)                                                                                                        |
| `korean_part_of_speech`               | [KoreanPartOfSpeechStopFilterFactory](https://lucene.apache.org/core/10_0_0/analysis/nori/org/apache/lucene/analysis/ko/KoreanPartOfSpeechStopFilterFactory.html) A token filter that removes tokens that match a set of part-of-speech tags.  |
| `korean_reading_form`                 | [KoreanReadingFormFilterFactory](https://lucene.apache.org/core/10_0_0/analysis/nori/org/apache/lucene/analysis/ko/KoreanReadingFormFilterFactory.html) A token filter that rewrites tokens written in Hanja to their Hangul form.             |
| `korean_number`                       | [KoreanNumberFilterFactory](https://lucene.apache.org/core/10_0_0/analysis/nori/org/apache/lucene/analysis/ko/KoreanNumberFilterFactory.html) A token filter that normalizes Korean numbers to Arabic decimal numbers in half-width characters. |
| `stop`                                | A custom token filter that removes stop words from a token stream. Refer to [Custom Stop Words](https://docs.singlestore.com/#section-idm234955753778211.md).                                                                                  |

## Supported Character Filters

The following table lists each supported character filter by the name used in the analyzer configuration, with a link to the Lucene documentation for its parameters.

| **"char\_filters" (case-sensitive)** | **Lucene Link for Parameters**                                                                                                                                   |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `persian`                            | [PersianCharFilterFactory](https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/fa/PersianCharFilterFactory.html)                    |
| `cjk_width`                          | [CJKWidthCharFilterFactory](https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/cjk/CJKWidthCharFilterFactory.html)                 |
| `html_strip`                         | [HTMLStripCharFilterFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/charfilter/HTMLStripCharFilterFactory.html)        |
| `pattern_replace`                    | [PatternReplaceCharFilterFactory](https://lucene.apache.org/core/8_3_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceCharFilterFactory.html) |
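
As a rough illustration of how names from these tables fit into an analyzer configuration, the sketch below combines the `html_strip` character filter with the `lower_case` token filter in a custom analyzer. The table name `docs_demo`, the index name `ft_idx`, and the column names are hypothetical. The `custom`, `tokenizer`, and `token_filters` keys and the `whitespace` tokenizer are assumptions not defined in this section; confirm them against the examples earlier on this page before relying on them.

```sql
-- Hypothetical table; the key layout inside INDEX_OPTIONS is an assumption
-- and should be checked against the Examples section of this page.
CREATE TABLE docs_demo (
    id INT UNSIGNED,
    body TEXT,
    SORT KEY (id),
    FULLTEXT USING VERSION 2 ft_idx (body)
    INDEX_OPTIONS '{
        "analyzer": {
            "custom": {
                "char_filters": ["html_strip"],
                "tokenizer": "whitespace",
                "token_filters": ["lower_case"]
            }
        }
    }'
);
```

In this sketch, the character filter runs on the raw text before tokenization and the token filter runs on the resulting tokens, matching the order of the analyzer components described at the top of this page.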

***

Modified at: February 11, 2026

Source: [/db/v9.1/developer-resources/functional-extensions/full-text-version-2-custom-analyzers/](https://docs.singlestore.com/db/v9.1/developer-resources/functional-extensions/full-text-version-2-custom-analyzers/)