Full Text VERSION 2 Custom Analyzers

SingleStore supports custom analyzers for full-text VERSION 2 search. Users can specify an analyzer to get a customized full-text search experience. An analyzer contains three components: a tokenizer, character filters, and token filters. An analyzer must have exactly one tokenizer; it can have zero or more character filters and zero or more token filters. 

A tokenizer takes a stream of characters and breaks it into individual tokens, for example, by splitting on whitespace characters. A character filter takes the stream of text data and transforms it in a predefined way, for example, by removing all HTML tags. Token filters receive a stream of tokens and may add, change, or remove tokens, for example, by lowercasing all tokens or removing stopwords.
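As a rough illustration of how the three components fit together, the following Python sketch chains a character filter, a tokenizer, and two token filters. It is not SingleStore or Lucene code; the function names and the tiny stopword list are hypothetical and exist only for this example.

```python
import re

def strip_html_tags(text):
    # Character filter: replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    # Tokenizer: split the text on runs of whitespace.
    return text.split()

def lowercase(tokens):
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "an"})):
    # Token filter: drop tokens found in a (hypothetical) stopword list.
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    # Character filters run first, then the single tokenizer, then token filters.
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<p>The Quick  <b>Brown</b> Fox</p>",
    [strip_html_tags],
    whitespace_tokenizer,
    [lowercase, remove_stopwords],
)
print(tokens)
```

The order of the token filters matters in this sketch: lowercasing runs before stopword removal so that "The" is matched against the lowercase stopword list.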

Users can choose from a list of pre-configured analyzers and use them without any modifications. Users can also create their own analyzers by specifying a tokenizer, character filters, and token filters to obtain a fully customized search experience. 

Refer to Working with Full-Text Search for more information on full-text search.

Specify an Analyzer

Specify an analyzer by passing an analyzer configuration in JSON format to INDEX_OPTIONS, which is a JSON string that contains the index configuration. In this JSON, the value of the analyzer key is either a string or a nested JSON value.

  • Specify the name of a built-in analyzer (e.g., standard or cjk) as a string.

  • Specify a customized built-in analyzer or a custom analyzer as a nested JSON value.

The three examples below show a built-in analyzer with no customizations, a built-in analyzer with a customized set of stopwords, and a custom analyzer. A full set of examples can be found in Examples. Refer to Analyzers for details on analyzers.

Specify the built-in analyzer for Chinese, Japanese, and Korean characters, called the cjk analyzer, with no customizations.

CREATE TABLE t (
title VARCHAR(200),
content VARCHAR(200),
FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
'{ "analyzer": "cjk"}'
);

Specify the built-in cjk analyzer with a customized set of stopwords. Built-in analyzers can be customized with custom stopword lists; no other customizations for built-in analyzers are supported.

CREATE TABLE t (
title VARCHAR(200),
content VARCHAR(200),
FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
'{"analyzer": {
"cjk": {
"stopset": [
"这",
"那"
]
}
}
}'
);

Note

In addition to the cjk analyzer, the Korean nori analyzer is also supported.

Specify a custom analyzer that uses the whitespace tokenizer, the html_strip character filter, and the lower_case token filter. The analyzer name must be custom. To use additional character filters or token filters, add them to the char_filters and token_filters arrays.

CREATE TABLE t (
title VARCHAR(200),
content VARCHAR(200),
FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
'{
"analyzer": {
"custom": {
"tokenizer": "whitespace",
"char_filters": ["html_strip"],
"token_filters": ["lower_case"]
}
}
}'
);
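Because INDEX_OPTIONS is a JSON string, small syntax slips such as a trailing comma or an unbalanced brace are easy to make when it is typed by hand. One way to avoid them, sketched below in Python, is to build the configuration as a dictionary and serialize it. The helper name index_options is hypothetical, and the sketch assumes the JSON contains no single quotes that would need escaping inside the SQL string literal.

```python
import json

def index_options(analyzer):
    # 'analyzer' is either a built-in analyzer name (str) or a nested config (dict).
    return json.dumps({"analyzer": analyzer}, ensure_ascii=False)

builtin = index_options("cjk")
custom = index_options({
    "custom": {
        "tokenizer": "whitespace",
        "char_filters": ["html_strip"],
        "token_filters": ["lower_case"],
    }
})

# Assumption: the serialized JSON has no single quotes, so it can be
# wrapped in a SQL string literal without further escaping.
ddl = (
    "CREATE TABLE t (\n"
    "title VARCHAR(200),\n"
    "content VARCHAR(200),\n"
    "FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS\n"
    f"'{custom}'\n"
    ");"
)
print(builtin)
print(ddl)
```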

Analyzers

There are two types of analyzers: built-in analyzers and custom analyzers.

Note

The examples in this section show only the INDEX_OPTIONS string (JSON) and omit the rest of the index creation command.

Built-in Analyzers

Built-in analyzers are pre-configured analyzers, including the standard analyzer and language-specific analyzers, that require no configuration. Built-in analyzers may be customized with custom stopword lists.

The default analyzer is the Apache Lucene standard analyzer, which uses the Apache Lucene standard tokenizer, lowercase token filters, and no stopwords.

Specify a built-in analyzer, without customizations, by specifying the name of the analyzer as the value of the analyzer key.

The following example specifies the use of the spanish language analyzer.

INDEX_OPTIONS '{"analyzer" : "spanish"}'

A custom stopword list can be specified for a built-in analyzer by specifying a stopset in the JSON as shown in the following example. A custom stopword list is the only customization supported for built-in analyzers.

The following example specifies a custom stopword list for the spanish analyzer.

The value of the analyzer key is a nested JSON value: a key-value pair whose key is the name of the analyzer (spanish in this example) and whose value is another key-value pair with the key stopset and a JSON array of stopwords as its value.

INDEX_OPTIONS '{
"analyzer": {
"spanish": {
"stopset": [
"el",
"la"
]
}
}
}'
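The effect of a custom stopset can be pictured with the short Python sketch below. It assumes, as the Spanish stopword example later in this document indicates, that a custom list replaces the analyzer's default list rather than extending it. The DEFAULT_STOPSET shown here is a made-up excerpt, not the analyzer's real default list.

```python
# Hypothetical excerpt of a default Spanish stopword list (illustration only).
DEFAULT_STOPSET = {"el", "la", "de", "que"}

def effective_stopset(custom=None):
    # Assumption: a custom stopset replaces the default list entirely.
    return set(custom) if custom is not None else set(DEFAULT_STOPSET)

def filter_tokens(tokens, stopset):
    # Drop every token that appears in the stopset.
    return [t for t in tokens if t not in stopset]

tokens = "el nuevo tratamiento de la diabetes".split()
default_result = filter_tokens(tokens, effective_stopset())
custom_result = filter_tokens(tokens, effective_stopset(["nuevo", "tratamiento"]))
print(default_result)
print(custom_result)
```

With the custom list, words such as "el" and "la" are indexed again, while "nuevo" and "tratamiento" are dropped.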

SingleStore recommends using the default language analyzer, without stopword customization, in most cases, e.g. '{"analyzer" : "catalan"}'.

Refer to Supported Language Analyzers for links to the default list of stop words for each analyzer.

Custom Analyzers

Create a custom analyzer by using the analyzer name custom and by specifying a tokenizer and optional token and character filters.

A custom analyzer consists of:

  • A required tokenizer: A tokenizer breaks up incoming text into tokens. In many cases, the tokenizer is the first step in the analysis process. To modify text before tokenization, use char_filters (see below).

  • An optional array of token_filters: A token_filter modifies tokens that have been created by the tokenizer. Common modifications performed by a token_filter are deletion, stemming, and case folding.

  • An optional array of char_filters: A char_filter transforms the text before it is tokenized, while providing corrected character offsets to account for these modifications.

The example below shows the use of all three components, tokenizer, char_filters, and token_filters.

INDEX_OPTIONS '{
"analyzer" : {
"custom": {
"tokenizer": "whitespace",
"char_filters": ["html_strip"],
"token_filters": ["lower_case"]
}
}
}'

Each of these three components (tokenizer, char_filters, token_filters) can be specified as a string with the name of the component or as a nested JSON with a configuration for the component.

The example below specifies a custom analyzer that uses the whitespace tokenizer with a maximum token length of 256 characters.

INDEX_OPTIONS '{
"analyzer": {
"custom": {
"tokenizer": {
"whitespace": {
"maxTokenLen": 256
}
}
}
}
}'
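The two accepted forms of a component (a bare name, or a nested object with one key that maps the name to its parameters) can be normalized into a single shape. The Python sketch below shows one way to do that when generating or checking configurations on the client side. The function normalize_component is hypothetical and is not part of any SingleStore API.

```python
def normalize_component(spec):
    # A bare string means "use this component with default parameters".
    if isinstance(spec, str):
        return spec, {}
    # A nested object must have exactly one key: the component name.
    if isinstance(spec, dict) and len(spec) == 1:
        name, params = next(iter(spec.items()))
        if not isinstance(params, dict):
            raise ValueError("component parameters must be a JSON object")
        return name, dict(params)
    raise ValueError("component must be a name or a single-key JSON object")

plain = normalize_component("whitespace")
configured = normalize_component({"whitespace": {"maxTokenLen": 256}})
print(plain)
print(configured)
```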

Common Tokenizers

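Common tokenizers that are supported are listed in the table below. Refer to Supported Tokenizers for a full list of supported tokenizers.

+------------------------------+------------------------------------------------------+-------------------------------------------------------+
| "tokenizer" (case-sensitive) | Parameters (includes Lucene link)                    | Description (includes Lucene link)                    |
+------------------------------+------------------------------------------------------+-------------------------------------------------------+
| whitespace                   | rule (Optional, string). Defaults to "unicode".      | Divides text at whitespace characters as defined by   |
|                              | maxTokenLen (Optional, integer). Defaults to 256.    | Character.isWhitespace(int). This definition excludes |
|                              | Lucene: WhitespaceTokenizerFactory                   | non-breaking spaces from whitespace characters.       |
|                              |                                                      | Lucene: WhitespaceTokenizer                           |
+------------------------------+------------------------------------------------------+-------------------------------------------------------+
| standard                     | maxTokenLength (Optional, integer). Defaults to 255. | Implements Word Break rules from Unicode Text         |
|                              | Lucene: StandardTokenizerFactory                     | Segmentation: Unicode Standard Annex #29.             |
|                              |                                                      | Lucene: StandardTokenizer (Lucene 6.6.0 API)          |
+------------------------------+------------------------------------------------------+-------------------------------------------------------+
| n_gram                       | minGramSize (Optional, integer). Defaults to 1.      | Tokenizes the input into n-grams of the specified     |
|                              | maxGramSize (Optional, integer). Defaults to 2.      | size(s).                                              |
|                              | Lucene: NGramTokenizerFactory                        | Lucene: NGramTokenizer                                |
+------------------------------+------------------------------------------------------+-------------------------------------------------------+
| uax_url_email                | maxTokenLength (Optional, integer). Defaults to 255. | Implements Word Break rules from Unicode Text         |
|                              | Lucene: UAX29URLEmailTokenizerFactory                | Segmentation: Unicode Standard Annex #29. URLs and    |
|                              |                                                      | email addresses are also tokenized.                   |
|                              |                                                      | Lucene: UAX29URLEmailTokenizer                        |
+------------------------------+------------------------------------------------------+-------------------------------------------------------+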

Common Token Filters

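Common token filters that are supported are listed in the table below. Refer to Supported Token Filters for a full list of supported token filters.

+----------------------------------+-----------------------------------------------------------+-----------------------------------------------------+
| "token_filters" (case-sensitive) | Parameters (includes Lucene link)                         | Description (includes Lucene link)                  |
+----------------------------------+-----------------------------------------------------------+-----------------------------------------------------+
| shingle                          | minShingleSize (Optional, integer). Defaults to 2.        | Constructs shingles (token n-grams), that is,       |
|                                  | maxShingleSize (Optional, integer). Defaults to 2.        | combinations of adjacent tokens emitted as a single |
|                                  | Lucene: ShingleFilterFactory                              | token.                                              |
|                                  |                                                           | Lucene: ShingleFilter                               |
+----------------------------------+-----------------------------------------------------------+-----------------------------------------------------+
| lower_case                       | No parameters.                                            | Normalizes token text to lower case.                |
|                                  | Lucene: LowerCaseFilterFactory                            | Lucene: LowerCaseFilter                             |
+----------------------------------+-----------------------------------------------------------+-----------------------------------------------------+
| snowball_porter                  | protected (Optional, string). Defaults to                 | Stems words using a Snowball-generated stemmer.     |
|                                  | "protectedkeyword.txt".                                   | Available stemmers are listed in                    |
|                                  | language (Optional, string). Defaults to "English".       | org.tartarus.snowball.ext.                          |
|                                  | Lucene: SnowballPorterFilterFactory                       | Lucene: SnowballFilter                              |
+----------------------------------+-----------------------------------------------------------+-----------------------------------------------------+
| n_gram                           | minGramSize (Optional, integer). Defaults to 1.           | Tokenizes the input into n-grams of the given       |
|                                  | maxGramSize (Optional, integer). Defaults to 2.           | size(s).                                            |
|                                  | preserveOriginal (Optional, boolean). Defaults to "true". | Lucene: NGramTokenFilter                            |
|                                  | Lucene: NGramFilterFactory                                |                                                     |
+----------------------------------+-----------------------------------------------------------+-----------------------------------------------------+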
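To make the idea of shingles concrete, the Python sketch below builds token n-grams from a list of tokens. It only illustrates the concept and is not the Lucene ShingleFilter: the space separator is an assumption, and the sketch returns only the shingles, whereas a real filter may also emit the original single tokens depending on its configuration.

```python
def shingles(tokens, min_size=2, max_size=2, sep=" "):
    # Join each run of min_size..max_size adjacent tokens into one shingle.
    out = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                out.append(sep.join(tokens[start:start + size]))
    return out

pairs = shingles(["please", "divide", "this", "sentence"])
pairs_and_triples = shingles(["please", "divide", "this"], min_size=2, max_size=3)
print(pairs)
print(pairs_and_triples)
```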

Common Character Filters

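Common character filters that are supported are listed in the table below. Refer to Supported Character Filters for a full list of supported character filters.

+---------------------------------+---------------------------------------------------------+------------------------------------------------------+
| "char_filters" (case-sensitive) | Parameters (includes Lucene link)                       | Description (includes Lucene link)                   |
+---------------------------------+---------------------------------------------------------+------------------------------------------------------+
| html_strip                      | escapedTags (Optional, string). Defaults to "a, title". | Wraps another Reader and attempts to strip out HTML. |
|                                 | Lucene: HTMLStripCharFilterFactory                      | Lucene: HTMLStripCharFilter                          |
+---------------------------------+---------------------------------------------------------+------------------------------------------------------+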

Example of Using Parameters

Specify a default uax_url_email tokenizer:

INDEX_OPTIONS '{
"analyzer": {
"custom": {
"tokenizer": "uax_url_email"
}
}
}'

Specify a uax_url_email tokenizer with custom parameters:

INDEX_OPTIONS '{
"analyzer": {
"custom": {
"tokenizer": {
"uax_url_email" : {
"maxTokenLength": 300
}
}
}
}
}'

Stemming

Stemming reduces words to their root form, often by removing suffixes and prefixes. In English, the words "dressing" and "dressed" can both be stemmed to "dress". Thus, a search for one form of a verb (e.g., "dressing") can return documents containing other forms of the verb (e.g., "dressed" or "dress"). Stemming is language-specific. In JLucene, stemming is handled by the built-in language analyzers or can be customized using token filters.
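The "dress" example can be reproduced with a deliberately naive suffix stripper, shown below in Python. This toy function is only meant to show why different word forms end up matching each other. It is not the Snowball algorithm, and its suffix list is an assumption chosen for this one example.

```python
# Toy suffix list for illustration; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    # Leave words ending in "ss" alone so that "dress" is not cut to "dres".
    if word.endswith("ss"):
        return word
    for suffix in SUFFIXES:
        # Only strip a suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {toy_stem(w) for w in ["dress", "dressed", "dressing", "dresses"]}
print(stems)
```

Because every form reduces to the same stem, a query term and an indexed term that are stemmed the same way will match.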

NGrams

NGram tokenizers split words into small pieces and are good for fast "fuzzy-style" matching using a full-text index. The minimum and maximum gram lengths are customizable. Refer to Example 6: Custom Analyzer and N Gram Tokenizer for an example of using an n_gram tokenizer.
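As a small illustration of what "small pieces" means, the Python sketch below lists the character n-grams of a string for a given minimum and maximum gram size. The function name and the order in which grams are produced are choices made for this sketch, not a statement about how the n_gram tokenizer orders its output.

```python
def char_ngrams(text, min_gram=1, max_gram=2):
    # Emit every substring whose length is between min_gram and max_gram.
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

default_grams = char_ngrams("abc")
longer_grams = char_ngrams("abc", min_gram=2, max_gram=3)
print(default_grams)
print(longer_grams)
```

Smaller gram sizes produce more, shorter tokens, which makes matching more tolerant of typos but also less selective.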

Examples

Example 1: Custom Analyzer with Whitespace tokenizer

Use a custom analyzer and a whitespace tokenizer to search for text with a hyphen in queries.

Create a table, insert data, and optimize the table to ensure all data is included in results.

CREATE TABLE medium_articles (
title VARCHAR(200),
summary TEXT,
FULLTEXT USING VERSION 2 (summary) INDEX_OPTIONS
'{
"analyzer": {
"custom": {
"tokenizer": "whitespace"
}
}
}'
);
INSERT INTO medium_articles (title, summary) VALUES
('Build Real-Time Multimodal RAG Applications Using SingleStore!','This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses.'),
('Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case','This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency.'),
('Scaling RAG from POC to Production','This guide explains Retrieval-Augmented Generation (RAG) for building reliable, context-aware applications using large language models (LLMs) and emphasizes the importance of scaling from proof of concept to production.'),
('Tech Stack For Production-Ready LLM Applications In 2024','This guide reviews preferred tools for the entire LLM app development lifecycle, emphasizing simplicity and ease of use in building scalable AI applications.'),
('LangGraph + Gemini Pro + Custom Tool + Streamlit = Multi-Agent Application Development','This guide teaches you to create a chatbot using LangGraph and Streamlit, leveraging LangChain for building stateful multi-actor applications that respond to user support requests.');
OPTIMIZE TABLE medium_articles FLUSH;

Observe the difference between the results of the two search queries below.

SELECT *
FROM medium_articles
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multimodal");
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                          | summary                                                                                                                                                                        |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Build Real-Time Multimodal RAG Applications Using SingleStore! | This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses. |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
SELECT *
FROM medium_articles
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multi-modal");
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title                                                                    | summary                                                                                                                                                                     |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case | This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency. |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
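The difference between the two results comes from how the text is split. The Python sketch below contrasts a whitespace split with a split on every non-alphanumeric character. The regular expression is only a crude stand-in for tokenizers that break on punctuation; it is not the standard tokenizer.

```python
import re

summary = "a scalable, production-ready multi-modal chatbot using LangChain"

# Whitespace splitting keeps hyphenated words (and trailing punctuation) intact.
whitespace_tokens = summary.split()

# Splitting on non-alphanumeric characters breaks "multi-modal" into two tokens.
word_tokens = re.findall(r"[A-Za-z0-9]+", summary)

print(whitespace_tokens)
print(word_tokens)
```

With whitespace splitting, "multi-modal" is a single token that differs from "multimodal", which is why each query above matches only the row that contains its exact spelling.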

Example 2: Custom Analyzer with Whitespace tokenizer, html_strip as char_filters, and lower_case as token_filters

Use a custom analyzer, a whitespace tokenizer, html_strip as a character filter, and lower_case as a token filter to search for HTML entities in queries.

A character filter receives the original text data and converts it into a predefined format. A token filter receives a stream of tokens and can add, change, or remove tokens as needed.

In this example, html_strip as a character filter removes HTML tags and lower_case as a token filter lowercases the tokens.

Create a table, insert data, and optimize the table to ensure all data is included in results.

CREATE TABLE html_table (
title VARCHAR(200),
content VARCHAR(200),
FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
'{
"analyzer": {
"custom": {"char_filters": ["html_strip"],
"tokenizer": "whitespace",
"token_filters":["lower_case"]
}
}
}'
);
INSERT INTO html_table (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');
OPTIMIZE TABLE html_table FLUSH;

Search for text that is stored with an HTML entity, and observe the result of the search query.

SELECT *
FROM html_table
WHERE match(TABLE html_table) AGAINST("content:we're");
+---------------+----------------------------------------------------------------+
| title         | content                                                        |
+---------------+----------------------------------------------------------------+
| Success Story | Our team has achieved great things &amp; we&apos;re proud!</p> |
| Exciting News | We&apos;re thrilled to announce our new project!</p>           |
| Future Goals  | We&apos;re looking forward to achieving even more!</p>         |
+---------------+----------------------------------------------------------------+
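The result above can be approximated outside the database with the Python sketch below, which strips tags, decodes HTML entities, splits on whitespace, and lowercases. It uses the standard-library html.unescape as a stand-in for the entity handling of html_strip, so it is an approximation of the analysis chain rather than the actual filter. Row order is not modeled.

```python
import html
import re

ROWS = [
    ("Exciting News", "We&apos;re thrilled to announce our new project!</p>"),
    ("Learning Journey", "Learning is a never-ending journey &amp; I&apos;m excited!</p>"),
    ("Success Story", "Our team has achieved great things &amp; we&apos;re proud!</p>"),
    ("Grateful Heart", "Thank you for being a part of our journey &amp; supporting us!</p>"),
    ("Future Goals", "We&apos;re looking forward to achieving even more!</p>"),
]

def analyze(text):
    # Character filtering: drop tags first, then decode entities such as &apos;.
    text = re.sub(r"</?[A-Za-z][^>]*>", " ", text)
    text = html.unescape(text)
    # Whitespace tokenization followed by lowercasing.
    return [token.lower() for token in text.split()]

def search(term):
    # Return the titles of rows whose content contains the term as a token.
    return [title for title, content in ROWS if term in analyze(content)]

matches = search("we're")
print(matches)
```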

Example 3: Custom Analyzer with standard tokenizer and cjk_width as token_filter

Use a custom analyzer, a standard tokenizer, and cjk_width as a token filter to search for Japanese text.

In this example, cjk_width as a token filter normalizes the width differences in CJK (Chinese, Japanese, and Korean) characters.

Create a table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE japanese_novels (
title VARCHAR(200),
content VARCHAR(200),
FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
'{
"analyzer": {
"custom": {"tokenizer": "standard",
"token_filters":["cjk_width"]
}
}
}'
);
INSERT INTO japanese_novels (title, content) VALUES
('ノルウェイの森', '村上春樹の代表作で、愛と喪失をテーマにしています。'),
('吾輩は猫である', '夏目漱石による作品で、猫の視点から人間社会を描いています。'),
('雪国', '川端康成の作品で、美しい雪景色と切ない恋を描いています。'),
('千と千尋の神隠し', '宮崎駿の作品で、少女が異世界で成長する物語です。'),
('コンビニ人間', '村田沙耶香の作品で、現代社会の孤独と適応を描いています。');
OPTIMIZE TABLE japanese_novels FLUSH;

Observe the result of the search query for the Japanese text below.

SELECT *
FROM japanese_novels
WHERE MATCH(TABLE japanese_novels) AGAINST("content: 夏");
+-----------------------+-----------------------------------------------------------------------------------------+
| title                 | content                                                                                 |
+-----------------------+-----------------------------------------------------------------------------------------+
| 吾輩は猫である        | 夏目漱石による作品で、猫の視点から人間社会を描いています。                              |
+-----------------------+-----------------------------------------------------------------------------------------+
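Width normalization is easiest to see on full-width Latin letters and half-width Katakana. The Python sketch below uses Unicode NFKC normalization from the standard library as a stand-in. NFKC is broader than the cjk_width filter, so this is an approximation of the idea rather than an equivalent of the filter.

```python
import unicodedata

def fold_width(token):
    # NFKC maps full-width ASCII variants to basic Latin and
    # half-width Katakana to the regular (full-width) Katakana forms.
    return unicodedata.normalize("NFKC", token)

latin = fold_width("ＡＢＣ１２３")
katakana = fold_width("ｶﾀｶﾅ")
print(latin)
print(katakana)
```

After folding, a query typed with one width variant can match text that was stored with the other.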

Example 4: Korean Analyzer

Use the korean analyzer to search for Korean text.

Create a table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE k_drama (
genre VARCHAR(200),
movie_name TEXT,
cast TEXT,
FULLTEXT USING VERSION 2 (genre) INDEX_OPTIONS
'{
"analyzer": "korean"
}'
);
INSERT INTO k_drama (genre, movie_name, cast)
VALUES
('로맨스', '사랑의 불시착', '현빈, 손예진'),
('액션, 스릴러', '빈센조', '송중기, 전여빈'),
('드라마, 로맨스', '도깨비', '공유, 김고은'),
('사극, 드라마', '미스터 션샤인', '이병헌, 김태리'),
('코미디, 로맨스', '김비서가 왜 그럴까', '박서준, 박민영');
OPTIMIZE TABLE k_drama FLUSH;

Observe the result of the search query for the Korean text below.

SELECT * FROM k_drama WHERE match(TABLE k_drama) AGAINST("genre:로맨스");
+----------------------+----------------------------+----------------------+
| genre                | movie_name                 | cast                 |
+----------------------+----------------------------+----------------------+
| 드라마, 로맨스          | 도깨비                       | 공유, 김고은            |
| 코미디, 로맨스          | 김비서가 왜 그럴까              | 박서준, 박민영          |
| 로맨스                | 사랑의 불시착                  | 현빈, 손예진            |
+----------------------+----------------------------+----------------------+

Example 5: Custom Analyzer, standard tokenizer, Italian language, snowball_porter stemmer, elision filter

Use a custom analyzer with a standard tokenizer and the elision and snowball_porter token filters to search for Italian text.

In this example, the elision token filter removes elisions (such as the l’ in l’architettura) from the input tokens. The snowball_porter token filter stems words using the Lucene Snowball stemmer. Its language parameter selects the stemmer and is set to Italian here.

Create a table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE italian_architecture (
architecture VARCHAR(400),
description VARCHAR(400),
SORT KEY (architecture),
FULLTEXT USING VERSION 2 KEY(description)
INDEX_OPTIONS '{"analyzer" :
{"custom" : {"tokenizer" : "standard",
"token_filters": ["elision",
{"snowball_porter" : {"language": "Italian"}}]}}}'
);
INSERT INTO italian_architecture (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');
OPTIMIZE TABLE italian_architecture FLUSH;

Observe the result of the search query for the Italian text below.

SELECT *
FROM italian_architecture
WHERE MATCH(TABLE italian_architecture) AGAINST("description:l’architettura");
+-----------------+--------------------------------------------------------------------------------------------+
| architecture    | description                                                                                |
+-----------------+--------------------------------------------------------------------------------------------+
| Duomo di Milano | L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.   |
| Palazzo Ducale  | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.        |
+-----------------+--------------------------------------------------------------------------------------------+

Next, use a custom analyzer with a standard tokenizer and only the snowball_porter token filter (no elision token filter) to search for the same Italian text.

Create a second table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE italian_architecture_2 (
architecture VARCHAR(400),
description VARCHAR(400),
SORT KEY (architecture),
FULLTEXT USING VERSION 2 KEY(description)
INDEX_OPTIONS '{"analyzer" :
{"custom" : {"tokenizer" : "standard",
"token_filters": [{"snowball_porter" : {"language": "Italian"}}]}}}'
);
INSERT INTO italian_architecture_2 (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');
OPTIMIZE TABLE italian_architecture_2 FLUSH;

Observe the result of the same search query without the elision token filter.

SELECT *
FROM italian_architecture_2
WHERE MATCH(TABLE italian_architecture_2) AGAINST("description:l’architettura");
+----------------+---------------------------------------------------------------------------------------+
| architecture   | description                                                                           |
+----------------+---------------------------------------------------------------------------------------+
| Palazzo Ducale | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.   |
+----------------+---------------------------------------------------------------------------------------+
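What an elision filter does to a token can be sketched in a few lines of Python. The article list below is a hypothetical excerpt, and the case-insensitive comparison is an assumption of this sketch; the actual filter's article list and case handling depend on its configuration.

```python
# Both the ASCII apostrophe and the typographic apostrophe (U+2019) are handled.
APOSTROPHES = ("'", "\u2019")
# Hypothetical excerpt of elided articles (illustration only).
ARTICLES = {"l", "un", "dell", "all"}

def strip_elision(token):
    # Remove a leading elided article such as "l’" and keep the rest of the token.
    for apostrophe in APOSTROPHES:
        prefix, sep, rest = token.partition(apostrophe)
        if sep and rest and prefix.lower() in ARTICLES:
            return rest
    return token

elided = strip_elision("l\u2019architettura")
unchanged = strip_elision("gotica")
print(elided)
print(unchanged)
```

Once the article is removed, the query term reduces to "architettura" and can match plain occurrences of that word. Without the elision filter, the term keeps its l’ prefix, which is consistent with the second table returning fewer rows.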

Example 6: Custom Analyzer and N Gram Tokenizer

Use a custom analyzer and an n_gram tokenizer to search for misspelled text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE university
(name VARCHAR(400),
admission_page VARCHAR(400),
SORT KEY (name),
FULLTEXT USING VERSION 2 KEY(admission_page)
INDEX_OPTIONS '{"analyzer" : {"custom" : {"tokenizer" : "n_gram"}}}');
INSERT INTO university (name, admission_page) VALUES
('Harvard University', 'college.harvard.edu/admissions'),
('Stanford University', 'stanford.edu/admission/'),
('Massachusetts Institute of Technology (MIT)', 'mitadmissions.org/'),
('California Institute of Technology (Caltech)', 'admissions.caltech.edu/'),
('University of Chicago', 'uchicago.edu/en/admissions');
OPTIMIZE TABLE university FLUSH;

Observe the result of the search query for the misspelled text and compare the search result with the score below.

SELECT name,admission_page, MATCH(TABLE university) AGAINST("admission_page:cattec") AS score
FROM university
WHERE score
ORDER BY score DESC;
+----------------------------------------------+--------------------------------+---------------------+
| name                                         | admission_page                 | score               |
+----------------------------------------------+--------------------------------+---------------------+
| California Institute of Technology (Caltech) | admissions.caltech.edu/        |  2.4422175884246826 |
| University of Chicago                        | uchicago.edu/en/admissions     |  0.8550153970718384 |
| Harvard University                           | college.harvard.edu/admissions |  0.6825864911079407 |
| Stanford University                          | stanford.edu/admission/        |  0.5768249034881592 |
| Massachusetts Institute of Technology (MIT)  | mitadmissions.org/             | 0.26201900839805603 |
+----------------------------------------------+--------------------------------+---------------------+
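A rough intuition for why the misspelled query still ranks the Caltech row first is that "cattec" shares more 1-grams and 2-grams with that row than with any other. The Python sketch below counts shared grams for each row. This overlap count is a simplification introduced for illustration; the scores in the result above come from the database's relevance scoring, not from this formula.

```python
def gram_set(text, min_gram=1, max_gram=2):
    # Set of all character n-grams with sizes from min_gram to max_gram.
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }

PAGES = [
    "college.harvard.edu/admissions",
    "stanford.edu/admission/",
    "mitadmissions.org/",
    "admissions.caltech.edu/",
    "uchicago.edu/en/admissions",
]

def rank(query):
    # Score each page by how many distinct query grams it contains.
    query_grams = gram_set(query)
    scored = [(len(query_grams & gram_set(page)), page) for page in PAGES]
    return sorted(scored, reverse=True)

ranking = rank("cattec")
for score, page in ranking:
    print(score, page)
```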

Example 7: Custom Analyzer with N-Gram Tokenizer, html_strip as char_filters, and lower_case as token_filters

Use a custom analyzer, an n_gram tokenizer, html_strip as a character filter, and lower_case as a token filter to search for HTML entities in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE html_table_n_gram (
title VARCHAR(200),
content VARCHAR(200),
FULLTEXT USING VERSION 2 (title, content) INDEX_OPTIONS
'{
"analyzer": {
"custom": {"char_filters": ["html_strip"],
"tokenizer": "n_gram",
"token_filters":["lower_case"]
}
}
}'
);
INSERT INTO html_table_n_gram (title, content) VALUES
('Exciting News', 'We&apos;re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey &amp; I&apos;m excited!</p>'),
('Success Story', 'Our team has achieved great things &amp; we&apos;re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey &amp; supporting us!</p>'),
('Future Goals', 'We&apos;re looking forward to achieving even more!</p>');
OPTIMIZE TABLE html_table_n_gram FLUSH;

Observe the result of the search query for the misspelled HTML entity and compare the search result with the score below.

SELECT title,content, MATCH(TABLE html_table_n_gram) AGAINST("content:I',") AS score
FROM html_table_n_gram
WHERE score
ORDER BY score DESC;
+------------------+--------------------------------------------------------------------+---------------------+
| title            | content                                                            | score               |
+------------------+--------------------------------------------------------------------+---------------------+
| Learning Journey | Learning is a never-ending journey &amp; I&apos;m excited!</p>     |  0.5430432558059692 |
| Success Story    | Our team has achieved great things &amp; we&apos;re proud!</p>     | 0.31375283002853394 |
| Exciting News    | We&apos;re thrilled to announce our new project!</p>               | 0.26527124643325806 |
| Future Goals     | We&apos;re looking forward to achieving even more!</p>             |  0.2177681028842926 |
| Grateful Heart   | Thank you for being a part of our journey &amp; supporting us!</p> |  0.1819886565208435 |
+------------------+--------------------------------------------------------------------+---------------------+

Example 8: Portuguese Analyzer with score

Use a portuguese analyzer to search for a Portuguese text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE portuguese_news (
headline VARCHAR(200),
content TEXT,
FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
'{
"analyzer": "portuguese"
}'
);
INSERT INTO portuguese_news (headline, content) VALUES
('Cenário Econômico Brasileiro', 'O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.'),
('Mercado de Ações em Alta', 'As ações brasileiras registraram ganhos significativos, impulsionadas por resultados financeiros positivos de grandes empresas.'),
('Nova Política Monetária do Banco Central', 'O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.'),
('Investimentos Estrangeiros no Brasil', 'O país atraiu um aumento de investimentos estrangeiros diretos, especialmente em setores de tecnologia e energia renovável.'),
('Tendências do Mercado Imobiliário', 'O mercado imobiliário brasileiro mostra sinais de recuperação, com aumento nas vendas de imóveis e novos lançamentos.');
OPTIMIZE TABLE portuguese_news FLUSH;

Observe the result of the search query for the Portuguese text and compare the search result with the score below.

SELECT content, MATCH(TABLE portuguese_news) AGAINST ("content:Brasil") AS score
FROM portuguese_news
WHERE score
ORDER BY score DESC;
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| content                                                                                                                             | score               |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.                                  | 0.22189012169837952 |
| O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.       |  0.2059776782989502 |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+

Example 9: Spanish Analyzer with custom stopwords

Use a spanish analyzer with custom stopwords to search for a Spanish text in queries.

Create a table, insert data and optimize the table to ensure all data is included in results.

CREATE TABLE spanish_news (
headline VARCHAR(200),
content TEXT,
FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
'{
"analyzer": {"spanish": {"stopset": ["descubrimiento", "tratamiento", "nuevo"]}}
}'
);
INSERT INTO spanish_news (headline, content) VALUES
('Descubrimiento de un nuevo tratamiento para la diabetes', 'Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.'),
('Avances en la detección temprana del cáncer', 'Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.'),
('Nuevo enfoque para tratar enfermedades cardíacas', 'Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.'),
('Investigación sobre un gen relacionado con el Alzheimer', 'Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.'),
('Desarrollo de una vacuna contra COVID-19', 'Un equipo de investigadores ha anunciado resultados prometedores en la efectividad de una nueva vacuna contra COVID-19.');
OPTIMIZE TABLE spanish_news FLUSH;

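Observe the results of the two search queries below: one for a custom stopword (nuevo) and one for a default Spanish stopword (el). The custom stopwords defined above replace the default stopword list, so the search for nuevo returns no rows, while el is no longer treated as a stopword and is matched.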

SELECT *
FROM spanish_news
WHERE MATCH(TABLE spanish_news) AGAINST("content:nuevo");
Empty set 
SELECT *
FROM spanish_news
WHERE MATCH(TABLE spanish_news) AGAINST("content:el");
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| headline                                                 | content                                                                                                                                              |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Investigación sobre un gen relacionado con el Alzheimer  | Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.   |
| Avances en la detección temprana del cáncer              | Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.                             |
| Descubrimiento de un nuevo tratamiento para la diabetes  | Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.                         |
| Nuevo enfoque para tratar enfermedades cardíacas         | Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.                                         |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+

Supported Language Analyzers

The following table lists the supported language analyzers.

| Language | Default Stop Word List Link |
|---|---|
| arabic | Apache Lucene Arabic Stopwords |
| bulgarian | Apache Lucene Bulgarian Stopwords |
| bengali | Apache Lucene Bengali Stopwords |
| brazilian_portuguese | Apache Lucene Brazilian Portuguese Stopwords |
| catalan | Apache Lucene Catalan Stopwords |
| cjk | Apache Lucene CJK Stopwords |
| sorani_kurdish | Apache Lucene Sorani Kurdish Stopwords |
| czech | Apache Lucene Czech Stopwords |
| danish | Apache Lucene Danish Stopwords |
| german | Apache Lucene German Stopwords |
| greek | Apache Lucene Greek Stopwords |
| english | "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with" |
| spanish | Apache Lucene Spanish Stopwords |
| estonian | Apache Lucene Estonian Stopwords |
| basque | Apache Lucene Basque Stopwords |
| persian | Apache Lucene Persian Stopwords |
| finnish | Apache Lucene Finnish Stopwords |
| french | Apache Lucene French Stopwords |
| irish | Apache Lucene Irish Stopwords |
| galician | Apache Lucene Galician Stopwords |
| hindi | Apache Lucene Hindi Stopwords |
| hungarian | Apache Lucene Hungarian Stopwords |
| armenian | Apache Lucene Armenian Stopwords |
| indonesian | Apache Lucene Indonesian Stopwords |
| italian | Apache Lucene Italian Stopwords |
| korean | This is Apache Lucene's Korean (Nori) Analyzer. Filters tokens based on part-of-speech tags: EF, EC, ETN, ETM, IC, JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC, MAG, MAJ, MM, SP, SSC, SSO, SC, SE, XPN, XSA, XSN, XSV, UNA, NA, VSV. Refer to Part of speech tags. Custom stopword lists are not supported with the korean analyzer. |
| lithuanian | Apache Lucene Lithuanian Stopwords |
| latvian | Apache Lucene Latvian Stopwords |
| nepali | Apache Lucene Nepali Stopwords |
| dutch | Apache Lucene Dutch Stopwords |
| norwegian | Apache Lucene Norwegian Stopwords |
| portuguese | Apache Lucene Portuguese Stopwords |
| romanian | Apache Lucene Romanian Stopwords |
| russian | Apache Lucene Russian Stopwords |
| serbian | Apache Lucene Serbian Stopwords |
| swedish | Apache Lucene Swedish Stopwords |
| tamil | Apache Lucene Tamil Stopwords |
| telegu | Apache Lucene Telugu Stopwords |
| thai | Apache Lucene Thai Stopwords |
| turkish | Apache Lucene Turkish Stopwords |

Supported Tokenizers

The table below lists supported tokenizers. These tokenizers may have custom parameters, which can be obtained and used as described below.

Get Parameters

The parameters and description of each of these tokenizers can be obtained from the links included in the table.

Example: Get parameters for the uax_url_email tokenizer

To obtain the parameters for the uax_url_email tokenizer, follow the tokenizer factory link for the uax_url_email tokenizer, which can be found in the middle column of the table below.

The following field type definition, taken from the tokenizer factory link for the uax_url_email tokenizer, shows that this tokenizer has one parameter, maxTokenLength, which defaults to 255.

<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
</analyzer>
</fieldType>

The INDEX_OPTIONS string to create a full-text index with the uax_url_email tokenizer specifying a maxTokenLength of 300 is shown below.

INDEX_OPTIONS '{
"analyzer": {
"custom": {
"tokenizer": {
"uax_url_email" : {
"maxTokenLength": 300
}
}
}
}
}'
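
For context, the following is a minimal sketch of a complete CREATE TABLE statement that embeds the INDEX_OPTIONS string above. The table name contacts and its columns are hypothetical and chosen only for illustration; the FULLTEXT USING VERSION 2 clause follows the form used in the earlier examples.

```sql
-- Hypothetical table and column names; only the INDEX_OPTIONS JSON
-- is taken from the text above.
CREATE TABLE contacts (
id INT,
notes TEXT,
FULLTEXT USING VERSION 2 (notes) INDEX_OPTIONS
'{
"analyzer": {
"custom": {
"tokenizer": {
"uax_url_email": {
"maxTokenLength": 300
}
}
}
}
}'
);
```

This custom analyzer specifies only a tokenizer, which is allowed because character filters and token filters are optional.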

List of Supported Tokenizers

| "tokenizer" (Case-Sensitive) | Tokenizer Factory Link (Includes Parameters) | Tokenizer Class Link (Includes Description) |
|---|---|---|
| uax_url_email | UAX29URLEmailTokenizerFactory | UAX29URLEmailTokenizer |
| whitespace | WhitespaceTokenizerFactory | WhitespaceTokenizer |
| classic | ClassicTokenizerFactory | ClassicTokenizer |
| simple_pattern | SimplePatternTokenizerFactory | SimplePatternTokenizer |
| standard | StandardTokenizerFactory | StandardTokenizer |
| keyword | KeywordTokenizerFactory | KeywordTokenizer |
| letter | LetterTokenizerFactory | LetterTokenizer |
| simple_pattern_split | SimplePatternSplitTokenizerFactory | SimplePatternSplitTokenizer |
| pattern | PatternTokenizerFactory | PatternTokenizer |
| thai | ThaiTokenizerFactory | ThaiTokenizer |
| edge_n_gram | EdgeNGramTokenizerFactory | EdgeNGramTokenizer |
| n_gram | NGramTokenizerFactory | NGramTokenizer |
| wikipedia | WikipediaTokenizerFactory | WikipediaTokenizer |
| path_hierarchy | PathHierarchyTokenizerFactory | PathHierarchyTokenizer |

korean

Description: Tokenizer for Korean that uses morphological analysis.

Supports the following attributes:

  • userDictionary (JSON array of strings): Each string is a term in the user dictionary.

  • decompoundMode (JSON string): Determines how the tokenizer handles POS.Type.COMPOUND, POS.Type.INFLECT, and POS.Type.PREANALYSIS tokens. Valid values are 'none', 'discard', and 'mixed'. The default is 'discard'.

  • outputUnknownUnigrams (JSON boolean value): If "true", outputs unigrams for unknown words.

  • discardPunctuation (JSON boolean value): If "true", punctuation tokens are dropped from the output.
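
The attributes above can be passed in the same nested form as the uax_url_email example. The following is a minimal sketch, not a definitive configuration: the table name korean_docs, its columns, and the sample dictionary term are hypothetical, and the boolean is written as an unquoted JSON value on the assumption that "JSON boolean value" means a JSON literal rather than a string.

```sql
-- Hypothetical table; attribute names come from the list above.
-- "mixed" overrides the default decompoundMode of "discard".
CREATE TABLE korean_docs (
title VARCHAR(200),
content TEXT,
FULLTEXT USING VERSION 2 (content) INDEX_OPTIONS
'{
"analyzer": {
"custom": {
"tokenizer": {
"korean": {
"decompoundMode": "mixed",
"discardPunctuation": true,
"userDictionary": ["세종시"]
}
}
}
}
}'
);
```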

Supported Token Filters

The table below lists the supported token filters. Each entry gives the token filter name and a link to the Lucene token filter factory documentation, which provides the parameters and a description of the token filter. Refer to the previous section for an example of obtaining and using parameters.

| "token_filters" (Case-Sensitive) | Lucene Link for Parameters and Description |
|---|---|
| russian_light_stem | RussianLightStemFilterFactory |
| scandinavian_normalization | ScandinavianNormalizationFilterFactory |
| decimal_digit | DecimalDigitFilterFactory |
| ascii_folding | ASCIIFoldingFilterFactory |
| german_stem | GermanStemFilterFactory |
| bulgarian_stem | BulgarianStemFilterFactory |
| codepoint_count | CodepointCountFilterFactory |
| pattern_replace | PatternReplaceFilterFactory |
| persian_normalization | PersianNormalizationFilterFactory |
| limit_token_position | LimitTokenPositionFilterFactory |
| porter_stem | PorterStemFilterFactory |
| greek_stem | GreekStemFilterFactory |
| finnish_light_stem | FinnishLightStemFilterFactory |
| fingerprint | FingerprintFilterFactory |
| cjk_width | CJKWidthFilterFactory |
| reverse_string | ReverseStringFilterFactory |
| common_grams | CommonGramsFilterFactory |
| delimited_boost_token | DelimitedBoostTokenFilterFactory |
| scandinavian_folding | ScandinavianFoldingFilterFactory |
| hindi_stem | HindiStemFilterFactory |
| spanish_plural_stem | SpanishPluralStemFilterFactory |
| indonesian_stem | IndonesianStemFilterFactory |
| trim | TrimFilterFactory |
| french_light_stem | FrenchLightStemFilterFactory |
| classic | ClassicFilterFactory |
| fixed_shingle | FixedShingleFilterFactory |
| english_possessive | EnglishPossessiveFilterFactory |
| german_normalization | GermanNormalizationFilterFactory |
| keyword_repeat | KeywordRepeatFilterFactory |
| min_hash | MinHashFilterFactory |
| remove_duplicates_token | RemoveDuplicatesTokenFilterFactory |
| snowball_porter | SnowballPorterFilterFactory |
| german_minimal_stem | GermanMinimalStemFilterFactory |
| norwegian_light_stem | NorwegianLightStemFilterFactory |
| english_minimal_stem | EnglishMinimalStemFilterFactory |
| norwegian_minimal_stem | NorwegianMinimalStemFilterFactory |
| czech_stem | CzechStemFilterFactory |
| sorani_stem | SoraniStemFilterFactory |
| limit_token_offset | LimitTokenOffsetFilterFactory |
| persian_stem | PersianStemFilterFactory |
| common_grams_query | CommonGramsQueryFilterFactory |
| sorani_normalization | SoraniNormalizationFilterFactory |
| swedish_light_stem | SwedishLightStemFilterFactory |
| k_stem | KStemFilterFactory |
| french_minimal_stem | FrenchMinimalStemFilterFactory |
| hyphenated_words | HyphenatedWordsFilterFactory |
| capitalization | CapitalizationFilterFactory |
| lower_case | LowerCaseFilterFactory |
| hungarian_light_stem | HungarianLightStemFilterFactory |
| telugu_stem | TeluguStemFilterFactory |
| italian_light_stem | ItalianLightStemFilterFactory |
| limit_token_count | LimitTokenCountFilterFactory |
| swedish_minimal_stem | SwedishMinimalStemFilterFactory |
| galician_minimal_stem | GalicianMinimalStemFilterFactory |
| portuguese_minimal_stem | PortugueseMinimalStemFilterFactory |
| bengali_normalization | BengaliNormalizationFilterFactory |
| galician_stem | GalicianStemFilterFactory |
| turkish_lower_case | TurkishLowerCaseFilterFactory |
| bengali_stem | BengaliStemFilterFactory |
| indic_normalization | IndicNormalizationFilterFactory |
| keep_word | KeepWordFilterFactory |
| drop_if_flagged | DropIfFlaggedFilterFactory |
| latvian_stem | LatvianStemFilterFactory |
| portuguese_light_stem | PortugueseLightStemFilterFactory |
| apostrophe | ApostropheFilterFactory |
| arabic_stem | ArabicStemFilterFactory |
| delimited_term_frequency_token | DelimitedTermFrequencyTokenFilterFactory |
| irish_lower_case | IrishLowerCaseFilterFactory |
| edge_n_gram | EdgeNGramFilterFactory |
| german_light_stem | GermanLightStemFilterFactory |
| pattern_capture_group | PatternCaptureGroupFilterFactory |
| spanish_light_stem | SpanishLightStemFilterFactory |
| hindi_normalization | HindiNormalizationFilterFactory |
| norwegian_normalization | NorwegianNormalizationFilterFactory |
| shingle | ShingleFilterFactory |
| telugu_normalization | TeluguNormalizationFilterFactory |
| date_recognizer | DateRecognizerFilterFactory |
| n_gram | NGramFilterFactory |
| upper_case | UpperCaseFilterFactory |
| brazilian_stem | BrazilianStemFilterFactory |
| cjk_bigram | CJKBigramFilterFactory |
| truncate_token | TruncateTokenFilterFactory |
| greek_lower_case | GreekLowerCaseFilterFactory |
| length | LengthFilterFactory |
| arabic_normalization | ArabicNormalizationFilterFactory |
| portuguese_stem | PortugueseStemFilterFactory |
| elision | ElisionFilterFactory |
| korean_part_of_speech | KoreanPartOfSpeechStopFilterFactory. A token filter that removes tokens that match a set of part-of-speech tags. |
| korean_reading_form | KoreanReadingFormFilterFactory. A token filter that rewrites tokens written in Hanja to their Hangul form. |
| korean_number | KoreanNumberFilterFactory. A token filter that normalizes Korean numbers to Arabic decimal numbers in half-width characters. |
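
As a rough illustration of how token filters from this list might be combined, the sketch below lowercases tokens and then keeps only tokens between 2 and 40 characters long. Several details are assumptions rather than confirmed syntax: that "token_filters" takes a JSON array, that a filter without parameters can be given as a plain string, that a filter with parameters uses the same name-to-parameters nesting shown for the tokenizer, and that the tokenizer can be named with a plain string. The min and max parameters are those of Lucene's LengthFilterFactory. The table name articles and its columns are hypothetical.

```sql
-- Hypothetical table; the array form of "token_filters" is an assumption
-- modeled on the nested tokenizer syntax shown earlier.
CREATE TABLE articles (
title VARCHAR(200),
body TEXT,
FULLTEXT USING VERSION 2 (body) INDEX_OPTIONS
'{
"analyzer": {
"custom": {
"tokenizer": "standard",
"token_filters": [
"lower_case",
{"length": {"min": 2, "max": 40}}
]
}
}
}'
);
```

Token filters receive the stream of tokens in sequence, so under these assumptions the order of the array would determine the order in which the filters run.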

Supported Character Filters

The table below lists the supported character filters. Each entry gives the character filter name and a link to the Lucene documentation for its parameters.

| "char_filters" (Case-Sensitive) | Lucene Link for Parameters |
|---|---|
| persian | PersianCharFilterFactory |
| cjk_width | CJKWidthCharFilterFactory |
| html_strip | HTMLStripCharFilterFactory |
| pattern_replace | PatternReplaceCharFilterFactory |
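
A character filter runs on the raw text before tokenization. The sketch below strips HTML tags and replaces hyphens with spaces before the whitespace tokenizer runs. As with the other sketches, the syntax is an assumption: "char_filters" is taken to be a JSON array whose entries are either a filter name or a name-to-parameters object, and the tokenizer is named with a plain string. The pattern and replacement parameters are those of Lucene's PatternReplaceCharFilterFactory. The table name web_pages and its columns are hypothetical.

```sql
-- Hypothetical table; the array form of "char_filters" is an assumption
-- modeled on the nested tokenizer syntax shown earlier.
CREATE TABLE web_pages (
url VARCHAR(500),
html TEXT,
FULLTEXT USING VERSION 2 (html) INDEX_OPTIONS
'{
"analyzer": {
"custom": {
"char_filters": [
"html_strip",
{"pattern_replace": {"pattern": "-", "replacement": " "}}
],
"tokenizer": "whitespace"
}
}
}'
);
```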

Last modified: January 17, 2025
