Full Text VERSION 2 Custom Analyzers
SingleStore supports custom analyzers for full-text VERSION 2 search.
A tokenizer takes a stream of characters and breaks that stream into individual tokens, for example, by splitting on whitespace characters.
Users can choose from a list of pre-configured analyzers and use them without any modifications.
Refer to Working with Full-Text Search for more information on full-text search.
Specify an Analyzer
Specify an analyzer by passing an analyzer configuration in JSON format to INDEX_OPTIONS, which is a JSON string that contains the index configuration. The value of the analyzer key takes one of two forms:
- The name of a built-in analyzer (e.g., standard, cjk) as a string.
- A customized built-in analyzer or a custom analyzer as a nested JSON value.
The three examples below show a built-in analyzer with no customizations, a built-in analyzer with a customized set of stopwords, and a custom analyzer.
Specify the built-in analyzer for Chinese, Japanese, and Korean characters, called the cjk analyzer, with no customizations.
CREATE TABLE t (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (content)
    INDEX_OPTIONS '{"analyzer": "cjk"}'
);
Specify the built-in cjk analyzer with a customized set of stopwords.
CREATE TABLE t (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (content)
    INDEX_OPTIONS '{"analyzer": {"cjk": {"stopset": ["这","那"]}}}'
);
Note
In addition to the cjk analyzer, the Korean nori analyzer is also supported.
Specify a custom analyzer, which uses the whitespace tokenizer, the html_strip character filter, and the lower_case token filter. The custom analyzer is configured under the custom key using the tokenizer, char_filters, and token_filters key-value pairs.
CREATE TABLE t (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (content)
    INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": "whitespace", "char_filters": ["html_strip"], "token_filters": ["lower_case"]}}}'
);
Analyzers
There are two types of analyzers: built-in analyzers and custom analyzers.
Note
The examples in this section show only the INDEX_OPTIONS string (JSON) and omit the rest of the index creation command.
Built-in Analyzers
Built-in analyzers are pre-configured and do not require configuration. They include the standard analyzer and language-specific analyzers.
The default analyzer is the Apache Lucene standard analyzer, which uses the Apache Lucene standard tokenizer, lowercase token filters, and no stopwords.
Specify a built-in analyzer, without customizations, by specifying the name of the analyzer as the value of the analyzer key.
The following example specifies the use of the spanish language analyzer.
INDEX_OPTIONS '{"analyzer" : "spanish"}'
A custom stopword list can be specified for a built-in analyzer by specifying a stopset in the JSON.
The following example specifies a custom stopword list for the spanish analyzer.
The value of the analyzer key is a nested JSON value consisting of a key-value pair whose key is the name of the analyzer (spanish in this example) and whose value is another key-value pair consisting of the key stopset and a JSON array of stop words as the value.
INDEX_OPTIONS '{"analyzer": {"spanish": {"stopset": ["el","la"]}}}'
SingleStore recommends using the default language analyzer, without stopword customization, in most cases, e.g., '{"analyzer" : "catalan"}'.
Refer to Supported Language Analyzers for links to the default list of stop words for each analyzer.
Custom Analyzers
Create a custom analyzer by using the analyzer name custom and by specifying a tokenizer and optional token and character filters.
A custom analyzer consists of:
- A required tokenizer: A tokenizer breaks up incoming text into tokens. In many cases, an analyzer will use a tokenizer as the first step in the analysis process. However, to modify text prior to tokenization, use char_filters (see below).
- An optional array of token_filters: A token_filter modifies tokens that have been created by the tokenizer. Common modifications performed by a token_filter are deletion, stemming, and case folding.
- An optional array of char_filters: A char_filter transforms the text before it is tokenized, while providing corrected character offsets to account for these modifications.
The example below shows the use of all three components: tokenizer, char_filters, and token_filters.
INDEX_OPTIONS '{"analyzer" : {"custom": {"tokenizer": "whitespace", "char_filters": ["html_strip"], "token_filters": ["lower_case"]}}}'
Each of these three components (tokenizer, char_filters, token_filters) can be specified as a string with the name of the component or as a nested JSON with a configuration for the component.
The example below specifies a custom analyzer that uses the whitespace tokenizer, with a maximum token length of 256 characters.
INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": {"whitespace": {"maxTokenLen": 256}}}}}'
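Because the nested form is easy to mistype by hand, one option is to generate the INDEX_OPTIONS value from application code. The following is a minimal Python sketch under that assumption. The helper name custom_index_options is hypothetical and not part of SingleStore, and it only serializes JSON; it does not check that the component names or parameters are supported.

```python
import json


def custom_index_options(tokenizer, char_filters=None, token_filters=None):
    """Return the JSON string for a custom analyzer (hypothetical helper).

    Each component is either a name (str) or a nested dict such as
    {"whitespace": {"maxTokenLen": 256}}.
    """
    custom = {"tokenizer": tokenizer}
    if char_filters:
        custom["char_filters"] = list(char_filters)
    if token_filters:
        custom["token_filters"] = list(token_filters)
    # json.dumps emits well-formed JSON: balanced braces, no trailing commas.
    return json.dumps({"analyzer": {"custom": custom}})


options = custom_index_options(
    {"whitespace": {"maxTokenLen": 256}},
    char_filters=["html_strip"],
    token_filters=["lower_case"],
)
print(f"INDEX_OPTIONS '{options}'")
```

The printed string can then be pasted into the index creation command in place of a hand-written INDEX_OPTIONS value.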
Common Tokenizers
Common tokenizers that are supported are listed in the table below.
"tokenizer" (case-sensitive) | Parameters (includes Lucene Link) | Description (includes Lucene Link) |
---|---|---|
whitespace | maxTokenLen | Divides text at whitespace characters as defined by Character.isWhitespace(int). |
standard | | Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. |
n_gram | | Tokenizes the input into n-grams of the specified size(s). |
uax_url_email | maxTokenLength | Implements Word Break rules from Unicode Text Segmentation: Unicode Standard Annex #29. |
Common Token Filters
Common token filters that are supported are listed in the table below.
"token_filters" | Parameters (includes Lucene Link) | Description (includes Lucene Link) |
---|---|---|
shingle | | Constructs shingles (token n-grams); that is, it creates combinations of tokens as a single token. |
lower_case | No parameters. | Normalizes token text to lower case. |
snowball_porter | language | Stems words using a Snowball-generated stemmer. |
n_gram | | Tokenizes the input into n-grams of the given size(s). |
Common Character Filters
Common character filters that are supported are listed in the table below.
"char_filters" | Parameters (includes Lucene Link) | Description (includes Lucene Link) |
---|---|---|
html_strip | | Wraps another Reader and attempts to strip out HTML. |
Example of Using Parameters
Specify a default uax_url_email tokenizer:
INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": "uax_url_email"}}}'
Specify a uax_url_email tokenizer with custom parameters:
INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": {"uax_url_email" : {"maxTokenLength": 300}}}}}'
Stemming
Stemming transforms words to their root, often by removing suffixes and prefixes.
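As a rough illustration of the idea, the Python sketch below strips a few English suffixes so that related word forms collapse to a single term. It is a deliberately naive, hypothetical stemmer written for this example; it is not the Snowball algorithm used by the snowball_porter token filter.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s"), min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


forms = ["search", "searches", "searched", "searching"]
stems = {naive_stem(form) for form in forms}
print(stems)
```

All four forms reduce to the same stem, so a query for any one of them can match documents containing the others once both the indexed text and the query are stemmed.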
NGrams
NGram tokenizers split words into small pieces and are good for fast "fuzzy-style" matching using a full-text index.
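To see why n-grams tolerate misspellings, the Python sketch below compares the character n-grams of an indexed word with those of a misspelled query term. The gram size n=2 is an assumption chosen for illustration and is not necessarily the size the n_gram tokenizer uses.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of length n in text."""
    return {text[i : i + n] for i in range(len(text) - n + 1)}


indexed = char_ngrams("caltech")
query = char_ngrams("cattec")  # misspelled query term
shared = indexed & query
print(sorted(shared))
```

Even though the two words differ, they still share several grams, which gives a full-text index something to match and score on.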
Examples
Example 1: Custom Analyzer with Whitespace tokenizer
Use a custom analyzer and a whitespace tokenizer to search for text with a hyphen in queries.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE medium_articles (
  title VARCHAR(200),
  summary TEXT,
  FULLTEXT USING VERSION 2 (summary)
    INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": "whitespace"}}}'
);

INSERT INTO medium_articles (title, summary) VALUES
('Build Real-Time Multimodal RAG Applications Using SingleStore!', 'This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses.'),
('Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case', 'This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency.'),
('Scaling RAG from POC to Production', 'This guide explains Retrieval-Augmented Generation (RAG) for building reliable, context-aware applications using large language models (LLMs) and emphasizes the importance of scaling from proof of concept to production.'),
('Tech Stack For Production-Ready LLM Applications In 2024', 'This guide reviews preferred tools for the entire LLM app development lifecycle, emphasizing simplicity and ease of use in building scalable AI applications.'),
('LangGraph + Gemini Pro + Custom Tool + Streamlit = Multi-Agent Application Development', 'This guide teaches you to create a chatbot using LangGraph and Streamlit, leveraging LangChain for building stateful multi-actor applications that respond to user support requests.');

OPTIMIZE TABLE medium_articles FLUSH;
Observe the difference between the results of the two search queries below.
SELECT *
FROM medium_articles
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multimodal");
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title | summary |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Build Real-Time Multimodal RAG Applications Using SingleStore! | This guide teaches you how to build a multimodal Retrieval-Augmented Generation (RAG) application using SingleStore, integrating various data types for enhanced AI responses. |
+----------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
SELECT *
FROM medium_articles
WHERE MATCH(TABLE medium_articles) AGAINST ("summary:multi-modal");
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| title | summary |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Building Production-Ready AI Agents with LangGraph: A Real-Life Use Case | This guide offers a solution for creating a scalable, production-ready multi-modal chatbot using LangChain, focusing on dividing tasks for improved control and efficiency. |
+--------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
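The difference between the two results comes down to where token boundaries fall. The Python sketch below contrasts whitespace splitting with a rough stand-in for a word-boundary tokenizer; the regular expression is an assumption for illustration only and does not implement the Unicode word break rules.

```python
import re

summary = "a scalable, production-ready multi-modal chatbot using LangChain"

# Whitespace tokenization: split only on runs of whitespace.
whitespace_tokens = summary.split()

# Rough stand-in for a word-boundary tokenizer: keep runs of letters and digits.
word_tokens = re.findall(r"[A-Za-z0-9]+", summary)

print("multi-modal" in whitespace_tokens)
print("multi-modal" in word_tokens)
```

With whitespace splitting, multi-modal survives as one token, so it can be matched exactly and stays distinct from multimodal. Whitespace splitting also keeps adjacent punctuation, so the first list contains scalable, with its trailing comma.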
Example 2: Custom Analyzer with Whitespace tokenizer, html_strip as char_filters, and lower_case as token_filters
Use a custom analyzer, a whitespace tokenizer, html_strip as a character filter, and lower_case as a token filter to search for HTML entities in queries.
A character filter receives the original text data and converts it into a predefined format.
In this example, html_strip as a character filter removes HTML tags and lower_case as a token filter lowercases the tokens.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE html_table (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (title, content)
    INDEX_OPTIONS '{"analyzer": {"custom": {"char_filters": ["html_strip"], "tokenizer": "whitespace", "token_filters": ["lower_case"]}}}'
);

INSERT INTO html_table (title, content) VALUES
('Exciting News', 'We''re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey & I''m excited!</p>'),
('Success Story', 'Our team has achieved great things & we''re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey & supporting us!</p>'),
('Future Goals', 'We''re looking forward to achieving even more!</p>');

OPTIMIZE TABLE html_table FLUSH;
Search for an HTML entity and observe the result of the search query.
SELECT *
FROM html_table
WHERE MATCH(TABLE html_table) AGAINST("content:we're");
+---------------+----------------------------------------------------------------+
| title | content |
+---------------+----------------------------------------------------------------+
| Success Story | Our team has achieved great things & we're proud!</p> |
| Exciting News | We're thrilled to announce our new project!</p> |
| Future Goals | We're looking forward to achieving even more!</p> |
+---------------+----------------------------------------------------------------+
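The analysis chain in this example can be approximated in a few lines of Python. This is a simplified sketch, not the filter's actual implementation: the regular expression for tags, the use of html.unescape for entities, and the sample input string are all assumptions made for illustration.

```python
import html
import re


def analyze(text):
    """Mimic html_strip -> whitespace tokenizer -> lower_case on a string."""
    without_tags = re.sub(r"<[^>]+>", " ", text)  # drop tags such as <p> and </p>
    decoded = html.unescape(without_tags)         # decode entities such as &#39; and &amp;
    return [token.lower() for token in decoded.split()]


tokens = analyze("<p>We&#39;re thrilled &amp; proud!</p>")
print(tokens)
```

Because the tokens are lowercased after the markup is removed, a query for we're can match text that was stored as We're.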
Example 3: Custom Analyzer with standard tokenizer and cjk_width as token_filter
Use a custom analyzer, a standard tokenizer, and cjk_width as a token filter to search for Japanese text in queries.
In this example, cjk_width as a token filter normalizes the width differences in CJK (Chinese, Japanese, and Korean) characters.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE japanese_novels (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (title, content)
    INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": "standard", "token_filters": ["cjk_width"]}}}'
);

INSERT INTO japanese_novels (title, content) VALUES
('ノルウェイの森', '村上春樹の代表作で、愛と喪失をテーマにしています。'),
('吾輩は猫である', '夏目漱石による作品で、猫の視点から人間社会を描いています。'),
('雪国', '川端康成の作品で、美しい雪景色と切ない恋を描いています。'),
('千と千尋の神隠し', '宮崎駿の作品で、少女が異世界で成長する物語です。'),
('コンビニ人間', '村田沙耶香の作品で、現代社会の孤独と適応を描いています。');

OPTIMIZE TABLE japanese_novels FLUSH;
Observe the result of the search query for the Japanese text below.
SELECT *
FROM japanese_novels
WHERE MATCH(TABLE japanese_novels) AGAINST("content: 夏");
+-----------------------+-----------------------------------------------------------------------------------------+
| title | content |
+-----------------------+-----------------------------------------------------------------------------------------+
| 吾輩は猫である | 夏目漱石による作品で、猫の視点から人間社会を描いています。 |
+-----------------------+-----------------------------------------------------------------------------------------+
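Width normalization is easiest to see on text that mixes full-width and half-width forms. The Python sketch below uses Unicode NFKC normalization as a stand-in; this is an assumption for illustration, since NFKC folds more than width differences and is not the cjk_width filter itself.

```python
import unicodedata


def fold_width(text):
    """Fold full-width and half-width variants using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


print(fold_width("ＡＢＣ１２３"))  # full-width Latin letters and digits
print(fold_width("ｶﾀｶﾅ"))  # half-width katakana
```

After folding, text typed in either width produces the same tokens, so a query matches regardless of which form was stored.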
Example 4: Korean Analyzer
Use a korean analyzer to search for Korean text in queries.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE k_drama (
  genre VARCHAR(200),
  movie_name TEXT,
  cast TEXT,
  FULLTEXT USING VERSION 2 (genre)
    INDEX_OPTIONS '{"analyzer": "korean"}'
);

INSERT INTO k_drama (genre, movie_name, cast) VALUES
('로맨스', '사랑의 불시착', '현빈, 손예진'),
('액션, 스릴러', '빈센조', '송중기, 전여빈'),
('드라마, 로맨스', '도깨비', '공유, 김고은'),
('사극, 드라마', '미스터 션샤인', '이병헌, 김태리'),
('코미디, 로맨스', '김비서가 왜 그럴까', '박서준, 박민영');

OPTIMIZE TABLE k_drama FLUSH;
Observe the result of the search query for the Korean text below.
SELECT * FROM k_drama WHERE match(TABLE k_drama) AGAINST("genre:로맨스");
+----------------------+----------------------------+----------------------+
| genre | movie_name | cast |
+----------------------+----------------------------+----------------------+
| 드라마, 로맨스 | 도깨비 | 공유, 김고은 |
| 코미디, 로맨스 | 김비서가 왜 그럴까 | 박서준, 박민영 |
| 로맨스 | 사랑의 불시착 | 현빈, 손예진 |
+----------------------+----------------------------+----------------------+
Example 5: Custom Analyzer, standard tokenizer, Italian language, snowball_porter stemmer, elision filter
Use a custom analyzer, a standard tokenizer, and elision and snowball_porter as token filters for the Italian language to search for Italian text.
In this example, the elision token filter removes specific elisions from the input tokens. The snowball_porter token filter stems the words using the Lucene Snowball stemmer. The snowball_porter token filter requires a language parameter to control the stemmer.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE italian_architecture (
  architecture VARCHAR(400),
  description VARCHAR(400),
  SORT KEY (architecture),
  FULLTEXT USING VERSION 2 KEY (description)
    INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": "standard", "token_filters": ["elision", {"snowball_porter": {"language": "Italian"}}]}}}'
);

INSERT INTO italian_architecture (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');

OPTIMIZE TABLE italian_architecture FLUSH;
Observe the result of the search query for the Italian text below.
SELECT *
FROM italian_architecture
WHERE MATCH(TABLE italian_architecture) AGAINST("description:l’architettura");
+-----------------+--------------------------------------------------------------------------------------------+
| architecture | description |
+-----------------+--------------------------------------------------------------------------------------------+
| Duomo di Milano | L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie. |
| Palazzo Ducale | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia. |
+-----------------+--------------------------------------------------------------------------------------------+
Use a custom analyzer, a standard tokenizer, and snowball_porter as a token filter for the Italian language, without the elision token filter, to search for Italian text in queries.
Create a second table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE italian_architecture_2 (
  architecture VARCHAR(400),
  description VARCHAR(400),
  SORT KEY (architecture),
  FULLTEXT USING VERSION 2 KEY (description)
    INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": "standard", "token_filters": [{"snowball_porter": {"language": "Italian"}}]}}}'
);

INSERT INTO italian_architecture_2 (architecture, description) VALUES
('Colosseo', 'Un antico anfiteatro situato a Roma, noto per i combattimenti dei gladiatori.'),
('Torre Pendente di Pisa', 'Un campanile famoso per la sua inclinazione non intenzionale.'),
('Basilica di San Pietro', 'Una chiesa rinascimentale in Vaticano, famosa per la sua cupola progettata da Michelangelo.'),
('Duomo di Milano', 'L’architettura di Milano, nota per la sua straordinaria architettura gotica e le guglie.'),
('Palazzo Ducale', 'Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia.');

OPTIMIZE TABLE italian_architecture_2 FLUSH;
Observe the result of the search query for the Italian text without elision token filter below.
SELECT *
FROM italian_architecture_2
WHERE MATCH(TABLE italian_architecture_2) AGAINST("description:l’architettura");
+----------------+---------------------------------------------------------------------------------------+
| architecture | description |
+----------------+---------------------------------------------------------------------------------------+
| Palazzo Ducale | Il Palazzo dei Dogi a Venezia, che mostra l’architettura gotica e una ricca storia. |
+----------------+---------------------------------------------------------------------------------------+
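The effect of removing elisions can be sketched in Python as follows. The article list and the case-insensitive comparison are assumptions made for this illustration and do not reflect the elision filter's actual defaults.

```python
# Hypothetical article list for illustration; not the filter's actual default set.
ARTICLES = {"l", "un", "dell", "all", "nell"}


def strip_elision(token, articles=ARTICLES):
    """Remove a leading elided article such as l' or l’ from a token."""
    for apostrophe in ("'", "’"):
        prefix, sep, rest = token.partition(apostrophe)
        if sep and rest and prefix.lower() in articles:
            return rest
    return token


print(strip_elision("l’architettura"))
print(strip_elision("L’architettura"))
print(strip_elision("architettura"))
```

Once the elided article is stripped, l’architettura and architettura reduce to the same term, which is consistent with the first table returning both rows while the second table, indexed without the elision filter, returns only one.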
Example 6: Custom Analyzer and N-Gram Tokenizer
Use a custom analyzer and an n_gram tokenizer to search for misspelled text in queries.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE university (
  name VARCHAR(400),
  admission_page VARCHAR(400),
  SORT KEY (name),
  FULLTEXT USING VERSION 2 KEY (admission_page)
    INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": "n_gram"}}}'
);

INSERT INTO university (name, admission_page) VALUES
('Harvard University', 'college.harvard.edu/admissions'),
('Stanford University', 'stanford.edu/admission/'),
('Massachusetts Institute of Technology (MIT)', 'mitadmissions.org/'),
('California Institute of Technology (Caltech)', 'admissions.caltech.edu/'),
('University of Chicago', 'uchicago.edu/en/admissions');

OPTIMIZE TABLE university FLUSH;
Observe the result of the search query for the misspelled text and compare the search result with the score below.
SELECT name, admission_page,
  MATCH(TABLE university) AGAINST("admission_page:cattec") AS score
FROM university
WHERE score
ORDER BY score DESC;
+----------------------------------------------+--------------------------------+---------------------+
| name | admission_page | score |
+----------------------------------------------+--------------------------------+---------------------+
| California Institute of Technology (Caltech) | admissions.caltech.edu/ | 2.4422175884246826 |
| University of Chicago | uchicago.edu/en/admissions | 0.8550153970718384 |
| Harvard University | college.harvard.edu/admissions | 0.6825864911079407 |
| Stanford University | stanford.edu/admission/ | 0.5768249034881592 |
| Massachusetts Institute of Technology (MIT) | mitadmissions.org/ | 0.26201900839805603 |
+----------------------------------------------+--------------------------------+---------------------+
Example 7: Custom Analyzer with N-Gram Tokenizer, html_strip as char_filters, and lower_case as token_filters
Use a custom analyzer, an n_gram tokenizer, html_strip as a character filter, and lower_case as a token filter to search for HTML entities in queries.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE html_table_n_gram (
  title VARCHAR(200),
  content VARCHAR(200),
  FULLTEXT USING VERSION 2 (title, content)
    INDEX_OPTIONS '{"analyzer": {"custom": {"char_filters": ["html_strip"], "tokenizer": "n_gram", "token_filters": ["lower_case"]}}}'
);

INSERT INTO html_table_n_gram (title, content) VALUES
('Exciting News', 'We''re thrilled to announce our new project!</p>'),
('Learning Journey', 'Learning is a never-ending journey & I''m excited!</p>'),
('Success Story', 'Our team has achieved great things & we''re proud!</p>'),
('Grateful Heart', 'Thank you for being a part of our journey & supporting us!</p>'),
('Future Goals', 'We''re looking forward to achieving even more!</p>');

OPTIMIZE TABLE html_table_n_gram FLUSH;
Observe the result of the search query for the misspelled HTML entity and compare the search result with the score below.
SELECT title, content,
  MATCH(TABLE html_table_n_gram) AGAINST("content:I',") AS score
FROM html_table_n_gram
WHERE score
ORDER BY score DESC;
+------------------+--------------------------------------------------------------------+---------------------+
| title | content | score |
+------------------+--------------------------------------------------------------------+---------------------+
| Learning Journey | Learning is a never-ending journey & I'm excited!</p> | 0.5430432558059692 |
| Success Story | Our team has achieved great things & we're proud!</p> | 0.31375283002853394 |
| Exciting News | We're thrilled to announce our new project!</p> | 0.26527124643325806 |
| Future Goals | We're looking forward to achieving even more!</p> | 0.2177681028842926 |
| Grateful Heart | Thank you for being a part of our journey & supporting us!</p> | 0.1819886565208435 |
+------------------+--------------------------------------------------------------------+---------------------+
Example 8: Portuguese Analyzer with score
Use a portuguese analyzer to search for Portuguese text in queries.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE portuguese_news (
  headline VARCHAR(200),
  content TEXT,
  FULLTEXT USING VERSION 2 (content)
    INDEX_OPTIONS '{"analyzer": "portuguese"}'
);

INSERT INTO portuguese_news (headline, content) VALUES
('Cenário Econômico Brasileiro', 'O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada.'),
('Mercado de Ações em Alta', 'As ações brasileiras registraram ganhos significativos, impulsionadas por resultados financeiros positivos de grandes empresas.'),
('Nova Política Monetária do Banco Central', 'O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico.'),
('Investimentos Estrangeiros no Brasil', 'O país atraiu um aumento de investimentos estrangeiros diretos, especialmente em setores de tecnologia e energia renovável.'),
('Tendências do Mercado Imobiliário', 'O mercado imobiliário brasileiro mostra sinais de recuperação, com aumento nas vendas de imóveis e novos lançamentos.');

OPTIMIZE TABLE portuguese_news FLUSH;
Observe the result of the search query for the Portuguese text and compare the search result with the score below.
SELECT content,
  MATCH(TABLE portuguese_news) AGAINST ("content:Brasil") AS score
FROM portuguese_news
WHERE score
ORDER BY score DESC;
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| content | score |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
| O Brasil enfrenta desafios econômicos com a inflação em alta e a taxa de desemprego ainda elevada. | 0.22189012169837952 |
| O Banco Central do Brasil anunciou mudanças na política monetária para conter a inflação e estimular o crescimento econômico. | 0.2059776782989502 |
+-------------------------------------------------------------------------------------------------------------------------------------+---------------------+
Example 9: Spanish Analyzer with custom stopwords
Use a spanish analyzer with custom stopwords to search for Spanish text in queries.
Create a table, insert data, and optimize the table to ensure all data is included in results.
CREATE TABLE spanish_news (
  headline VARCHAR(200),
  content TEXT,
  FULLTEXT USING VERSION 2 (content)
    INDEX_OPTIONS '{"analyzer": {"spanish": {"stopset": ["descubrimiento", "tratamiento", "nuevo"]}}}'
);

INSERT INTO spanish_news (headline, content) VALUES
('Descubrimiento de un nuevo tratamiento para la diabetes', 'Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos.'),
('Avances en la detección temprana del cáncer', 'Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso.'),
('Nuevo enfoque para tratar enfermedades cardíacas', 'Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos.'),
('Investigación sobre un gen relacionado con el Alzheimer', 'Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento.'),
('Desarrollo de una vacuna contra COVID-19', 'Un equipo de investigadores ha anunciado resultados prometedores en la efectividad de una nueva vacuna contra COVID-19.');

OPTIMIZE TABLE spanish_news FLUSH;
Observe the results of two search queries below: one for the defined Spanish stopword and another for the actual Spanish stopword.
SELECT *
FROM spanish_news
WHERE MATCH(TABLE spanish_news) AGAINST("content:nuevo");
Empty set
SELECT *
FROM spanish_news
WHERE MATCH(TABLE spanish_news) AGAINST("content:el");
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| headline | content |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| Investigación sobre un gen relacionado con el Alzheimer | Científicos han identificado un gen que podría estar vinculado a la enfermedad de Alzheimer, lo que abre nuevas posibilidades para el tratamiento. |
| Avances en la detección temprana del cáncer | Un nuevo método permite detectar el cáncer en etapas más tempranas, aumentando las posibilidades de tratamiento exitoso. |
| Descubrimiento de un nuevo tratamiento para la diabetes | Investigadores han desarrollado un tratamiento innovador que mejora el control del azúcar en sangre en pacientes diabéticos. |
| Nuevo enfoque para tratar enfermedades cardíacas | Se ha introducido un nuevo enfoque terapéutico que reduce significativamente el riesgo de ataques cardíacos. |
+----------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
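The two results are consistent with the custom stopset replacing the analyzer's default stop word list rather than extending it: nuevo is dropped, while el, normally a Spanish stop word, stays searchable. The Python sketch below illustrates stop word removal under that reading; it ignores stemming, and the token list is a hand-made sample rather than actual analyzer output.

```python
def remove_stopwords(tokens, stopset):
    """Drop tokens that appear in the stop word set."""
    stop = set(stopset)
    return [token for token in tokens if token not in stop]


custom_stopset = ["descubrimiento", "tratamiento", "nuevo"]
tokens = ["un", "nuevo", "método", "permite", "detectar", "el", "cáncer"]
kept = remove_stopwords(tokens, custom_stopset)
print(kept)
```

Only the terms that survive filtering are indexed, so a query for a removed term returns an empty set.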
Supported Language Analyzers
The following table lists the supported language analyzers.
Language | Default Stop Word List Link |
---|---|
english | "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with" |
korean | This is Apache Lucene's Korean (Nori) Analyzer. Custom stopword lists are not supported with the korean analyzer. |
Supported Tokenizers
The table below lists supported tokenizers.
Get Parameters
The parameters and description of each of these tokenizers can be obtained from the links included in the table.
Example: Get parameters for the uax_url_email tokenizer
To obtain the parameters for the uax_url_email tokenizer, follow the tokenizer factory link for the uax_url_email tokenizer, which can be found in the middle column of the table below.
The following is the tokenizer factory for the uax_url_email tokenizer, which has been obtained from the tokenizer factory link. It shows a single parameter, maxTokenLength, which defaults to 255.
<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
  </analyzer>
</fieldType>
The INDEX_OPTIONS string to create a full-text index with the uax_url_email tokenizer and a maxTokenLength of 300 is shown below.
INDEX_OPTIONS '{"analyzer": {"custom": {"tokenizer": {"uax_url_email" : {"maxTokenLength": 300}}}}}'
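The same lookup can be scripted. The Python sketch below parses the factory snippet shown above, treats every attribute other than class as a parameter with its default value (an assumption based on this one sample), and then builds the INDEX_OPTIONS JSON with an overridden value.

```python
import json
import xml.etree.ElementTree as ET

FIELD_TYPE = """
<fieldType name="text_urlemail" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory" maxTokenLength="255"/>
  </analyzer>
</fieldType>
"""

tokenizer = ET.fromstring(FIELD_TYPE).find("./analyzer/tokenizer")
# Assumption: every attribute except "class" is a tokenizer parameter.
defaults = {name: value for name, value in tokenizer.attrib.items() if name != "class"}
print(defaults)

# Override the default and emit the matching INDEX_OPTIONS JSON.
options = {"analyzer": {"custom": {"tokenizer": {"uax_url_email": {"maxTokenLength": 300}}}}}
print(json.dumps(options))
```

XML attribute values are read as strings, so the default appears as "255", whereas the INDEX_OPTIONS examples in this document pass the value as a JSON number.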
List of Supported Tokenizers
"tokenizer" (Case-Sensitive) | Tokenizer Factory Link (Includes Parameters) | Tokenizer Class Link (Includes Description) |
---|---|---|
| | Description: Tokenizer for Korean that uses morphological analysis. |
Supported Token Filters
This table lists the supported token filters by name, with a link to the token filter factory documentation, which provides the parameters and description for each token filter.
"token_filters" | Lucene Link for Parameters and Description |
---|---|
| KoreanPartOfSpeechStopFilterFactory: A token filter that removes tokens that match a set of part-of-speech tags. |
| KoreanReadingFormFilterFactory: A token filter that rewrites tokens written in Hanja to their Hangul form. |
| A token filter that normalizes Korean numbers to Arabic decimal numbers in half-width characters. |
Supported Character Filters
This table lists the supported character filters by name, with a link for the parameters.
"char_filters" | Lucene Link for Parameters |
---|---|
Last modified: January 17, 2025