Working with Full-Text Search
On this page
Full-text search allows searching for words or phrases in a large body of text through an inverted index.fuzzy
and performed over the text types JSON
, CHAR
, VARCHAR
, TEXT
, and LONGTEXT
.
SingleStore supports an updated full-text search process referred to as VERSION 2
.
The following are supported in VERSION 2
:
-
Full-text indexes are only supported on columnstore tables.
-
Full-text indexes can be created with a
CREATE TABLE
statement using the FULLTEXT index type, or can be added to existing tables via an ALTER TABLE ADD FULLTEXT statement. -
Only one full-text index per table is supported.
-
Full-text indexes can be dropped via
DROP INDEX
orALTER TABLE DROP INDEX
statements. -
VERSION 2
uses BM25 scoring.
The USING VERSION 2
syntax must be used in the CREATE TABLE
command to utilize the VERSION 2
process.
CREATE TABLE <table_name> FULLTEXT USING VERSION 2 <fts_index_name> (<fts_col>) | (<fts_col1>,..., <fts_coln>))
Note
The legacy full-text search version is deprecated.VERSION 2
for new development of applications that use full-text search.
VERSION 2
does not allow the column names of full-text columns to contain a $
character.$
character.
The full-text search version (from legacy to version 2) can be changed on an existing table.DROP INDEX
command:
Note
If an index name was not designated when creating the table, the full-text index key_
must be used when dropping the index.key_
is displayed when the SHOW INDEXES FROM <table_
command is executed.
DROP INDEX <fts_index_name> | <index_key_name> ON <table_name>;
Then, ALTER TABLE
command is used with the FULLTEXT USING VERSION 2
argument :
ALTER TABLE <table_name> ADD FULLTEXT USING VERSION 2 fts_index_name (<fts_col>) | (<fts_col1>,..., <fts_coln>);
Content in columns that are full-text indexed can be searched using MATCH , BM25, or BM25_MATCH
, BM25
, or BM25_
clause applies to only one table.MATCH
, BM25
, or BM25_
clauses.
Note
New inserts and updates into columnstore tables may initially be stored in a hidden rowstore table before being flushed to a segment file.
In that case, the full-text index in the columnstore will be updated asynchronously for new inserts and updates.OPTIMIZE TABLE <table_
command.
Since an index is created for each segment file, the distribution of words within the segment may affect the score of full-text queries, especially when the segments have very few rows and the columns have very few words.
For more information on columnstore tables, see How the Columnstore Works.
Version 2 SingleStore Process
SingleStore's VERSION 2
full-text search system uses a JLucene service.
The default analyzer for VERSION 2
is the StandardAnalyzer from Apache Lucene configured to use the StandardTokenizer from Apache Lucene, a lower-case filter, and a set of stopwords.
Version 2 Installation for Self-Managed Users
Full-text version 2 requires Java 21 or later to be installed on all nodes.
On each node, the following engine variables must be set.
-
fts2_
: This is the path to the top-level directory of the Java installation which is stored in the $JAVA_java_ home HOME environment variable. -
fts2_
: This is the path to the Java executable which is typically located in $JAVA_java_ path HOME/bin/java.
For example, assuming all of the hosts have the same Java paths as reflected below, run the following commands to set the fts2_
and fts2_
engine variables to the same Java path values on all nodes in the cluster:
sdb-admin update-config --all --set-global --key "fts2_java_home" --value "/usr/lib/jvm/jre-21-openjdk"
sdb-admin update-config --all --set-global --key "fts2_java_path" --value "/usr/lib/jvm/jre-21-openjdk/bin/java"
If the locations for the Java home directory and Java executable differ from those in the preceding commands, change the path names in the --value arguments to match your environment
To restart your cluster with the updated configurations set, run the following command:
sdb-admin restart-node --all --yes
Refer to sdb-admin update-config for more information about updating nodes in your cluster.
Terms in Full-Text Searches
There are two types of search terms: single terms and phrases.
Search terms can be modified to provide a wide range of search options as described below.
Wildcard Support
Single and multiple character wildcard searches within single terms are supported, but not within search phrases.
-
Use the
?
symbol to perform a single character wildcard search. -
Use the
*
symbol to perform a multi character wildcard search.
Important
Neither ?
or *
are supported at the beginning of a term.
A single character wildcard search matches words based on a single-character.te?t
.
A multiple character wildcard search matches words based on zero or more characters.test*
.te*t
.
The following examples demonstrate how these wildcard searches appear in sample queries:
SELECT * FROM wilsearch1 WHERE MATCH (TABLE wilsearch1) AGAINST ('col1:te?t');
SELECT * FROM wilsearch1 WHERE MATCH (TABLE wilsearch1) AGAINST ('col1:te*t');
Boosting a Term
Boosting a search term means increasing the relevance or importance of that search term in the search results.
For example, if you are searching for Single Store
and you want the term Store
to be more relevant, boost it using the ^ symbol along with the boost factor next to the term, such as you could type Single Store^4
.
You can also boost phrase as in the example: "Single Store"^4 "MySQL"
.
Note
The boost factor is 1 by default.
The following examples demonstrate how these wildcard searches appear in sample queries:
SELECT * FROM bstsearch1 WHERE MATCH (TABLE bstsearch1) AGAINST ('col1:"Single Store"^4 "MySQL"');
SELECT * FROM bssearch1 WHERE MATCH (TABLE bstsearch1) AGAINST ('col1:SingleStore MySQL^0.02');
Operators in Full-Text Search
Full-text version 2 supports operators listed on the Java Lucene full-text search string syntax page.
Grouping Terms in a Query
Grouping Single Terms
Parentheses can be used to group terms to form subqueries.
For example, to search for either "Single" or "Store" and "MemSQL", use the query (Single OR Store) AND MemSQL
.
Grouping Multiple Terms into a Single Field
Parentheses can be used to group multiple clauses into a single field.col1
full-text column or for "MemSQL" in the col2
full-text column.
SELECT * FROM grpsearch1 WHERE MATCH (TABLE grpsearch1) AGAINST ('col1:SingleStore OR col2:MemSQL');
Escaping Special Characters
Special characters that are part of the query syntax must be escaped to directly match them.
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
Use the \
character before the special character to escape it.(1+1):2
,use the query \(1\+1\)\:2
.
Regular Expression Searches
Regular expression searches that match a pattern between forward slashes ("/") are supported./[mb]oat/
:
SELECT * FROM rexsearch1 WHERE MATCH (TABLE rexsearch1) AGAINST ('col1:/[mb]oat/');
The following example searches for dates with the format of month/day/year where the month and day can have one or two digits and year can have two or four digits.
SELECT * FROM rexsearch1 WHERE MATCH (TABLE rexsearch1) AGAINST ('col1:/\d{1,2}\/\d{1,2}\/\d{2,4}/');
Fuzzy Searches
Fuzzy searches based on the Damerau-Levenshtein Distance are supported.roam~
, which will find terms like "foam" and "roams".
An edit is a change made to a search term to transform it into another string.
Note
Only integer (non-fractional) values are permitted.
The following is an example of a fuzzy search of the term "roam" without the optional parameter:
SELECT * FROM fuzsearch1 WHERE MATCH (TABLE fuzsearch1) AGAINST ('col1:roam~');
The following is an example of a fuzzy search of the term "roam" with the optional parameter:
SELECT * FROM fuzsearch1 WHERE MATCH (TABLE fuzsearch1) AGAINST ('col1:roam~1');
Fuzzy Search OPTIONS Clause
Fuzzy searches can be augmented by using an optional OPTIONS
clause within the MATCH
syntax.
MATCH (TABLE <table_name>) AGAINST (<col_name$key.path:search_term> OPTIONS <json_options>);
The set of supported options are:
-
fuzzy_
: This is the number of characters at the start of a search term that must be identical (not fuzzy) to the query term if the query is to match the search term.prefix_ length The default value is 0
. -
fuzzy_
: This is the maximum number of terms to match against.max_ expansions The default value is 50
. -
fuzzy_
: This allows transpositions with adjacent characters.transpositions This can capture some errors more efficiently, particularly mistyped adjacent characters. When transposition is allowed, matching against "ca~" will be returned "ac" also. The default value is TRUE
.
All of the options are case-sensitive, optional, and can be combined as required.
SELECT * FROM articles WHERE MATCH (TABLE articles) AGAINST ('body:roam~' OPTIONS '{"fuzzy_prefix_length": 2}');
SELECT * FROM articlesWHERE MATCH (TABLE articles) AGAINST ('body:roam~' OPTIONS '{"fuzzy_prefix_length": 1, "fuzzy_transpositions": false}');
Proximity Searches
Finding words within a specified distance away is supported."Single Store"~10
:
SELECT * FROM prxsearch1 WHERE MATCH (TABLE prxsearch1) AGAINST ('col1:"Single Store"~10');
Range Searches
Range searches match rows where the column(s) values are between the lower and upper bound specified by the query.
Square brackets ("[]") indicate an inclusive search, curly brackets ("{}") indicate an exclusive search.
An inclusive range search matches all rows where the values are between the search terms, including values equal to the search terms.col1
is 2008 through 2020:
SELECT * FROM rngsearch1WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[2008 TO 2020]');
An exclusive range search matches all rows where the values are between the search terms, but are not equal to the search terms.col1
is between 2008 and 2020, but is not equal to 2008 or 2020:
SELECT * FROM rngsearch1WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{2008 TO 2020}');
Range searches may be lexicographic or numeric.
The following query will use a numeric search as both the values 2008 and 2020 are numeric.
SELECT * FROM rngsearch1WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{2008 TO 2020}');
The following query will use a lexicographic search as both of the range values are strings that do not contain numeric values.
SELECT * FROM rngsearch1WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[A TO B}');
This query will use a numeric search as both the values "2008" and "2020" can be parsed numerically.
SELECT * FROM rngsearch1WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{"2008" TO "2020"}');
It is possible to force a lexicographic search on a numeric-style range clause.
SELECT * FROM rngsearch1WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[$2008$ TO $2020$]'
ORDER BY … LIMIT
Optimization
Full-text search queries can take advantage of an optimization that pushes down the value of LIMIT
in an ORDER BY … LIMIT
query into the full-text index search.
For this optimization to work, the query must follow these rules:
-
The
WHERE
clause must either be empty or the expression in theWHERE
clause must be the same full-text expression that is in theORDER BY
clause. -
The
ORDER BY
clause must useDESC
sort order. -
There must be only one full-text search function (
MATCH
,BM25
, orBM25_
) in the query.GLOBAL
Relevancy Score
The relevancy score of an expression in a MATCH
statement denotes the ranking of the expression based on the following factors:
-
Number of times an expression appears in a column.
More occurrences of an expression in the matched column(s) increases its relevancy score. -
Rarity of the expression.
Rare words have a higher relevancy score than commonly used words. -
The length of the column containing the expression.
A column with a short expression has a higher relevancy score than a column with a long expression.
BM25 Scoring
Full-text search version 2 supports BM25 scoring for.
Status
The Alloc_
status variables is the number of allocated bytes (out of the total max_
) that is attributable to the next-generation (VERSION 2
) full-text subprocess.SHOW STATUS EXTENDED
command.
Configurations
Below is a set of global variables that can be used to configure full-text search version 2.
If your system is experiencing high load due to full-text index builds, SingleStore recommends reducing the value of fts2_
to 16
or 8
.
Engine Variable |
Default |
Description |
---|---|---|
|
|
Collocated service request endpoint for full-text search version 2. |
|
|
Collocated service control endpoint for full-text search version 2. |
|
None |
If set, this provides the path for the Java executable used by full-text search version 2. |
|
None |
If set, this provides the JAVA_ |
|
|
The amount of heap memory with which to start the Java process for the full-text search version 2 collocated service. |
|
|
The maximum number of parallel requests that the full-text search version 2 collocated service can accept before queuing them. The default value is |
|
|
Sets the maximum size of the memory-mapped region used to communicate with the FTS service (in MB). |
|
|
The maximum amount of heap memory that the Java process for the full-text search version 2 collocated service may use. |
|
|
Sets the maximum size of the memory-mapped region used to communicate with the FTS service (in MB). |
|
|
This is the number of seconds between monitor polls (0 disables it). |
|
|
Sets the positionIncrementGap for use when indexing multi-valued fields. |
|
|
Sets the maximum number of seconds that a query will wait for a response from the full-text search version 2 service before returning a timeout error. |
|
|
The maximum number of seconds to wait for the FTS V2 collocated service to become initially pingable before it will be automatically killed and restarted. |
|
|
When the full-text search version 2 collocated service is shut down in a controlled manner, this sets the number of seconds to wait before it must be forcibly killed. |
Examples
Create a Version 2 Full-Text Search Index and Query Using MATCH
This example creates a FULLTEXT
index for both the title column and the body column.MATCH (TABLE <table_
, and the index on the column will be applied.
CREATE TABLE articles (id INT UNSIGNED,year int UNSIGNED,title VARCHAR(200),body TEXT,SORT KEY (id),FULLTEXT USING VERSION 2 art_ft_index (title, body));
INSERT INTO articles (id, year, title, body) VALUES(1, 2021, 'Introduction to SQL', 'SQL is a standard language for accessing and manipulating databases.'),(2, 2022, 'Advanced SQL Techniques', 'Explore advanced techniques and functions in SQL for better data manipulation.'),(3, 2020, 'Database Optimization', 'Learn about various optimization techniques to improve database performance.'),(4, 2023, 'SQL in Web Development', 'Discover how SQL is used in web development to interact with databases.'),(5, 2019, 'Data Security in SQL', 'An overview of best practices for securing data in SQL databases.'),(6, 2021, 'SQL and Data Analysis', 'Using SQL for effective data analysis and reporting.'),(7, 2022, 'Introduction to Database Design', 'Fundamentals of designing a robust and scalable database.'),(8, 2020, 'SQL Performance Tuning', 'Tips and techniques for tuning SQL queries for better performance.'),(9, 2023, 'Using SQL with Python', 'Integrating SQL with Python for data science and automation tasks.'),(10, 2019, 'NoSQL vs SQL', 'A comparison of NoSQL and SQL databases and their use cases.'),(11, 2020, 'Real-time Data Analysis', 'An introduction to real-time analytics.'),(12, 2021, 'Analysis for Beginners', 'Simple examples of real time analytics.'),(13, 2023, 'Data-Dictionary Design', 'Create and maintain effective data dictionaries.'),(14, 2024, 'Scalable Performance', 'Designing for scalability.');OPTIMIZE TABLE articles FLUSH;
SELECT *FROM articlesWHERE MATCH (TABLE articles) AGAINST ('body:database');
+----+------+---------------------------------+-----------------------------------------------------------------------------+
| id | year | title | body |
+----+------+---------------------------------+-----------------------------------------------------------------------------+
| 7 | 2022 | Introduction to Database Design | Fundamentals of designing a robust and scalable database. |
| 3 | 2020 | Database Optimization | Learn about various optimization techniques to improve database performance.|
+----+------+---------------------------------+-----------------------------------------------------------------------------+
SELECT titleFROM articlesWHERE MATCH (TABLE articles) AGAINST ('title:Database OR body:("Business Intelligence")');
+---------------------------------+
| title |
+---------------------------------+
| Introduction to Database Design |
| Database Optimization |
+---------------------------------+
SELECT titleFROM articlesWHERE MATCH (TABLE articles) AGAINST ('title:SQL AND title:("Data Security")');
+----------------------+
| title |
+----------------------+
| Data Security in SQL |
+----------------------+
SELECT titleFROM articlesWHERE MATCH (TABLE articles) AGAINST ('title:SQL OR body:("Business Intelligence")')AND MATCH (TABLE articles) AGAINST ('body:web+');
+------------------------+
| title |
+------------------------+
| SQL in Web Development |
+------------------------+
SELECT titleFROM articlesWHERE MATCH (TABLE articles) AGAINST ('title:(+Data*) OR title:function?');
+---------------------------------+
| title |
+---------------------------------+
| Introduction to Database Design |
| SQL and Data Analysis |
| Database Optimization |
| Data Security in SQL |
| Data-Dictionary Design |
| Real-time Data Analysis |
+---------------------------------+
Create a Version 2 Full-Text Index over JSON
A full-text index can be created over a JSON column in the same manner it can be created over any other text-type column.
CREATE TABLE ft_records (id INT UNSIGNED,title VARCHAR(200),records JSON,SORT KEY(id),FULLTEXT USING VERSION 2 rec_ft_index (title, records));
The full-text index is created over a JSON column by concatenating all leaf string values in the JSON as a multi-valued field.fts2_
defines the logical spacing between concatenated leaf string values to prevent matching across different leaf string values.
INSERT INTO ft_records VALUES (1,'document','{"k1": "cucumber","k2": ["dragonfruit", "eggplant"],"k3": [{"k3_1": "fig", "k3_2": "grape"},{"k3_1": ["huckleberry", "iceberg lettuce"]},"jicama"]}');OPTIMIZE TABLE ft_records FLUSH;
SELECT title, recordsFROM ft_records WHERE id = 1;
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
| title | records |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
| document | {"k1":"cucumber","k2":["dragonfruit","eggplant"],"k3":[{"k3_1":"fig","k3_2":"grape"},{"k3_1":["huckleberry","iceberg lettuce"]},"jicama"]} |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
The JSON document for the records
column for the row inserted above is shown below.
{"title": "document","records": ["cucumber", "dragonfruit", "eggplant","fig", "grape", "huckleberry","iceberg lettuce", "jicama"],"records$k1": "cucumber","records$k2": ["dragonfruit", "eggplant"],"records$k3": "jicama","records$k3.k3_1": ["fig", "huckleberry", "iceberg lettuce"],"records$k3.k3_2": "grape"}
Here the records
column is a multi-valued field of all leaf string values in JSON.fts2_
.
Along with records
, other fields like records$k1
and records$k3.
are created allow searching at each unique keypath present in the JSON document.
Query over JSON columns
Querying over the entire JSON column can be performed in the same way as with any other column part of the full-text index.
SELECT (MATCH (TABLE ft_records) AGAINST ('records:/.*cumber/')) AS cumberFROM ft_records;
+----------+
| cumber |
+----------+
| 1 |
+----------+
To query over a specific JSON keypath, use a query similar to the following:
SELECT (MATCH (TABLE ft_records) AGAINST ('records$k3.k3_1:fig')) AS figFROM ft_records;
+------------------------+
| fig |
+------------------------+
| 0.13076457381248474 |
+------------------------+
The example above shows how you can search for the string fig
at the keypath k3.
in the records
column using the field grouping syntax.
SELECT idFROM ft_recordsWHERE MATCH (TABLE ft_records) AGAINST ('records:(iceberg lettuce)');
+------+
| id |
+------+
| 1 |
+------+
SELECT idFROM ft_recordsWHERE MATCH (TABLE ft_records) AGAINST ('records:"cucumber dragonfruit"');
Empty set (0.008 sec)
SELECT idFROM ft_recordsWHERE MATCH (TABLE ft_records) AGAINST ('records:(cucumber dragonfruit~100)');
+------+
| id |
+------+
| 1 |
+------+
In the above examples, you can see that the phrase query cucumber dragonfruit
does not match because they belong to different leaf strings.fts2_
.
Example - ORDER BY … LIMIT
The query below will take advantage of the ORDER BY.
pushdown.MATCH
function, named match_
, the columns in the WHERE
and ORDER BY
clauses are the same, and the sort order is DESC
.
SELECT id, title,MATCH (TABLE articles) AGAINST ('body:database') AS match_resFROM articlesWHERE match_resORDER BY match_res DESC LIMIT 25;
Legacy (Version 1) SingleStore Process
SingleStore's legacy full-text search system uses a CLucene service which is embedded in the SingleStore database engine.
The following example illustrates how to create a table with a legacy version full-text search index and how to query from that table.USING VERSION 1
syntax is optional.
CREATE TABLE articles (id INT UNSIGNED,year int UNSIGNED,title VARCHAR(200),body TEXT,SORT KEY (id),FULLTEXT USING VERSION 1 (title, body));
SELECT * FROM articlesWHERE MATCH (title,body)AGAINST ('database');
Refer to the MATCH page for more examples.
Index Repair
Full-text index creation failure is rare.ER_
.OPTIMIZE TABLE <tablename> FIX_
.
Related Topics
-
Working with Vector Data- Allows for semantic searching which is searching based on meanings, not keywords.
-
Hybrid Search - Allows full-text and vector search methods in one query.
Full-text and vector search ranking can be combined. -
HIGHLIGHT -
HIGHLIGHT
is not supported inVERSION 2
. -
Training: Full-Text Index and Search
Last modified: November 8, 2024