Working with Full-Text Search

Full-text search allows searching for words or phrases in a large body of text through an inverted index. The search can be exact or fuzzy and performed over the text types JSON, CHAR, VARCHAR, TEXT, and LONGTEXT.

SingleStore supports an updated full-text search process referred to as VERSION 2. For information on Version 1, refer to Legacy (Version 1) SingleStore Process.

The following are supported in VERSION 2:

  • Full-text indexes are only supported on columnstore tables.

  • Full-text indexes can be created with a CREATE TABLE statement using the FULLTEXT index type, or can be added to existing tables via an ALTER TABLE ADD FULLTEXT statement.

  • Only one full-text index per table is supported.

  • Full-text indexes can be dropped via DROP INDEX or ALTER TABLE DROP INDEX statements.

The USING VERSION 2 syntax must be used in the CREATE TABLE command to utilize the VERSION 2 process.

CREATE TABLE <table_name> FULLTEXT USING VERSION 2 <fts_index_name> (<fts_col>) | (<fts_col1>,..., <fts_coln>))

Note

The legacy full-text search version is deprecated. SingleStore recommends using VERSION 2 for new development of applications that use full-text search.

VERSION 2 does not allow the column names of full-text columns to contain a $ character. An error will be generated if a full-text column name contains a $ character.

The full-text search version (from legacy to version 2) can be changed on an existing table. First, the existing full-text index must be dropped using the DROP INDEX command:

Note

If an index name was not designated when creating the table, the full-text index key_name must be used when dropping the index. The full-text index key_name is displayed when the SHOW INDEXES FROM <table_name> command is executed.

DROP INDEX <fts_index_name> | <index_key_name> ON <table_name>;

Then, ALTER TABLE command is used with the FULLTEXT USING VERSION 2 argument :

ALTER TABLE <table_name> ADD FULLTEXT USING VERSION 2 fts_index_name (<fts_col>) | (<fts_col1>,..., <fts_coln>);

Content in columns that are full-text indexed can be searched using the MATCH syntax. Each MATCH clause applies to only one table. To search against multiple tables, specify multiple MATCH clauses.

Note

New inserts and updates into columnstore tables may initially be stored in a hidden rowstore table before being flushed to a segment file. The affected segment is re-indexed when the background flusher runs.

In that case, the full-text index in the columnstore will be updated asynchronously for new inserts and updates. Inserts and updates from this rowstore table can be force-pushed  to the columnstore table by using the OPTIMIZE TABLE <table_name> FLUSH command.

Since an index is created for each segment file, the distribution of words within the segment may affect the score of full-text queries, especially when the segments have very few rows and the columns have very few words.

For more information on columnstore tables, see How the Columnstore Works.

Version 2 SingleStore Process

SingleStore's VERSION 2 full-text search system uses a JLucene service. The JLucene service is a Java process that provides an interface for the SingleStore engine. This interface allows the engine to perform full-text searches and create full-text search indexes for future use. The JLucene service uses software from the Apache Lucene project. The SingleStore engine and the JLucene service run on the same machine. Communication between the two processes occurs through domain sockets and shared memory. Typically, communication occurs on a per-segment basis.

Terms in Full-Text Searches

There are two types of search terms: single terms and phrases. A single term is a single word such as test or hello which does not require quotes. A phrase is a group of words surrounded by double quotes such as "hello SingleStore". Multiple search terms can be combined with Boolean operators to form more complex queries.

Search terms can be modified to provide a wide range of search options as described below.

Wildcard Support

Single and multiple character wildcard searches within single terms are supported, but not within search phrases.

  • Use the ? symbol to perform a single character wildcard search.

  • Use the * symbol to perform a multi character wildcard search.

Important

Neither ? or * are supported at the beginning of a term. For example, searching for ?ello or *ello will generate an error.

A single character wildcard search matches words based on a single-character. For example, to search for “text” or “test”, use the search term: te?t.

A multiple character wildcard search matches words based on zero or more characters. For example, to search for “test”, “tests”, or “tester”, use the search term test*. Wildcard searches in the middle of a search term can also be used, such as te*t.

The following examples demonstrate how these wildcard searches appear in sample queries:

SELECT * FROM wilsearch1 WHERE MATCH (TABLE wilsearch1) AGAINST ('col1:te?t');
SELECT * FROM wilsearch1 WHERE MATCH (TABLE wilsearch1) AGAINST ('col1:te*t');

Boosting a Term

Boosting a search term means increasing the relevance or importance of that search term in the search results. To boost a search term use the caret ("^") symbol with a boost factor (a number) at the end of the search term. The higher the boost factor, the more relevant the search term will be.

For example, if you are searching for Single Store and you want the term Store to be more relevant, boost it using the ^ symbol along with the boost factor next to the term, such as you could type Single Store^4. This will make rows with the term Store appear more relevant.

You can also boost phrase as in the example: "Single Store"^4 "MySQL".

Note

The boost factor is 1 by default. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2), which leads to the term or phrase have less relevance.

The following examples demonstrate how these wildcard searches appear in sample queries:

SELECT * FROM bstsearch1 WHERE MATCH (TABLE bstsearch1) AGAINST ('col1:"Single Store"^4 "MySQL"');
SELECT * FROM bssearch1 WHERE MATCH (TABLE bstsearch1) AGAINST ('col1:SingleStore MySQL^0.02');

Full-text version 2 supports operators listed on the Java Lucene full-text search string syntax page.

Grouping Terms in a Query

Grouping Single Terms

Parentheses can be used to group terms to form subqueries. This can be useful for controlling the boolean logic for a query.

For example, to search for either "Single" or "Store" and "MemSQL", use the query (Single OR Store) AND MemSQL. This ensures that "MemSQL" exists with either the"Single" or "Store" search terms.

Grouping Multiple Terms into a Single Field

Parentheses can be used to group multiple clauses into a single field. The following query can be used to search for "SingleStore" in the col1 full-text column or for "MemSQL" in the col2 full-text column.

SELECT * FROM grpsearch1 WHERE MATCH (TABLE grpsearch1) AGAINST ('col1:SingleStore OR col2:MemSQL');

Escaping Special Characters

Special characters that are part of the query syntax must be escaped to directly match them. These special characters are:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

Use the \ character before the special character to escape it. For example, to search for (1+1):2 ,use the query \(1\+1\)\:2.

Regular Expression Searches

Regular expression searches that match a pattern between forward slashes ("/") are supported. For example, to find rows containing "moat" or "boat", use /[mb]oat/:

SELECT * FROM rexsearch1 WHERE MATCH (TABLE rexsearch1) AGAINST ('col1:/[mb]oat/');

The following example searches for dates with the format of month/day/year where the month and day can have one or two digits and year can have two or four digits.

SELECT * FROM rexsearch1 WHERE MATCH (TABLE rexsearch1) AGAINST ('col1:/\d{1,2}\/\d{1,2}\/\d{2,4}/');

Fuzzy Searches

Fuzzy searches based on the Damerau-Levenshtein Distance are supported. To perform a fuzzy search use the tilde ("~") symbol at the end of a single term. For example, to search for a term similar in spelling to "roam" use the fuzzy search: roam~, which will find terms like "foam" and "roams".

An edit is a change made to a search term to transform it into another string. An optional “edit” parameter can be used after the tilde to specify the maximum number of edits permitted. This value can be 0, 1, or 2. If not specified, the default value is 2 edits.

Note

Only integer (non-fractional) values are permitted.

The following is an example of a fuzzy search of the term "roam" without the optional parameter:

SELECT * FROM fuzsearch1 WHERE MATCH (TABLE fuzsearch1) AGAINST ('col1:roam~');

The following is an example of a fuzzy search of the term "roam" with the optional parameter:

SELECT * FROM fuzsearch1 WHERE MATCH (TABLE fuzsearch1) AGAINST ('col1:roam~1');

Fuzzy Search OPTIONS Clause

Fuzzy searches can be augmented by using an optional OPTIONS clause within the MATCH syntax. The syntax for adding options is as follows:

MATCH (TABLE <table_name>) AGAINST (<col_name$key.path:search_term> OPTIONS <json_options>);

The set of supported options are:

  • fuzzy_prefix_length: This is the number of characters at the start of a search term that must be identical (not fuzzy) to the query term if the query is to match the search term. The default value is 0.

  • fuzzy_max_expansions : This is the maximum number of terms to match against. The default value is 50.

  • fuzzy_transpositions: This allows transpositions with adjacent characters. This can capture some errors more efficiently, particularly mistyped adjacent characters. When transposition is allowed, matching against "ca~" will be returned "ac" also. The default value is TRUE.

All of the options are case-sensitive, optional, and can be combined as required. For example:

SELECT * FROM articles WHERE MATCH (TABLE articles) AGAINST ('body:roam~' OPTIONS '{"fuzzy_prefix_length": 2}');
SELECT * FROM articles
WHERE MATCH (TABLE articles) AGAINST ('body:roam~' OPTIONS '{"fuzzy_prefix_length": 1, "fuzzy_transpositions": false}');

Proximity Searches

Finding words within a specified distance away is supported. To perform a proximity search, use the tilde ("~") symbol at the end of a search phrase. For example, to search for a "Single" and "Store" within 10 words of each other in a row, use the search "Single Store"~10:

SELECT * FROM prxsearch1 WHERE MATCH (TABLE prxsearch1) AGAINST ('col1:"Single Store"~10');

Range Searches

Range searches match rows where the column(s) values are between the lower and upper bound specified by the query. Range searches can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically or by value depending on whether the search is over string or numeric values.

Square brackets ("[]") indicate an inclusive search, curly brackets ("{}") indicate an exclusive search. A range clause of [123 TO 345] implies a numeric range search, a range clause of [abc TO def] implies a lexicographic range search, and [$123$ TO $345$] forces a lexicographic range despite the numeric range values. Specific examples are provided below.

An inclusive range search matches all rows where the values are between the search terms, including values equal to the search terms. Inclusive range queries are denoted by square brackets (“[ ]”). For example, the following query searches for rows where the numeric value of col1 is 2008 through 2020:

SELECT * FROM rngsearch1
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[2008 TO 2020]');

An exclusive range search matches all rows where the values are between the search terms, but are not equal to the search terms. Exclusive range queries are denoted by curly brackets (“{ }”). For example, the following query searches for rows where the numeric value of col1 is between 2008 and 2020, but is not equal to 2008 or 2020:

SELECT * FROM rngsearch1
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{2008 TO 2020}');

Range searches may be lexicographic or numeric. Typically searches over strings are expected to order lexicographically while searches over numeric values are expected to order by numeric value. The type of range search to use (lexicographic or numeric) is determined by the values entered in the range clause in the query. If both the upper and lower bounds of the range are numeric, then a numeric range search will be performed. If either value is not numeric, then a lexicographic query will be used.

The following query will use a numeric search as both the values 2008 and 2020 are numeric.

SELECT * FROM rngsearch1
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{2008 TO 2020}');

The following query will use a lexicographic search as both of the range values are strings that do not contain numeric values. Note that this query is inclusive of A, but exclusive of B.

SELECT * FROM rngsearch1
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[A TO B}');

This query will use a numeric search as both the values "2008" and "2020" can be parsed numerically.

SELECT * FROM rngsearch1
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{"2008" TO "2020"}');

It is possible to force a lexicographic search on a numeric-style range clause. To do so, surround the numeric upper and lower bounds with the '$' character. The following query will use a lexicographic search.

SELECT * FROM rngsearch1
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[$2008$ TO $2020$]'

Relevancy Score

The relevancy score of an expression in a MATCH statement denotes the ranking of the expression based on the following factors:

  • Number of times an expression appears in a column. More occurrences of an expression in the matched column(s) increases its relevancy score.

  • Rarity of the expression. Rare words have a higher relevancy score than commonly used words.

  • The length of the column containing the expression. A column with a short expression has a higher relevancy score than a column with a long expression.

BM25 Scoring

Full-text search version 2 supports BM25 scoring for term queries only and always operates at the partition level. Partition level scoring improves accuracy across a larger set of data. Collection and term statistics are calculated within a partition, ensuring accurate scores relative to a row's placement within that partition.

The BM25( ) function returns the computed BM25 score within the partition. For querying unique keypaths on JSON documents, the <term> argument contains the column name with keypath and the term.

Queries using the MATCH syntax are still scored on a segment level.

To employ BM25 partition-scoped scoring, use the BM25() function:

BM25(<table_name>,'<column>:<term>')

The syntax for the BM25() function for JSON columns is:

BM25(<table_name>,'<column>$<key.path>:<term>')

Obtaining a BM25 Score Example

CREATE TABLE scoring (a INT, b TEXT, c JSON, SORT KEY(a),
FULLTEXT USING VERSION 2 scr_ft_index (b, c));
INSERT INTO scoring VALUES
(1, 'one two three three four four four five five five five five','{"one": "two"}');
INSERT INTO scoring VALUES
(2, 'one two two three three three three four four four four four four four four','{"one": "three"}');
OPTIMIZE TABLE scoring FLUSH;
SELECT a
, ROUND(BM25(scoring,'b:one'), 3) AS score_one
, ROUND(BM25(scoring,'b:two'), 3) AS score_two
, ROUND(0.6*BM25(scoring,'b:three') + 0.4*BM25(scoring,'b:four'),3) AS score_boosted
FROM scoring;
+-----+-----------+-----------+---------------+
| a   | score_one | score_two | score_boosted |
+-----+-----------+-----------+---------------+
| 1   |     0.087 |     0.087 |         0.124 |
| 2   |     0.079 |     0.110 |         0.145 |
+-----+-----------+-----------+---------------+
 

To obtain BM25 scores over JSON documents at a keypath, use the following query:

SELECT a, ROUND(BM25(scoring, 'c$one:three'), 3) AS score_one FROM scoring;
+-----+-----------+
| a   | score_one |
+-----+-----------+
| 1   |     0.000 | 
| 2   |     0.131 |
+-----+-----------+

Status

The Alloc_fts2_svc status variables is the number of allocated bytes (out of the total max_memory) that is attributable to the next-generation (VERSION 2) full-text subprocess. This status variable will be exposed in the SHOW STATUS EXTENDED command.

Configurations

Below is a set of global variables that can be used to configure full-text search version 2.

Engine Variable

Default

Description

fts2_endpoint

/tmp/s2-fts

Collocated service request endpoint for full-text search version 2. Not configurable on SingleStore Helios.

fts2_ctrl_endpoint

/tmp/s2-ftsctl

Collocated service control endpoint for full-text search version 2. Not configurable on SingleStore Helios .

fts2_java_path

None

If set, this provides the path for the Java executable used by full-text search version 2. Not configurable on SingleStore Helios.

fts2_java_home

None

If set, this provides the JAVA_HOME for the Full-Text Search service.  Not configurable on SingleStore Helios.

fts2_init_memory_mb

0

The amount of heap memory with which to start the Java process for the full-text search version 2 collocated service. If it is 0, the Java default is used.

fts2_max_connections

64

The maximum number of parallel requests that the full-text search version 2 collocated service can accept before queuing them.

fts2_max_service_buffer_mb

1800

Sets the maximum size of the memory-mapped region used to communicate with the FTS service (in MB). Minimum value is 1MB.

fts2_max_memory_mb

0

The maximum amount of heap memory that the Java process for the full-text search version 2 collocated service may use. If it is 0, the Java default is used.

fts2_max_service_buffer_mb

2000

Sets the maximum size of the memory-mapped region used to communicate with the FTS service (in MB). Minimum value is 1MB.

fts2_monitor_interval_secs

60

This is the number of seconds between monitor polls (0 disables it).

fts2_position_increment_gap

100

Sets the positionIncrementGap for use when indexing multi-valued fields. Concatenates multiple values within a field. This variable controls the amount of virtual white space that is inserted between same-field values during a search. This helps prevent proximity and phrase queries from erroneously matching across field instances. The default value gives good results in most cases.

fts2_query_timeout

5

Sets the maximum number of seconds that a query will wait for a response from the full-text search version 2 service before returning a timeout error.

fts2_start_timeout_secs

5

The maximum number of seconds to wait for the FTS V2 collocated service to become initially pingable before it will be automatically killed and restarted.

fts2_stop_timeout_secs

60

When the full-text search version 2 collocated service is shut down in a controlled manner, this sets the number of seconds to wait before it must be forcibly killed.

Examples

Create a Version 2 Full-Text Search Index and Query Using MATCH

This example creates a FULLTEXT index for both the title column and the body column. Either column can be queried separately using MATCH (TABLE <table_name>) AGAINST (<expression>), and the index on the column will be applied.

CREATE TABLE articles (
id INT UNSIGNED,
year int UNSIGNED,
title VARCHAR(200),
body TEXT,
SORT KEY (id),
FULLTEXT USING VERSION 2 art_ft_index (title, body));
INSERT INTO articles (id, year, title, body) VALUES
(1, 2021, 'Introduction to SQL', 'SQL is a standard language for accessing and manipulating databases.'),
(2, 2022, 'Advanced SQL Techniques', 'Explore advanced techniques and functions in SQL for better data manipulation.'),
(3, 2020, 'Database Optimization', 'Learn about various optimization techniques to improve database performance.'),
(4, 2023, 'SQL in Web Development', 'Discover how SQL is used in web development to interact with databases.'),
(5, 2019, 'Data Security in SQL', 'An overview of best practices for securing data in SQL databases.'),
(6, 2021, 'SQL and Data Analysis', 'Using SQL for effective data analysis and reporting.'),
(7, 2022, 'Introduction to Database Design', 'Fundamentals of designing a robust and scalable database.'),
(8, 2020, 'SQL Performance Tuning', 'Tips and techniques for tuning SQL queries for better performance.'),
(9, 2023, 'Using SQL with Python', 'Integrating SQL with Python for data science and automation tasks.'),
(10, 2019, 'NoSQL vs SQL', 'A comparison of NoSQL and SQL databases and their use cases.');
OPTIMIZE TABLE articles FLUSH;
SELECT * FROM articles WHERE MATCH (TABLE articles) AGAINST ('body:database');
+----+------+---------------------------------+-----------------------------------------------------------------------------+
| id | year | title                           | body                                                                        |
+----+------+---------------------------------+-----------------------------------------------------------------------------+
| 7  | 2022 | Introduction to Database Design |	Fundamentals of designing a robust and scalable database.                   |
| 3  | 2020 | Database Optimization	      | Learn about various optimization techniques to improve database performance.|
+----+------+---------------------------------+-----------------------------------------------------------------------------+
SELECT title FROM articles WHERE MATCH (TABLE articles) AGAINST ('title:Database OR body:("Business Intelligence")');
+---------------------------------+
| title                           |     
+---------------------------------+
| Introduction to Database Design |
| Database Optimization           |
+---------------------------------+
SELECT title FROM articles WHERE MATCH (TABLE articles) AGAINST ('title:SQL AND title:("Data Security")');
+----------------------+
| title                |     
+----------------------+
| Data Security in SQL |
+----------------------+
SELECT title
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('title:SQL OR body:("Business Intelligence")')
AND MATCH (TABLE articles) AGAINST ('body:web+');
+------------------------+
| title                  |     
+------------------------+
| SQL in Web Development |
+------------------------+
SELECT title
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('title:(+Data*) OR title:function?');
+---------------------------------+
| title                           |     
+---------------------------------+
| Introduction to Database Design |
| SQL and Data Analysis           |
| Database Optimization           |
| Data Security in SQL            |
+---------------------------------+

Create a Full-Text Index over JSON

A full-text index can be created over a JSON column in the same manner it can be created over any other text-type column.

CREATE TABLE ft_records (
id INT UNSIGNED,
title VARCHAR(200),
records JSON,
SORT KEY(id),
FULLTEXT USING VERSION 2 rec_ft_index (title, records));

The full-text index is created over a JSON column by concatenating all leaf string values in the JSON as a multi-valued field. The engine variable fts2_position_increment_gap defines the logical spacing between concatenated leaf string values to prevent matching across different leaf string values. The default value is 100. In addition to having a field for each column in the full-text index, there will also be additional fields for each unique keypath in the JSON document.

INSERT INTO ft_records VALUES (
1,
'document',
'{
"k1": "cucumber",
"k2": ["dragonfruit", "eggplant"],
"k3": [
{"k3_1": "fig", "k3_2": "grape"},
{"k3_1": ["huckleberry", "iceberg lettuce"]},
"jicama"
]
}');
OPTIMIZE TABLE ft_records FLUSH;
SELECT title, records FROM ft_records WHERE id = 1;
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+ 
| title	   | records                                                                                                                                    |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
| document | {"k1":"cucumber","k2":["dragonfruit","eggplant"],"k3":[{"k3_1":"fig","k3_2":"grape"},{"k3_1":["huckleberry","iceberg lettuce"]},"jicama"]} | 
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+

The JSON document for the records column for the row inserted above is shown below.

{
"title": "document",
"records": ["cucumber", "dragonfruit", "eggplant",
"fig", "grape", "huckleberry",
"iceberg lettuce", "jicama"],
"records$k1": "cucumber",
"records$k2": ["dragonfruit", "eggplant"],
"records$k3": "jicama",
"records$k3.k3_1": ["fig", "huckleberry", "iceberg lettuce"],
"records$k3.k3_2": "grape"
}

Here the records column is a multi-valued field of all leaf string values in JSON. This field can be queried like any other full-text indexed field. An exception is that matching will not occur over separate values unless the maximum number of positions allowed between matching phrases, or the “slop,” exceeds the fts2_position_increment_gap.

Along with records, other fields like records$k1 and records$k3.k3_1 are created allow searching at each unique keypath present in the JSON document. The dollar sign ("$") is used as a delimiter between the SQL column name and the JSON keypath.

Querying over JSON columns

Querying over the entire JSON column can be performed in the same way as with any other column part of the full-text index.

SELECT (MATCH (TABLE ft_records) AGAINST ('records:/.*cumber/')) FROM ft_records;

To query over a specific JSON keypath, use a query similar to the following:

SELECT (MATCH (TABLE ft_records) AGAINST ('records$k3.k3_1:fig')) FROM ft_records;

The example above shows how you can search for the string fig at the keypath k3.k3_1 in the records column using the field grouping syntax.

SELECT id FROM ft_records WHERE MATCH (TABLE ft_records) AGAINST ('records:(iceberg lettuce)');
+------+ 
| id   |
+------+ 
| 1    | 
+------+ 
SELECT id FROM ft_records WHERE MATCH (TABLE ft_records) AGAINST ('records:"cucumber dragonfruit"');
Empty set (0.008 sec)
SELECT id FROM ft_records WHERE MATCH (TABLE ft_records) AGAINST ('records:(cucumber dragonfruit~100)');
+------+
| id   | 
+------+ 
| 1    | 
+------+ 

In the above examples, you can see that the phrase query cucumber dragonfruit does not match because they belong to different leaf strings. However, when a slop (the maximum number of positions allowed between matching phrases) of 100 is added the row matches because the default fts2_position_increment_gap=100.

Legacy (Version 1) SingleStore Process

SingleStore's legacy full-text search system uses a CLucene service which is embedded in the SingleStore database engine.

The following example illustrates how to create a table with a legacy version full-text search index and how to query from that table. The USING VERSION 1 syntax is optional.

CREATE TABLE articles (
id INT UNSIGNED,
year int UNSIGNED,
title VARCHAR(200),
body TEXT,
SORT KEY (id),
FULLTEXT USING VERSION 1 (title, body));
SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database');

Refer to the MATCH page for more examples.

Last modified: July 24, 2024

Was this article helpful?