Working with Full-Text Search

Overview

SingleStore provides full-text search that is compatible with Apache Lucene Core (the Java implementation of Lucene) and uses BM25 scoring.

Below are select examples of this VERSION 2 full-text search. These examples are intended to provide a high-level introduction to SingleStore's full-text search. Details and additional examples are provided in sections below.

The following articles table is used for the examples.

SQL

CREATE TABLE articles (
    id INT UNSIGNED,
    year INT UNSIGNED,
    title VARCHAR(200),
    body TEXT,
    SORT KEY (id),
    FULLTEXT USING VERSION 2 art_ft_index (title, body));

This example searches for articles containing the term Database in the title or the phrase "Business Intelligence" in the body of the article:

SQL

SELECT title, body
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('title:Database OR body:("Business Intelligence")');

A proximity search can be used to find articles that contain the phrase "SQL databases" in the body of the article, with the specification that SQL and databases must appear within 5 words of each other.

SQL

SELECT title, body
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('body:"SQL databases"~5');

A fuzzy search can be used to find articles with words in the body of the article that are within a Levenshtein edit distance of 2 of the term dtabase, which in the following example is deliberately misspelled.

SQL

SELECT title, body
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('body:dtabase~2');

Wildcards and regular expressions can be used in full-text searches. The following query uses the * wildcard to find articles with words that begin with data in the body of the article.

SQL

SELECT title, body
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('body:data*');

Finally, full-text search can also be used on JSON columns; refer to Create a Version 2 Full-Text Index over JSON examples.

The section Related Topics contains additional resources on full-text search including using full-text search in hybrid search and configuring full-text indexes for high performance.

Version 2 SingleStore Process

SingleStore's VERSION 2 full-text search system uses a JLucene service. The JLucene service is a Java process that provides an interface for the SingleStore engine. This interface allows the engine to perform full-text searches and create full-text search indexes for future use. The JLucene service uses software from the Apache Lucene project. The SingleStore engine and the JLucene service run on the same machine. Communication between the two processes occurs through domain sockets and shared memory. Typically, communication occurs on a per-segment basis.

The default analyzer for VERSION 2 is the StandardAnalyzer from Apache Lucene, configured to use the StandardTokenizer from Apache Lucene, a lower-case filter, and a set of stopwords. The StandardTokenizer is a grammar-based tokenizer which uses the word break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Create, Add, and Drop Full-Text Indexes

SingleStore recommends using VERSION 2 full-text search for new development. SingleStore's Legacy (Version 1) full-text search has been deprecated.

Create a Version 2 Full-Text Index

Create a version 2 full-text index with a CREATE TABLE statement using the FULLTEXT USING VERSION 2 index type. The USING VERSION 2 syntax must be used in the CREATE TABLE command to utilize the VERSION 2 process.

SQL

CREATE TABLE <table_name> ( 
    <column_definitions>, 
    FULLTEXT USING VERSION 2 [<fts_index_name>] 
    (<fts_col1>,..., <fts_coln>)
    [INDEX_OPTIONS '{...}']
);

Add a version 2 full-text index to an existing table with an ALTER TABLE ADD FULLTEXT statement.

SQL

ALTER TABLE <table_name> 
ADD FULLTEXT USING VERSION 2 [<fts_index_name>] 
(<fts_col1>,..., <fts_coln>)
[INDEX_OPTIONS '{...}']
;

To add a column to a full-text index, drop and recreate the index.
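The drop-and-recreate sequence can be sketched as follows. This is a minimal illustration that assumes the articles table from the Overview and a hypothetical summary column that is not part of the original table definition.

```sql
-- Assumption: articles already has a text column named summary (hypothetical).

-- Step 1: drop the existing full-text index.
DROP INDEX art_ft_index ON articles;

-- Step 2: recreate the index with the original columns plus the new column.
ALTER TABLE articles
ADD FULLTEXT USING VERSION 2 art_ft_index (title, body, summary);
```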

Drop a Full-Text Index

Drop a full-text index using a DROP INDEX or an ALTER TABLE DROP INDEX statement.

SQL

DROP INDEX <fts_index_name> | <index_key_name> ON <table_name>;

SQL

ALTER TABLE <table_name> DROP INDEX <fts_index_name> | <index_key_name>;

If an index name was not designated when creating the table, the full-text index index_key_name must be used when dropping the index. The full-text index index_key_name is displayed when the SHOW INDEXES FROM <table_name> command is executed.
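For an unnamed index, the lookup and drop might look like the sketch below. The articles table is used for illustration, and <index_key_name> is a placeholder for whatever key name SHOW INDEXES reports on your system.

```sql
-- List the indexes on the table; the full-text index key name appears in the output.
SHOW INDEXES FROM articles;

-- Drop the index using the key name reported above.
-- <index_key_name> is a placeholder, not a real name.
ALTER TABLE articles DROP INDEX <index_key_name>;
```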

Remarks

The following apply to VERSION 2:

Full-text indexes are only supported on columnstore tables (How the Columnstore Works).
Only one full-text index is supported per table.
During indexing, column values are split into tokens, which are turned into indexed terms. The maximum length of an indexed term is 32766 bytes.
- If a term is longer than that limit, you will see a message similar to: Forwarding Error (<node>): Leaf Error (<node>): Document contains at least one immense term in field=<field> (whose UTF8 encoding is longer than the max length 32766), all of which were skipped...
A MATCH … AGAINST clause may not refer to a CTE (WITH (Common Table Expressions)) because the CTE produces a dynamic table which does not have a full-text index. Similarly, MATCH … AGAINST clauses may not refer to derived tables.
- These restrictions apply to MATCH, BM25, and BM25_GLOBAL.
New inserts and updates into columnstore tables may initially be stored in a hidden rowstore table before being flushed to a segment file. The affected segment is re-indexed when the background flusher runs.
- In that case, the full-text index in the columnstore will be updated asynchronously for new inserts and updates. Inserts and updates from this rowstore table can be force-pushed to the columnstore table by using the OPTIMIZE TABLE <table_name> FLUSH command.
Since an index is created for each segment file, the distribution of words within the segment may affect the score of full-text queries, especially when the segments have very few rows and the columns have very few words.

Upgrade to Full-Text Version 2

A table can be upgraded from legacy full-text search to VERSION 2 full-text search. To do so:

Drop the existing full-text index using the DROP INDEX command.
Use the ALTER TABLE command with the FULLTEXT USING VERSION 2 argument to create a VERSION 2 full-text index.

Once you have upgraded your table to use a VERSION 2 full-text index, you will also need to change your queries to use the VERSION 2 query syntax.
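The upgrade steps and the query change can be sketched as follows. The table docs_v1, its body column, and the index name body_ft are hypothetical, and the legacy query form shown in the comment is an assumption included only for comparison.

```sql
-- Hypothetical table docs_v1 with a legacy full-text index named body_ft on (body).

-- Step 1: drop the legacy full-text index.
DROP INDEX body_ft ON docs_v1;

-- Step 2: create a VERSION 2 full-text index on the same column.
ALTER TABLE docs_v1
ADD FULLTEXT USING VERSION 2 body_ft (body);

-- Legacy-style query (assumed form, for comparison only):
--   SELECT * FROM docs_v1 WHERE MATCH (body) AGAINST ('database');

-- VERSION 2 query: the table is named in MATCH and the column is named in the search string.
SELECT *
FROM docs_v1
WHERE MATCH (TABLE docs_v1) AGAINST ('body:database');
```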

Query Full-Text Indexes

A full-text index search matches a search term or terms to content in a table that has been full-text indexed.

The following query, which uses the articles table from the Overview, finds all articles with the word Database in the title. The title column is part of the table's full-text index.

SQL

SELECT title, body
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('title:Database');

Terms can be single terms or phrases, can be modified with wildcards or boosted, can be combined with boolean operators, and more as described in the sections below.

The MATCH, BM25, or BM25_GLOBAL functions can be used to search full-text indexed content. These functions provide different tradeoffs between efficiency and accuracy with MATCH being the most efficient. SingleStore recommends using MATCH for most applications. Refer to MATCH and BM25 Scoring and Comparison of BM25 and BM25_GLOBAL for more details.

SingleStore supports custom analyzers for full-text VERSION 2 search. Users can customize full-text search by:

Using built-in analyzers for a variety of languages. The built-in analyzers can be customized with custom stop-word lists.
Using custom analyzers in which a user can specify a tokenizer, optional token and character filters, and an optional stop-word list.

JSON columns can be searched using full-text search as shown in Example 7: Score Over JSON.

Remarks

Each MATCH, BM25, or BM25_GLOBAL clause applies to only one table.
To search against multiple tables, specify multiple MATCH, BM25, or BM25_GLOBAL clauses.
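A search across two tables might look like the sketch below. The comments table, its columns, and its index name are hypothetical and exist only to illustrate using one MATCH clause per table.

```sql
-- Hypothetical second table with its own VERSION 2 full-text index.
CREATE TABLE comments (
    id INT UNSIGNED,
    article_id INT UNSIGNED,
    comment_text TEXT,
    SORT KEY (id),
    FULLTEXT USING VERSION 2 com_ft_index (comment_text));

-- One MATCH clause per table: the first searches articles, the second searches comments.
SELECT articles.title, comments.comment_text
FROM articles
JOIN comments ON comments.article_id = articles.id
WHERE MATCH (TABLE articles) AGAINST ('title:SQL')
  AND MATCH (TABLE comments) AGAINST ('comment_text:performance');
```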

Terms in Full-Text Searches

There are two types of search terms: single terms and phrases. A single term is a single word such as test or hello which does not require quotes. A phrase is a group of words surrounded by double quotes such as "hello SingleStore". Multiple search terms can be combined with Boolean operators to form more complex queries.

Search terms can be modified to provide a wide range of search options as described below.

Wildcard Support

Single and multiple character wildcard searches within single terms are supported, but not within search phrases.

Use the ? symbol to perform a single character wildcard search.
Use the * symbol to perform a multi character wildcard search.

Important

Neither ? nor * is supported at the beginning of a term. For example, searching for ?ello or *ello will generate an error.

Refer to Example 12 for an example of using an n_gram tokenizer to emulate a wildcard at the start and end of a string and search for a substring.

A single character wildcard search matches words based on a single character. For example, to search for “text” or “test”, use the search term: te?t.

A multiple character wildcard search matches words based on zero or more characters. For example, to search for “test”, “tests”, or “tester”, use the search term test*. Wildcard searches in the middle of a search term can also be used, such as te*t.

The following examples demonstrate how these wildcard searches appear in sample queries:

SQL

SELECT * FROM wilsearch1 WHERE MATCH (TABLE wilsearch1) AGAINST ('col1:te?t');

SQL

SELECT * FROM wilsearch1 WHERE MATCH (TABLE wilsearch1) AGAINST ('col1:te*t');

Boosting a Term

Boosting a search term means increasing the relevance or importance of that search term in the search results. To boost a search term, use the caret ("^") symbol with a boost factor (a number) at the end of the search term. The higher the boost factor, the more relevant the search term will be.

For example, if you are searching for Single Store and you want the term Store to be more relevant, append the ^ symbol and a boost factor to the term: Single Store^4. This will make rows with the term Store appear more relevant.

You can also boost a phrase, as in the example: "Single Store"^4 "MySQL".

Note

The boost factor is 1 by default. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2), which makes the term or phrase less relevant.

The following examples demonstrate how boosted searches appear in sample queries:

SQL

SELECT * FROM bstsearch1 WHERE MATCH (TABLE bstsearch1) AGAINST ('col1:"Single Store"^4 "MySQL"');

SQL

SELECT * FROM bstsearch1 WHERE MATCH (TABLE bstsearch1) AGAINST ('col1:SingleStore MySQL^0.02');

Operators in Full-Text Search

Full-text version 2 supports operators listed on the Java Lucene full-text search string syntax page.

Grouping Terms in a Query

Grouping Single Terms

Parentheses can be used to group terms to form subqueries. This can be useful for controlling the boolean logic for a query.

For example, to search for either "Single" or "Store" together with "MemSQL", use the query (Single OR Store) AND MemSQL. This ensures that "MemSQL" is present along with either the "Single" or the "Store" search term.
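In a full query, this grouping might look like the sketch below. The table grpsearch1 and the column col1 are sample names, and each term is given an explicit field prefix.

```sql
-- The parentheses make the OR bind first, so MemSQL is required
-- together with at least one of Single or Store.
SELECT * FROM grpsearch1
WHERE MATCH (TABLE grpsearch1) AGAINST ('(col1:Single OR col1:Store) AND col1:MemSQL');
```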

Grouping Multiple Terms into a Single Field

Parentheses can be used to group multiple clauses into a single field. The following query can be used to search for "SingleStore" in the col1 full-text column or for "MemSQL" in the col2 full-text column.

SQL

SELECT * FROM grpsearch1 WHERE MATCH (TABLE grpsearch1) AGAINST ('col1:SingleStore OR col2:MemSQL');
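To apply several clauses to one field, the field name can be written once in front of a parenthesized group. The following sketch uses the same sample table and searches col1 for either term.

```sql
-- The col1: prefix applies to every term inside the parentheses.
SELECT * FROM grpsearch1
WHERE MATCH (TABLE grpsearch1) AGAINST ('col1:(SingleStore OR MemSQL)');
```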

Escaping Special Characters

Special characters that are part of the query syntax must be escaped to directly match them. These special characters are:

SQL

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \ /

Use the \ character before the special character to escape it. For example, to search for (1+1):2, use the query \(1\+1\)\:2.

Regular Expression Searches

Regular expression searches that match a pattern between forward slashes ("/") are supported. For example, to find rows containing "moat" or "boat", use /[mb]oat/:

SQL

SELECT * FROM rexsearch1 WHERE MATCH (TABLE rexsearch1) AGAINST ('col1:/[mb]oat/');

The following example searches for dates with the format of month/day/year where the month and day can have one or two digits and year can have two or four digits.

SQL

SELECT * FROM rexsearch1 WHERE MATCH (TABLE rexsearch1) AGAINST ('col1:/\d{1,2}\/\d{1,2}\/\d{2,4}/');

Fuzzy Searches

Fuzzy searches based on the Damerau-Levenshtein Distance are supported. To perform a fuzzy search, use the tilde ("~") symbol at the end of a single term. For example, to search for a term similar in spelling to "roam" use the fuzzy search: roam~, which will find terms like "foam" and "roams".

An edit is a change made to a search term to transform it into another string. An optional “edit” parameter can be used after the tilde to specify the maximum number of edits permitted. This value can be 0, 1, or 2. If not specified, the default value is 2 edits.

Note

Only integer (non-fractional) values are permitted.

The following is an example of a fuzzy search of the term "roam" without the optional parameter:

SQL

SELECT * FROM fuzsearch1 WHERE MATCH (TABLE fuzsearch1) AGAINST ('col1:roam~');

The following is an example of a fuzzy search of the term "roam" with the optional parameter:

SQL

SELECT * FROM fuzsearch1 WHERE MATCH (TABLE fuzsearch1) AGAINST ('col1:roam~1');

Fuzzy Search OPTIONS Clause

Fuzzy searches can be augmented by using an optional OPTIONS clause within the MATCH syntax. The syntax for adding options is as follows:

SQL

MATCH (TABLE <table_name>) AGAINST (<col_name$key.path:search_term> OPTIONS <json_options>);

The supported options are:

fuzzy_prefix_length: The number of characters at the start of a search term that must be identical (not fuzzy) to the query term for the query to match the search term. The default value is 0.
fuzzy_max_expansions: The maximum number of terms to match against. The default value is 50.
fuzzy_transpositions: Allows transpositions of adjacent characters. This can capture some errors more efficiently, particularly mistyped adjacent characters. When transpositions are allowed, a search for "ca~" will also match "ac". The default value is TRUE.

All of the options are case-sensitive, optional, and can be combined as required. For example:

SQL

SELECT * FROM articles WHERE MATCH (TABLE articles) AGAINST ('body:roam~' OPTIONS '{"fuzzy_prefix_length": 2}');

SQL

SELECT * FROM articles 
    WHERE MATCH (TABLE articles) AGAINST ('body:roam~' OPTIONS '{"fuzzy_prefix_length": 1, "fuzzy_transpositions": false}');
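The fuzzy_max_expansions option can be combined with the others in the same way. The values in this sketch are illustrative only and are not tuning recommendations.

```sql
-- Require the first character to match exactly, allow at most 1 edit,
-- and cap the number of candidate terms at 10.
SELECT * FROM articles
WHERE MATCH (TABLE articles) AGAINST ('body:roam~1' OPTIONS '{"fuzzy_prefix_length": 1, "fuzzy_max_expansions": 10}');
```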

Proximity Searches

Searching for words that appear within a specified distance of each other is supported. To perform a proximity search, use the tilde ("~") symbol at the end of a search phrase. For example, to search for "Single" and "Store" within 10 words of each other in a row, use the search "Single Store"~10:

SQL

SELECT * FROM prxsearch1 WHERE MATCH (TABLE prxsearch1) AGAINST ('col1:"Single Store"~10');

Range Searches

Range searches match rows where the column(s) values are between the lower and upper bound specified by the query. Range searches can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically or by value depending on whether the search is over string or numeric values.

Square brackets ("[]") indicate an inclusive search, curly brackets ("{}") indicate an exclusive search. A range clause of [123 TO 345] implies a numeric range search, a range clause of [abc TO def] implies a lexicographic range search, and [$123$ TO $345$] forces a lexicographic range despite the numeric range values. Specific examples are provided below.

An inclusive range search matches all rows where the values are between the search terms, including values equal to the search terms. Inclusive range queries are denoted by square brackets (“[ ]”). For example, the following query searches for rows where the numeric value of col1 is 2008 through 2020:

SQL

SELECT * FROM rngsearch1 
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[2008 TO 2020]');

An exclusive range search matches all rows where the values are between the search terms, but are not equal to the search terms. Exclusive range queries are denoted by curly brackets (“{ }”). For example, the following query searches for rows where the numeric value of col1 is between 2008 and 2020, but is not equal to 2008 or 2020:

SQL

SELECT * FROM rngsearch1 
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{2008 TO 2020}');

Range searches may be lexicographic or numeric. Typically searches over strings are expected to order lexicographically while searches over numeric values are expected to order by numeric value. The type of range search to use (lexicographic or numeric) is determined by the values entered in the range clause in the query. If both the upper and lower bounds of the range are numeric, then a numeric range search will be performed. If either value is not numeric, then a lexicographic query will be used.

The following query will use a numeric search as both the values 2008 and 2020 are numeric.

SQL

SELECT * FROM rngsearch1 
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{2008 TO 2020}');

The following query will use a lexicographic search as both of the range values are strings that do not contain numeric values. Note that this query is inclusive of A, but exclusive of B.

SQL

SELECT * FROM rngsearch1 
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[A TO B}');

This query will use a numeric search as both the values "2008" and "2020" can be parsed numerically.

SQL

SELECT * FROM rngsearch1 
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:{"2008" TO "2020"}');

It is possible to force a lexicographic search on a numeric-style range clause. To do so, surround the numeric upper and lower bounds with the '$' character. The following query will use a lexicographic search.

SQL

SELECT * FROM rngsearch1 
WHERE MATCH (TABLE rngsearch1) AGAINST ('col1:[$2008$ TO $2020$]');

`ORDER BY … LIMIT` Optimization

Full-text search queries can take advantage of an optimization that pushes down the value of LIMIT in an ORDER BY … LIMIT query into the full-text index search. This optimization will reduce the number of results returned from the full-text index search which can increase query performance.

For this optimization to work, the query must follow these rules:

The WHERE clause must either be empty or the expression in the WHERE clause must be the same full-text expression that is in the ORDER BY clause.
The ORDER BY clause must use DESC sort order.
There must be only one full-text search function (MATCH, BM25, or BM25_GLOBAL) in the query.

Refer to Example ORDER BY … LIMIT and Full-text ORDER BY LIMIT Optimization for examples.
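As a sketch of the empty WHERE clause case, the following query against the articles table from the Overview has a single MATCH function, no WHERE clause, and a DESC sort on the full-text expression, so it follows the rules listed above. The alias score is illustrative.

```sql
-- One full-text function, no WHERE clause, ORDER BY ... DESC with LIMIT.
SELECT id, title,
   MATCH (TABLE articles) AGAINST ('body:database') AS score
FROM articles
ORDER BY score DESC
LIMIT 10;
```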

Relevancy Score

The relevancy score of an expression in a MATCH statement denotes the ranking of the expression based on the following factors:

Number of times an expression appears in a column. More occurrences of an expression in the matched column(s) increases its relevancy score.
Rarity of the expression. Rare words have a higher relevancy score than commonly used words.
The length of the column value containing the expression. A match in a short column value has a higher relevancy score than the same match in a long column value.
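To inspect relevancy scores directly, the MATCH expression can be selected as a column. This sketch uses the articles table from the Overview, and the alias relevancy is illustrative.

```sql
-- Return the relevancy score next to each matching row, highest score first.
SELECT title,
   MATCH (TABLE articles) AGAINST ('body:database') AS relevancy
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('body:database')
ORDER BY relevancy DESC;
```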

BM25 Scoring

Full-text search version 2 supports BM25 scoring. Refer to BM25 for more information.

Status

The Alloc_fts2_svc status variable is the number of allocated bytes (out of the total max_memory) attributable to the next-generation (VERSION 2) full-text subprocess.

View the value of Alloc_fts2_svc on the leaves with the SHOW LEAF STATUS EXTENDED command as shown below.

SQL

SHOW LEAF STATUS EXTENDED LIKE '%fts%';

The value of Alloc_fts2_svc on the aggregators can be viewed with the SHOW STATUS EXTENDED command. The Alloc_fts2_svc variable will have a value on the aggregators only when a BM25_GLOBAL search function is used. Thus, the command below will return results only if a BM25_GLOBAL search has been used.

SQL

SHOW STATUS EXTENDED LIKE '%fts%';

Configurations

A set of global variables is available to configure full-text search version 2. Refer to the full-text variables sections of List of Engine Variables for details. Also, refer to Configuring Full Text and Vector Indexes for information on configuring the engine for high performance for full-text indexes.

Note

If your system is experiencing high load due to full-text index builds, SingleStore recommends reducing the value of fts2_max_connections to 16 or 8. This change will reduce load, but will slow down indexing.
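A sketch of checking and lowering the setting is shown below. It assumes that fts2_max_connections can be changed at runtime with SET GLOBAL. Confirm how the variable is set for your deployment in List of Engine Variables before relying on this.

```sql
-- Check the current value.
SELECT @@fts2_max_connections;

-- Assumption: the variable is settable at runtime with SET GLOBAL.
SET GLOBAL fts2_max_connections = 16;
```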

Examples

Create a Version 2 Full-Text Search Index and Query Using MATCH

This example creates a FULLTEXT index for both the title column and the body column. Either column can be queried separately using MATCH (TABLE <table_name>) AGAINST (<expression>), and the index on the column will be applied.

SQL

CREATE TABLE articles (
    id INT UNSIGNED,
    year INT UNSIGNED,
    title VARCHAR(200),
    body TEXT,
    SORT KEY (id),
    FULLTEXT USING VERSION 2 art_ft_index (title, body));

SQL

INSERT INTO articles (id, year, title, body) VALUES
   (1, 2021, 'Introduction to SQL', 'SQL is a standard language for accessing and manipulating databases.'),
   (2, 2022, 'Advanced SQL Techniques', 'Explore advanced techniques and functions in SQL for better data manipulation.'),
   (3, 2020, 'Database Optimization', 'Learn about various optimization techniques to improve database performance.'),
   (4, 2023, 'SQL in Web Development', 'Discover how SQL is used in web development to interact with databases.'),
   (5, 2019, 'Data Security in SQL', 'An overview of best practices for securing data in SQL databases.'),
   (6, 2021, 'SQL and Data Analysis', 'Using SQL for effective data analysis and reporting.'),
   (7, 2022, 'Introduction to Database Design', 'Fundamentals of designing a robust and scalable database.'),
   (8, 2020, 'SQL Performance Tuning', 'Tips and techniques for tuning SQL queries for better performance.'),
   (9, 2023, 'Using SQL with Python', 'Integrating SQL with Python for data science and automation tasks.'),
   (10, 2019, 'NoSQL vs SQL', 'A comparison of NoSQL and SQL databases and their use cases.'),
   (11, 2020, 'Real-time Data Analysis', 'An introduction to real-time analytics.'),
   (12, 2021, 'Analysis for Beginners', 'Simple examples of real time analytics.'),
   (13, 2023, 'Data-Dictionary Design', 'Create and maintain effective data dictionaries.'),
   (14, 2024, 'Scalable Performance', 'Designing for scalability.');

OPTIMIZE TABLE articles FLUSH;

Search for a Single Word

Search for rows with the word database in the body column.

SQL

SELECT * 
FROM articles 
WHERE MATCH (TABLE articles) AGAINST ('body:database');

+----+------+---------------------------------+-----------------------------------------------------------------------------+
| id | year | title                           | body                                                                        |
+----+------+---------------------------------+-----------------------------------------------------------------------------+
| 7  | 2022 | Introduction to Database Design | Fundamentals of designing a robust and scalable database.                   |
| 3  | 2020 | Database Optimization           | Learn about various optimization techniques to improve database performance.|
+----+------+---------------------------------+-----------------------------------------------------------------------------+

Boolean OR Search

Search for rows with the word Database in the title column or the phrase Business Intelligence in the body column.

SQL

SELECT title 
FROM articles 
WHERE MATCH (TABLE articles) AGAINST ('title:Database OR body:("Business Intelligence")');

+---------------------------------+
| title                           |     
+---------------------------------+
| Introduction to Database Design |
| Database Optimization           |
+---------------------------------+

Boolean AND Search

Search for rows with the word SQL in the title column and the phrase "Data Security" in the title column.

SQL

SELECT title 
FROM articles 
WHERE MATCH (TABLE articles) AGAINST ('title:SQL AND title:("Data Security")');

+----------------------+
| title                |     
+----------------------+
| Data Security in SQL |
+----------------------+

Multiple MATCH Clauses

Use two MATCH clauses to search for rows with the word SQL in the title or the phrase Business Intelligence in the body, and that also have the word web in the body. In this example, the + prefix indicates that the search term is required.

SQL

SELECT title
    FROM articles  
    WHERE MATCH (TABLE articles) AGAINST ('title:SQL OR body:("Business Intelligence")')
    AND MATCH (TABLE articles) AGAINST ('body:(+web)');

+------------------------+
| title                  |     
+------------------------+
| SQL in Web Development |
+------------------------+

Use +, *, and ?

Search for rows that have a word starting with Data and followed by an arbitrary number of characters in the title, or words like function followed by a single character (e.g. functions, but not functional) in the title. In this example, the + indicates that the Data* term is required, and the trailing * matches any characters that follow Data.

SQL

SELECT title
    FROM articles  
    WHERE MATCH (TABLE articles) AGAINST ('title:(+Data*) OR title:function?');

+---------------------------------+
| title                           |     
+---------------------------------+
| Introduction to Database Design |
| SQL and Data Analysis           |
| Database Optimization           |
| Data Security in SQL            |
| Data-Dictionary Design          |
| Real-time Data Analysis         |
+---------------------------------+

Create a Version 2 Full-Text Index over JSON

A full-text index can be created over a JSON column in the same manner it can be created over any other text-type column.

SQL

CREATE TABLE ft_records (
    id INT UNSIGNED,
    title VARCHAR(200),
    records JSON,
    SORT KEY(id),
    FULLTEXT USING VERSION 2 rec_ft_index (title, records));

The full-text index is created over a JSON column by concatenating all leaf string values in the JSON as a multi-valued field. The engine variable fts2_position_increment_gap defines the logical spacing between concatenated leaf string values to prevent matching across different leaf string values. The default value is 100. In addition to having a field for each column in the full-text index, there will also be additional fields for each unique keypath in the JSON document.

SQL

INSERT INTO ft_records VALUES (
    1,
    'document',
    '{
        "k1": "cucumber",
        "k2": ["dragonfruit", "eggplant"],
        "k3": [
            {"k3_1": "fig", "k3_2": "grape"},
            {"k3_1": ["huckleberry", "iceberg lettuce"]},
            "jicama"
        ]
    }');

OPTIMIZE TABLE ft_records FLUSH;

SQL

SELECT title, records 
FROM ft_records 
WHERE id = 1;

+----------+--------------------------------------------------------------------------------------------------------------------------------------------+ 
| title    | records                                                                                                                                    |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
| document | {"k1":"cucumber","k2":["dragonfruit","eggplant"],"k3":[{"k3_1":"fig","k3_2":"grape"},{"k3_1":["huckleberry","iceberg lettuce"]},"jicama"]} | 
+----------+--------------------------------------------------------------------------------------------------------------------------------------------+

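The indexed document for the row inserted above is shown below as JSON. It contains a field for each column in the full-text index and a field for each unique keypath in the records column.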

JSON

{
    "title": "document",
    "records": ["cucumber", "dragonfruit", "eggplant",
                "fig", "grape", "huckleberry",
                "iceberg lettuce", "jicama"],
    "records$k1": "cucumber",
    "records$k2": ["dragonfruit", "eggplant"],
    "records$k3": "jicama",
    "records$k3.k3_1": ["fig", "huckleberry", "iceberg lettuce"],
    "records$k3.k3_2": "grape"
}

Here the records column is a multi-valued field of all leaf string values in the JSON. This field can be queried like any other full-text indexed field. An exception is that matching will not occur across separate values unless the maximum number of positions allowed between matching terms in a phrase, or the “slop,” is greater than or equal to the value of fts2_position_increment_gap.

Along with records, other fields like records$k1 and records$k3.k3_1 are created to allow searching at each unique keypath present in the JSON document. The dollar sign ("$") is used as a delimiter between the SQL column name and the JSON keypath.

Query over a JSON column

Querying over the entire JSON column can be performed in the same way as with any other column that is part of the full-text index.

SQL

SELECT (MATCH (TABLE ft_records) AGAINST ('records:/.*cumber/')) AS cumber
FROM ft_records;

+----------+ 
| cumber   |
+----------+ 
| 1        | 
+----------+

Query over a JSON keypath

The following example shows how you can search for the string fig at the keypath k3.k3_1 in the records column by using the field name records$k3.k3_1.

SQL

SELECT (MATCH (TABLE ft_records) AGAINST ('records$k3.k3_1:fig')) AS fig
FROM ft_records;

+------------------------+ 
| fig                    |
+------------------------+ 
| 0.13076457381248474    | 
+------------------------+
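Keypath fields can also be combined with boolean operators. The following sketch relies on the records$k2 and records$k3 fields listed in the indexed document above and requires a term at each keypath.

```sql
-- Require eggplant at keypath k2 and jicama at keypath k3.
SELECT id
FROM ft_records
WHERE MATCH (TABLE ft_records) AGAINST ('records$k2:eggplant AND records$k3:jicama');
```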

Query for Two Terms

Enclose the terms cucumber and raspberry in parentheses to do a boolean OR search for the terms (cucumber and raspberry) in the document. The document matches because the term cucumber appears in the document.

SQL

SELECT id
FROM ft_records
WHERE MATCH (TABLE ft_records) AGAINST ('records:(cucumber raspberry)');

+------+ 
| id   |
+------+ 
| 1    | 
+------+

Phrase Query and Proximity Search

In the example below, quotes are placed around the words cucumber and dragonfruit to search for the phrase "cucumber dragonfruit".

SQL

SELECT id 
FROM ft_records 
WHERE MATCH (TABLE ft_records) AGAINST ('records:"cucumber dragonfruit"');

Empty set (0.008 sec)

The phrase "cucumber dragonfruit" does not match the JSON document in the ft_records table because the words cucumber and dragonfruit belong to different leaf strings in that document.

When a proximity search query for the phrase "cucumber dragonfruit" with a slop of ~100 is used, the JSON document matches.

SQL

SELECT id
FROM ft_records
WHERE MATCH (TABLE ft_records) AGAINST ('records:"cucumber dragonfruit"~100');

+------+
| id   | 
+------+ 
| 1    | 
+------+

Slop indicates the maximum number of words allowed between words in a phrase for the phrase to be considered a match. That is, a slop of ~100 means that if there are 100 or fewer words between cucumber and dragonfruit in the document, the document will be considered a match.

Further, on multi-valued fields such as JSON (or BSON) columns, a proximity search matches across separate values only when the slop value is greater than or equal to fts2_position_increment_gap.

In this example, slop is 100 and fts2_position_increment_gap is 100 (the default value). Since slop is equal to fts2_position_increment_gap, matching across separate values occurs. And since the words cucumber and dragonfruit are within 100 words of each other in the document, a match is returned.
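For contrast, the sketch below uses a slop of 50, which is smaller than the default fts2_position_increment_gap of 100. Under the rule described above, this query would be expected not to match across the separate k1 and k2 leaf values.

```sql
-- Slop (50) is below fts2_position_increment_gap (100 by default),
-- so cucumber (k1) and dragonfruit (k2) are not expected to match as a phrase.
SELECT id
FROM ft_records
WHERE MATCH (TABLE ft_records) AGAINST ('records:"cucumber dragonfruit"~50');
```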

Use of ~ in Fuzzy Search and Proximity Search

As described above, the ~ symbol is used for both proximity searches and fuzzy searches. If the ~ appears after a phrase (multiple words delimited by quotes "), this indicates that a proximity search with a slop value of the number appearing after the ~ should be performed. If the ~ appears after a single word, this indicates that an edit distance (Levenshtein) comparison should be performed on that word (also called a fuzzy search).

The following example shows a fuzzy search.

SQL

SELECT id
FROM ft_records
WHERE MATCH (TABLE ft_records) AGAINST ('records:dronfruit~2');

+------+
| id   |
+------+
|    1 |
+------+

A match is returned because dronfruit, which is intentionally misspelled, is within an edit distance of 2 of dragonfruit.

Phrase Search vs. Boolean Term Search

It is important to understand the difference between a phrase search and a boolean term search as it relates to JSON fields.

Words in quotes (like "cucumber dragonfruit") are searched for as a single phrase in a single JSON field. Words that appear in parentheses (like (cucumber dragonfruit)) are searched as if there is a boolean OR between the words. When terms are separated by logical operators, they may be matched in different JSON fields.

In the following example, cucumber and dragonfruit appear in fields k1 and k2 respectively, and thus a match occurs.

SQL

SELECT id
FROM ft_records
WHERE MATCH (TABLE ft_records) AGAINST ('records:(cucumber dragonfruit)');

+------+
| id   |
+------+
|    1 |
+------+
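The same behavior can be sketched with an explicit logical operator. The query below is an illustrative variation, not taken from the original examples: it requires both terms with AND, and each term may still be matched in a different JSON field. With cucumber in k1 and dragonfruit in k2, it would be expected to return the same row.

```sql
-- Explicit AND: both terms are required, but each term may be
-- matched in a different JSON field of the records column.
SELECT id
FROM ft_records
WHERE MATCH (TABLE ft_records) AGAINST ('records:(cucumber AND dragonfruit)');
```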

Example - ORDER BY … LIMIT

The query below takes advantage of the ORDER BY ... LIMIT pushdown. In this query there is one MATCH function, aliased as match_res; the WHERE and ORDER BY clauses both use that same alias; and the sort order is DESC.

SQL

SELECT id, title, 
   MATCH (TABLE articles) AGAINST ('body:database') AS match_res
FROM articles 
WHERE match_res
ORDER BY match_res DESC LIMIT 25;

Index Repair

Full-text index creation failure is rare. However, if full-text index creation fails, you will receive the error ER_FTS_INDEX_NEEDS_REPAIR_ON_SEGMENT. The index can be repaired by running OPTIMIZE TABLE <tablename> FIX_FULLTEXT.
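As a concrete instance of the repair command, the statement below applies it to the articles table used throughout this page. The table name is only a stand-in for whichever table reported the error.

```sql
-- Repair the full-text index on articles after
-- ER_FTS_INDEX_NEEDS_REPAIR_ON_SEGMENT has been reported.
OPTIMIZE TABLE articles FIX_FULLTEXT;
```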

Legacy (Version 1) SingleStore Process

SingleStore's legacy full-text search system uses a CLucene service which is embedded in the SingleStore database engine.

Full-text indexes are only supported on columnstore tables. They can only be enabled as part of a CREATE TABLE statement using the FULLTEXT index type. This means full-text indexes cannot be dropped or altered after the table is created. If the table is dropped, then the index is deleted automatically.

SQL

CREATE TABLE <table_name> (FULLTEXT [<fts_index_name>] (<fts_col>))

Content in columns that are full-text indexed can be searched using the MATCH function. Each MATCH clause applies to only one table. To search against multiple tables, specify multiple MATCH clauses.
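A search across two tables might look like the sketch below. The comments table, its article_id and comment_text columns, and its full-text index on comment_text are hypothetical and introduced only for illustration. The point is that each table gets its own MATCH clause.

```sql
-- Hypothetical second table: comments(article_id, comment_text),
-- assumed to have its own legacy full-text index on comment_text.
SELECT articles.id, articles.title
FROM articles
JOIN comments ON comments.article_id = articles.id
WHERE MATCH (title, body) AGAINST ('database')       -- applies to articles
  AND MATCH (comment_text) AGAINST ('performance');  -- applies to comments
```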

Upgrade to Full-Text Version 2 describes how to upgrade to full-text VERSION 2.

Note

New inserts and updates into columnstore tables may initially be stored in a hidden rowstore table before being flushed to a segment file. The affected segment is re-indexed when the background flusher runs.

In that case, the full-text index in the columnstore will be updated asynchronously for new inserts and updates. Inserts and updates from this rowstore table can be force-pushed to the columnstore table by using the OPTIMIZE TABLE <table_name> FLUSH command.

Since an index is created for each segment file, the distribution of words within the segment may affect the score of full-text queries, especially when the segments have very few rows and the columns have very few words.
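The flush described in this note can be sketched as follows, using the articles table from the examples on this page as a stand-in for <table_name>. Whether a manual flush is needed in practice depends on the workload, since the background flusher normally handles it.

```sql
-- Force rows held in the hidden rowstore table into columnstore
-- segments so that the affected segments are re-indexed.
OPTIMIZE TABLE articles FLUSH;

-- Then run the full-text query against the flushed data.
SELECT id, title
FROM articles
WHERE MATCH (title, body) AGAINST ('database');
```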

The following example illustrates how to create a table with a legacy version full-text search index and how to query from that table. The USING VERSION 1 syntax is optional.

SQL

CREATE TABLE articles (
    id INT UNSIGNED,
    year int UNSIGNED,
    title VARCHAR(200),
    body TEXT,
    SORT KEY (id),
    FULLTEXT USING VERSION 1 (title, body));

SQL

SELECT * FROM articles
    WHERE MATCH (title,body)
    AGAINST ('database');

Refer to the MATCH page for more examples.

Related Topics

Working with Vector Data - Allows for semantic searching, which is searching based on meanings, not keywords.
Hybrid Search - Allows full-text and vector search methods in one query. Full-text and vector search ranking can be combined.
Configuring Full Text and Vector Indexes
BM25 and BM25_GLOBAL
MATCH
HIGHLIGHT - HIGHLIGHT is not supported in VERSION 2.
Training: Full-Text Index and Search
