BM25
On this page
The BM25_
and BM25
functions are used to employ BM25 scoring for VERSION 2
full-text search queries.
For columnstore tables created with a VERSION 2 FULLTEXT
index, the text columns in that table can be searched using the BM25_
and BM25
functions.BM25_
function scores all rows in a table together, while the BM25
function provides partition-level scoring.BM25_
, collection and term statistics are calculated for a table, ensuring accurate scores relative to all rows in a table.BM25
, collection and term statistics are calculated within a partition, ensuring accurate scores relative to a row's placement within a partition.
The BM25_
function is more accurate than the BM25
function and is also more expensive.BM25
the accuracy of the scores depends on how data is distributed within each partition.BM25
will produce accurate scores that are close to BM25_
.
Queries using the MATCH
syntax are scored using BM25 on a segment level.BM25
and BM25_
return more accurate results than MATCH
but are less efficient.
Thus, the functions MATCH
, BM25
, and BM25_
provide BM25 scores with a tradeoff between accuracy and performance with MATCH
as the least accurate, but most efficient, and BM25_
as the most accurate and most expensive.
The result of the BM25
, BM25_
, and MATCH
functions is a relevancy score.
Important
To include recent inserts/updates from in the results, run OPTIMIZE TABLE tbl_
before running your query.
Refer to Working with Full-Text Search for conceptual information on full-text search.
Syntax
The syntax for BM25
and BM25_
scoring is as follows:
BM25(<table_name>, <expression>)BM25_GLOBAL(<table_name>, <expression>)
The USING VERSION 2
syntax must be used in the CREATE TABLE
command to utilize BM25
and BM25_
.VERSION 2
.
BM25 full-text search can be customized for languages other than English; refer to Full Text VERSION 2 Custom Analyzers for more information.
Each BM25
or BM25_
clause applies to only one table.BM25
or BM25_
clauses to search against multiple tables.
Operators
A set of operators from Java Lucene full-text search string syntax is supported.<expression>
consists of a list of one or more columns, followed by a mix of text with zero or more of the following operators.
Operator |
Description |
---|---|
(no operator) |
When no operator is specified, the word is optional; however, the rows that contain it are rated higher. |
+ |
A leading plus sign indicates that this word must be present in each row returned. |
- |
A leading minus sign indicates that this word must not be present in any of the rows that are returned. |
NOT |
The |
AND |
The |
OR |
The |
() |
Parentheses group words into subexpressions. |
"" |
A phrase is a quoted string with two or more space-delimited terms. A phrase that is enclosed within double quote (") characters matches the words in the quotes as if it is a single word. |
~<proximity> |
A |
Stopwords
Certain words are ignored by full-text search due to their commonality resulting in less relevant results.
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
Customized stopword lists can be used with custom analyzers.
Special Characters
The default tokenizer for full-text search removes special characters from the search.
+ - && || ! ( ) { } [ ] ^ " ~ * ? : \
Custom tokenizers can be used to search for strings that include those characters.
Relevancy Score
The relevancy score of an expression in a BM25
or BM25_
statement denotes the ranking of the expression based on the following factors:
-
Number of times an expression appears in a column.
More occurrences of an expression in the matched column(s) increases its relevancy score. -
Rarity of the expression.
Rare words have a higher relevancy score than commonly used words. -
The length of the column containing the expression.
A column with a short expression has a higher relevancy score than a column with a long expression.
Remarks
-
The column specification in the
<expression>
may contain a JSON keypath.A keypath is an expression containing one or more keys, delimited by dots, that describes the hierarchical path to a field in a JSON document. For example, key1.
would be a valid keypath.key2. key3 -
Refer to MATCH for more examples of full-text search with the above operators.
Examples
Note
The examples in this section use the BM25
function.BM25_
.BM25
and BM25_
is the table-level vs.
Create Example Data
Create a table with a FULLTEXT VERSION 2
index on columns title and body.
Either column (title
or body
) can be queried separately using the BM25
function, and the index on the column will be applied.
CREATE TABLE articles (id INT UNSIGNED,year int UNSIGNED,title VARCHAR(200),body TEXT,SORT KEY (id),FULLTEXT USING VERSION 2 art_ft_index (title, body));
INSERT INTO articles (id, year, title, body) VALUES(1, 2021, 'Introduction to SQL', 'SQL is a standard language for accessing and manipulating databases.'),(2, 2022, 'Advanced SQL Techniques', 'Explore advanced techniques and functions in SQL for better data manipulation.'),(3, 2020, 'Database Optimization', 'Learn about various optimization techniques to improve database performance.'),(4, 2023, 'SQL in Web Development', 'Discover how SQL is used in web development to interact with databases.'),(5, 2019, 'Data Security in SQL', 'An overview of best practices for securing data in SQL databases.'),(6, 2021, 'SQL and Data Analysis', 'Using SQL for effective data analysis and reporting.'),(7, 2022, 'Introduction to Database Design', 'Fundamentals of designing a robust and scalable database.'),(8, 2020, 'SQL Performance Tuning', 'Tips and techniques for tuning SQL queries for better performance.'),(9, 2023, 'Using SQL with Python', 'Integrating SQL with Python for data science and automation tasks.'),(10, 2019, 'NoSQL vs SQL', 'A comparison of NoSQL and SQL databases and their use cases.'),(11, 2020, 'Real-time Data Analysis', 'An introduction to real-time analytics.'),(12, 2021, 'Analysis for Beginners', 'Simple examples of real time analytics.'),(13, 2023, 'Data-Dictionary Design', 'Create and maintain effective data dictionaries.'),(14, 2024, 'Scalable Performance', 'Designing for scalability.');OPTIMIZE TABLE articles FLUSH;
Example 1 - Search for a Term
Search for articles with the word 'database'
in the body.BM25
score for the row is non-zero.
SELECT id, title, body, BM25(articles,'body:database') AS scoreFROM articlesWHERE BM25(articles,'body:database');
+------+---------------------------------+------------------------------------------------------------------------------+---------------------+
| id | title | body | score |
+------+---------------------------------+------------------------------------------------------------------------------+---------------------+
| 7 | Introduction to Database Design | Fundamentals of designing a robust and scalable database. | 0.13076457381248474 |
| 3 | Database Optimization | Learn about various optimization techniques to improve database performance. | 0.3219999074935913 |
+------+---------------------------------+------------------------------------------------------------------------------+---------------------+
Example 2 - Use BM25 Twice in one Query
SELECT titleFROM articlesWHERE BM25(articles, 'title:SQL OR body:("Business Intelligence")')AND BM25(articles,'body:web');
+------------------------+
| title |
+------------------------+
| SQL in Web Development |
+------------------------+
Example 3 - Use BM25 with a SQL Predicate
SELECT count(*)FROM articlesWHERE year = 2021AND BM25(articles,'body:SQL');
+----------+
| count(*) |
+----------+
| 2 |
+----------+
Example 4 - Score over Multiple Columns
Search for the word 'database'
across two columns
SELECT *FROM articlesWHERE BM25(articles,'title:database OR body:database' );
+------+------+---------------------------------+------------------------------------------------------------------------------+
| id | year | title | body |
+------+------+---------------------------------+------------------------------------------------------------------------------+
| 7 | 2022 | Introduction to Database Design | Fundamentals of designing a robust and scalable database. |
| 3 | 2020 | Database Optimization | Learn about various optimization techniques to improve database performance. |
+------+------+---------------------------------+------------------------------------------------------------------------------+
Example 5 - Use a Single Boolean Operator
Boolean operators allow terms to be combined through logic operators.
Note
Boolean operators must be in ALL CAPS.
SELECT titleFROM articlesWHERE BM25(articles,'title:Database OR title:"Business Intelligence"');
+---------------------------------+
| title |
+---------------------------------+
| Introduction to Database Design |
| Database Optimization |
+---------------------------------+
Example 6 - Use Multiple Boolean Operators
The following example returns the title of an article that contains either 'Database'
or 'Business Intelligence'
in the title and the string 'real-time analytics'
in the body.
SELECT titleFROM articlesWHERE BM25(articles,'(title:Database OR title:"Analysis") AND (body:"real-time analytics")');
+-------------------------+
| title |
+-------------------------+
| Analysis for Beginners |
| Real-time Data Analysis |
+-------------------------+
Example 7 - Score over JSON
For this and the following example, create a table named scoring
with a FULLTEXT VERSION 2
index on columns b
and c
.
CREATE TABLE scoring (a INT, b TEXT, c JSON, SORT KEY(a),FULLTEXT USING VERSION 2 scr_ft_index (b, c));INSERT INTO scoring VALUES(1, 'one two three three four four four five five five five five','{"one": "two"}');INSERT INTO scoring VALUES(2, 'one two two three three three three four four four four four four four four','{"one": "three"}');OPTIMIZE TABLE scoring FLUSH;
To obtain BM25
scores over JSON documents at a keypath, use the following query.one:three
.
SELECT a, ROUND(BM25(scoring, 'c$one:three'), 3) AS score_oneFROM scoring;
+-----+-----------+
| a | score_one |
+-----+-----------+
| 1 | 0.000 |
| 2 | 0.131 |
+-----+-----------+
Example 8 - Use Phrases and Proximity Search
Proximity search enables you to find words within a specific distance of each other.~
, symbol with a <proximity>
value at the end of a phrase."Single"
and "Store"
within 10
words of each other in a row use the search: "Single Store"~10
.
The following example finds BM25
scores over phrases using proximity search.
SELECT a, ROUND(BM25(scoring,'b:"two four"'), 3) AS score_two_four, ROUND(BM25(scoring,'b:"two four"~2'), 3) AS score_two_four_proximity_2, ROUND(BM25(scoring,'b:"two four"~4'), 3) AS score_two_four_proximity_4FROM scoring;
+-----+----------------+----------------------------|----------------------------+
| a | score_two_four | score_two_four_proximity_2 | score_two_four_proximity_4 |
+-----+----------------+----------------------------|----------------------------+
| 1 | 0.000 | 0.125 | 0.125 |
| 2 | 0.000 | 0.000 | 0.082 |
|+----+----------------+----------------------------+----------------------------+
Related Topics
Last modified: November 14, 2024