Hybrid Search - Re-ranking and Blending Searches

Hybrid Search combines multiple search methods in one query. Hybrid search can blend full-text search (which finds keyword matches) and vector search (which finds semantic matches) allowing search results to be (re-)ranked by a score that combines full-text and vector rankings. This process allows you to combine the benefits of fine-grained control provided by full-text search with semantic vector search. Further, SingleStore has Vector Indexing and full-text indexing (Working with Full-Text Search) which can be used to speed up a hybrid search.

Example 1 provides a simple example of how hybrid search can combine full-text search, vector search, and a SQL filter. Example 2 shows an indexed hybrid search over data taken from Wikipedia pages.

Example 1 - Hybrid Search of Comments

This example demonstrates a hybrid search over a comments table. The example combines: a semantic search of comments that relate to a user's request, full-text search to prioritize comments with a specific keyword, and a SQL filter to restrict to comments in the "Food" category.

Consider a table of comments that contains:

  • id – an INT id

  • comment – a TEXT field which stores the text of the comment

  • comment_embedding –  a VECTOR, which stores a vector embedding that captures the meaning of the comment as a vector, as described in Working with Vector Data.

  • category – a VARCHAR which stores a category assigned to the comment

Create the comments table and insert three rows into that table.

CREATE TABLE comments(id INT,   
comment TEXT,   
comment_embedding VECTOR(4) not null,   
category VARCHAR(256));
INSERT INTO comments VALUES       
(1, "The cafeteria in building 35 has a great salad bar",    
'[0.45, 0.55, 0.495, 0.5]',
    "Food"),
(2, "I love the taco bar in the B16 cafeteria.",     
   '[0.01111, 0.01111, 0.1, 0.999]',
    "Food"),
(3, "The B24 restaurant salad bar is quite good.",    
    '[0.1, 0.8, 0.2, 0.555]',
"Food");

Add a full-text index to the table and run OPTIMIZE TABLE to ensure the values above are indexed.

ALTER TABLE comments ADD FULLTEXT ft_comment(comment);
OPTIMIZE TABLE comments FULL;

The SQL query below first creates a query vector (@query_vec). The query searches for comments that are similar to that query. Specifically, the query blends full-text search to re-rank comments based on a combination of a full-text index score and a semantic vector search score. In addition, the query has a filter to select only comments in the "Food" category.

The @query_vec variable is cast to a VECTOR to ensure that @query_vec is a valid VECTOR and to improve performance.

To try out SQL queries, click the Playground icon to the right of the SQL listing.

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);
SELECT id, SUBSTRING(comment,0,20) AS comment,
FORMAT(@query_vec <*> comment_embedding,4) AS vec_score,
FORMAT(MATCH(comment) AGAINST ("restaurant"),4) AS ft_score,
FORMAT((vec_score + ft_score) / 2,4) AS combined_score
FROM comments
WHERE category = "Food"
ORDER BY combined_score DESC
LIMIT 2;
+------+----------------------+-----------+----------+----------------+
| id   | comment              | vec_score | ft_score | combined_score |
+------+----------------------+-----------+----------+----------------+
|    3 | The B24 restaurant s | 0.8993    | 0.1151   | 0.5072         |
|    1 | The cafeteria in bui | 0.9810    | 0.0000   | 0.4905         |
+------+----------------------+-----------+----------+----------------+

In the results, the row with id 3 has the word "restaurant" in it, so it gets a boost from the full-text score in this query and is the top-ranked result. That is, in these results, the row with id 3 is ranked above the comment with id 1. In contrast, if only the vector score was considered, the comment with id 1 would be ranked above the comment with id 3 as is shown in the Example Search Based on Vector Similarity in Working with Vector Data.

Also note that in this query, the comment string and the scores have been truncated for readability of the result.

The availability of vector search, full-text search, and standard SQL operations in SingleStore allows you to combine them as shown above.

Example 2 - Hybrid Search of Wikipedia Articles

SingleStore provides full-text search (Working with Full-Text Search) and Vector Indexing which can be used to speed up a hybrid search. This example searches over a dataset of Wikipedia articles to show how these indexes can be used in hybrid searches. The example is also described in this blog post on Hybrid Search.

SingleStore has created a dataset of Wikipedia articles. This data set contains 160M vectors and associated paragraphs to simulate vector-based semantic search over all 6.7 million articles in Wikipedia. The data set has real data for video games, and the rest of the data is mocked up. The full data set, including how to load and generate the vectors, is described in this blog on ANN search. The data set is available under the Creative Commons Attribution-ShareAlike License 4.0.

The example below uses just the data for video games, which is approximately 41,000 vectors based on 1,800 articles from Wikipedia.

Use the following SQL commands to run this example or try out this example in a SingleStore Notebook: Hybrid Full-text and Vector Search.

First, create a table to hold the vectors and paragraphs.

CREATE TABLE vecs(
id BIGINT(20),
url TEXT DEFAULT NULL,
paragraph TEXT DEFAULT NULL,
v VECTOR(1536) NOT NULL,
SHARD KEY(id), KEY(id) USING HASH
);

Load data into this table using the following pipeline.

-- since the bucket is open, you can leave the credentials clause as-is
CREATE OR REPLACE PIPELINE wiki_pipeline AS
load data S3
's3://wikipedia-video-game-data/video-game-embeddings(1).csv'
config '{"region":"us-west-1"}'
credentials '{"aws_access_key_id": "", "aws_secret_access_key": ""}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE vecs
FORMAT csv
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';
START PIPELINE wiki_pipeline FOREGROUND;

Run full-text and vector indexes on the table. Creating the indexes after data loading is more efficient in the index build process than creating indexes before data loading.

ALTER TABLE vecs ADD FULLTEXT ft_para(paragraph);
ALTER TABLE vecs ADD VECTOR INDEX ivf_v(v)
INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';

Run the following command to optimize the index searches and to ensure the values above are indexed.

OPTIMIZE TABLE vecs FULL;

A hybrid search query that uses both full-text and vector indexes can be constructed this way:

  • Create a subquery that uses vector search

  • Create a subquery that uses keyword search for similar content

  • Join the two with a full outer join

  • Produce a final result that combines the vector search score and full-text search score, and re-ranks

The query below first creates a query vector (@v_mario_kart) and then follows this outline and does an indexed hybrid vector and full-text search.

/* Get the vector for the first paragraph about the Mario Kart Game.
It's a good semantic query vector for Mario Kart. */
SET @v_mario_kart = (SELECT v FROM vecs
WHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"
ORDER BY id LIMIT 1);
WITH fts AS (
SELECT id, paragraph,
MATCH(paragraph) AGAINST("mario kart") AS SCORE
FROM vecs
WHERE MATCH(paragraph) AGAINST("mario kart")
ORDER BY SCORE desc
LIMIT 200
),
vs AS (
SELECT id, paragraph, v <*> @v_mario_kart AS SCORE
FROM vecs
ORDER BY score DESC
LIMIT 200
)
SELECT vs.id, SUBSTRING(vs.paragraph,0,25),
FORMAT(IFNULL(fts.score, 0) * .3
+ IFNULL(vs.score, 0) * .7, 4) AS score,
FORMAT(fts.score, 4) AS fts_s,
FORMAT(vs.score, 4) AS vs_s
FROM fts FULL OUTER JOIN vs ON fts.id = vs.id
ORDER BY score DESC
LIMIT 5;
+------------------+------------------------------+--------+--------+--------+
| id               | SUBSTRING(vs.paragraph,0,25) | score  | fts_s  | vs_s   |
+------------------+------------------------------+--------+--------+--------+
| 1125899906848349 | Super Mario Kart is a kar    | 0.9943 | 0.9811 | 1.0000 |
| 1125899906848372 | Nintendo re-released Supe    | 0.9353 | 1.0000 | 0.9075 |
| 2251799813690085 | Mario Kart DS was the bes    | 0.9239 | 1.0000 | 0.8912 |
| 1125899906848375 | Several sequels to Super     | 0.9144 | 0.8918 | 0.9240 |
| 1125899906855047 | Mario Kart: Super Circuit    | 0.9130 | 0.9521 | 0.8962 |
+------------------+------------------------------+--------+--------+--------+

The final ranked result (by score) is different from the order for the full-text score (fts_s) or the vector similarity score (vs_s). So both the full-text search and the vector search have an influence on the result.

SQL is used here in a relatively compact way. You can combine full-text and vector search and re-rank in one query rather than having to use an application program to combine results from a vector database, a full-text database, and a relational database.

For more details on this example and performance results see our blog on ANN search.

Last modified: November 21, 2024

Was this article helpful?