Hybrid Search - Re-ranking and Blending Searches

Combining the syntactic (keyword-based) ranking from full-text text search and the semantic ranking from vector search is often important. For example, if it is significant that a particular keyword or phrase appears in a text chunk, but you do not want to search on keyword or phrase alone and want to also include semantic (meaning-based) search.

Combining full-text and vector search allows you to combine the benefits of fine-grained control provided by full-text search with semantic vector search. Further, SingleStore has Vector Indexing and full-text indexing which can be used to speed up a hybrid search.

Reciprocal Rank Fusion

A common approach to blending results from two or more different rankings (e.g. semantic text search and keyword-based vector search) is reciprocal rank fusion (RRF). RRF scores based on ranking and does not directly rely on full-text text search scores or vector similarity metric scores (e.g. dot product). RRF is thus not dependent on any particular score or magnitude range.

The RRF formula for N different ranked lists where r is the rank of an item in list i, is shown in the following.

In this formula, k is a smoothing factor which smooths out the decline in weights. Without a smoothing factor (k=0) RRF would give weight 1.0 to item ranked 1, weight 0.5 to item ranked 2 and so on. With a smoothing factor of k=60, the item ranked 1 is weighted 1/61 ≈ 0.0164 and the item ranked 2 is weighted 1/62 ≈ 0.0161, creating a much more gradual decline in weights.

You can further refine RRF by weighting each ranking differently. For example, you could weight full-text search results at 0.7 and vector search results at 0.3.

Examples

It is possible, as shown in the following examples below, to combine full-text and vector search rankings with RRF in a SQL query.

Example 1 shows an indexed hybrid search over data taken from Wikipedia pages. Example 2 provides an example of how hybrid search can combine full-text search, vector search, and a SQL filter.

In these examples, CTEs, the ROW_NUMBER window function and general expressions are used to perform powerful hybrid search rankings. You can use user-defined functions (UDFs), such as SingleStore's external functions or Wasm UDFs, to create functions and queries that implement late interaction models, cross-encoders or other reranking techniques.

Example 1 - Hybrid Search of Wikipedia Articles

SingleStore provides full-text search (Working with Full-Text Search) and Vector Indexing which can be used to speed up a hybrid search. This example searches over a dataset of Wikipedia articles to show how these indexes can be used in hybrid searches.

SingleStore has created a dataset of Wikipedia articles. This dataset contains 160M vectors and associated paragraphs to simulate vector-based semantic search over all 6.7 million articles in Wikipedia. The dataset has real data for video games, and the rest of the data is mocked up. The full dataset, including how to load and generate the vectors, is described in this blog on ANN search. The data set is available under the Creative Commons Attribution-ShareAlike License 4.0.

The following example uses just the data for video games, which is approximately 41,000 vectors based on 1,800 articles from Wikipedia.

Use the following SQL commands to run this example.

Create a table to hold the vectors and paragraphs.

SQL

CREATE TABLE vecs(
id BIGINT(20),
     url TEXT DEFAULT NULL,
     paragraph TEXT DEFAULT NULL,
     v VECTOR(1536) NOT NULL,
     SHARD KEY(id), KEY(id) USING HASH
);

Load data into this table using the following pipeline.

SQL

-- since the bucket is open, you can leave the credentials clause as-is
CREATE OR REPLACE PIPELINE wiki_pipeline AS
load data S3
's3://singlestore-docs-example-datasets/wikipedia-video-game-data/video-game-embeddings.csv'
config '{"region":"us-east-1"}'
credentials '{"aws_access_key_id": "", "aws_secret_access_key": ""}'
SKIP DUPLICATE KEY ERRORS
INTO TABLE vecs
FORMAT csv
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';

START PIPELINE wiki_pipeline FOREGROUND;

Add a full-text index and a vector index on the table. Creating the indexes after data loading is more efficient in the index build process than creating indexes before data loading.

SQL

ALTER TABLE vecs ADD FULLTEXT ft_para(paragraph);

ALTER TABLE vecs ADD VECTOR INDEX ivf_v(v)
    INDEX_OPTIONS '{"index_type":"IVF_FLAT"}';

Run the following command to optimize the index searches and to ensure the values above are indexed.

SQL

OPTIMIZE TABLE vecs FLUSH;

The following query searches for rows with information about Super Mario Kart, by using full-text search to search with paragraphs that match the string super mario kart and that have embedding vectors (v) that are similar to the embedding for the introduction to the Wikipedia Super Mario Kart page.

The variable are set prior to defining the query.

@v_mario_kart to hold the vector embedding for the Mario Kart Wikipedia page.

At a high-level, the query is structured as follows:

Two Common Table Expressions (CTEs) use indexed full-text and vector search to produce ordered lists of rows that best meet the full-text and vector search criteria.
Two additional CTEs create ranked full-text and vector results.
A combined score is calculated using the RRF formula, and the top 5 results are returned.

SQL

SET @v_mario_kart = (SELECT v FROM vecs
          WHERE URL = "https://en.wikipedia.org/wiki/Super_Mario_Kart"
          ORDER BY id LIMIT 1);

WITH fts AS ( /* Find top full-text matches. */
  SELECT
    id,
    paragraph,
    MATCH(paragraph) AGAINST ("super mario kart") AS SCORE
  FROM vecs
    WHERE MATCH(paragraph) AGAINST ("super mario kart")
 ORDER BY SCORE desc
  LIMIT 200
),
vs AS ( /* Find top vector search matches. <*> is dot product. */
  SELECT
    id,
    paragraph,
    v <*> @v_mario_kart AS SCORE
  FROM vecs
  ORDER BY score DESC
  LIMIT 200
),
fts_ranked as (
  SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY SCORE DESC) AS fts_rank
  FROM fts
),
vs_ranked as (
  SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY SCORE DESC) as vs_rank
  FROM vs
)
SELECT
  *,
  0.7 * (1.0 / (NVL(fts_rank, 1000.0) + 60)) + 0.3 * (1.0 / (NVL(vs_rank, 1000.0) + 60)) as combined_score
FROM fts_ranked f FULL OUTER JOIN vs_ranked v ON f.id = v.id
ORDER BY combined_score DESC
LIMIT 5;

In more detail, this query uses the following structure:

The first two CTEs in the query (fts and vs) create ordered lists of 200 rows that best match the full-text and vector search criteria.
- The full-text search block (fts) uses inverted index search based on Java Lucene.
- The vector search block (vs) uses approximate nearest neighbor (ANN) indexing.
The next two CTEs use the ROW_NUMBER window function to create ranked lists (fts_ranked and vs_ranked) of these 200 rows.
- The ranked lists are created in separate CTEs to allow the initial CTEs to use indexed search.
- If the ROW_NUMBER calculation were pushed into the initial full-text and vector and search CTEs (fts and vs), that would prevent those CTEs from using index search.
The final query block uses a FULL OUTER JOIN to ensure that rows for items that appear in one ranked list but not the other are included in the results. Standard INNER JOIN would discard items that are not in both lists.
The SELECT clause calculates the combined_score using RRF and returns the top 5 results with highest combined_score.

This query uses a smoothing factor of 60 and weights full-text results at 0.7 and vector search results at 0.3. NULL ranks — which can arise from the outer join — are given an artificially high rank of 1000 so they don't contribute meaningfully to the score.

Notice that values with high ranks rise to the top (as with the (2,1) rank pair), but a poor ranking on one sorted input doesn't prevent an item from participating in the results at a high level, e.g., the (3,11) pair ranks 3 overall, better than (5,8) which ranks at 4.

Example 2 - Hybrid Search of Comments

This example demonstrates a hybrid search over a comments table. The example combines:

A semantic search using vector search to find comments that relate to a user's request.
Full-text search to prioritize comments with a specific keyword.
A SQL filter that restricts to the "Food" category.

Consider a table of comments that contains:

id – an INT id
comment – a TEXT field which stores the text of the comment
comment_embedding – a VECTOR, which stores a vector embedding that captures the meaning of the comment as a vector, as described in Working with Vector Data.
category – a VARCHAR which stores a category assigned to the comment

Create the comments table and insert three rows into that table.

SQL

CREATE TABLE comments(id INT,
   comment TEXT,
   comment_embedding VECTOR(4) NOT NULL,
   category VARCHAR(256));

SQL

INSERT INTO comments VALUES
(1, "The cafeteria in building 35 has a great salad bar", 
    '[0.45, 0.55, 0.495, 0.5]',
    "Food"),
(2, "I love the taco bar in the B16 cafeteria.",
    '[0.01111, 0.01111, 0.1, 0.999]',
    "Food"),
(3, "The B24 restaurant salad bar is quite good.",
    '[0.1, 0.8, 0.2, 0.555]',
    "Food");

Add a full-text index to the table and run OPTIMIZE TABLE to ensure the values above are indexed.

SQL

ALTER TABLE comments ADD FULLTEXT ft_comment(comment);

OPTIMIZE TABLE comments FLUSH;

The following SQL query searches for comments that match the word restaurant and are similar to a query vector ('[0.44, 0.554, 0.34, 0.62]') which is intended to represent an embedding of the user's request. In addition, search is restricted to comments in the category "Food".

The following variable are set prior to defining the query.

@query_vec to hold the vector embedding from the user's request.

Similar to Example 1, the query is structured as follows:

Two Common Table Expressions (CTEs) use indexed full-text and vector search to produce ordered lists of rows that meet the full-text and vector search criteria.
- Both CTEs include only comments in the category "Food".
Two additional CTEs create ranked full-text and vector results.
A combined score is calculated using the RRF formula, and the top 3 results are returned.

The @query_vec variable is cast to a VECTOR to ensure that @query_vec is a valid VECTOR and to improve performance.

SQL

SET @query_vec = ('[0.44, 0.554, 0.34, 0.62]'):>VECTOR(4);

WITH fts AS ( /* Find top full-text matches. */
  SELECT
    id,
    comment,
    MATCH(comment) AGAINST ("restaurant") AS ft_score
  FROM comments
  WHERE MATCH(comment) AGAINST ("restaurant")
    AND category = "Food"
  ORDER BY ft_score desc
  LIMIT 200
),
vs AS ( /* Find top vector search matches. <*> is dot product. */
  SELECT
    id,
    comment,
    @query_vec <*> comment_embedding AS vec_score
  FROM comments
  WHERE category = "Food"
  ORDER BY vec_score desc
  LIMIT 200),
fts_ranked as (
  SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY ft_score DESC) AS fts_rank
  FROM fts
),
vs_ranked as (
  SELECT
    *,
    ROW_NUMBER() OVER (ORDER BY vec_score DESC) as vs_rank
  FROM vs
)
SELECT
  *,
  0.7 * (1.0 / (NVL(fts_rank, 1000.0) + 60)) + 0.3 * (1.0 / (NVL(vs_rank, 1000.0) + 60)) AS combined_score
FROM fts_ranked f FULL OUTER JOIN vs_ranked v ON f.id = v.id
ORDER BY combined_score DESC
LIMIT 3;

*** 1. row ***
            id: 3
       comment: The B24 restaurant salad bar is quite good.
      ft_score: 0.46706151962280273
      fts_rank: 1
            id: 3
       comment: The B24 restaurant salad bar is quite good.
     vec_score: 0.8993000388145447
       vs_rank: 2
combined_score: 0.016314
*** 2. row ***
            id: NULL
       comment: NULL
      ft_score: NULL
      fts_rank: NULL
            id: 1
       comment: The cafeteria in building 35 has a great salad bar
     vec_score: 0.9810000061988831
       vs_rank: 1
combined_score: 0.005578
*** 3. row ***
            id: NULL
       comment: NULL
      ft_score: NULL
      fts_rank: NULL
            id: 2
       comment: I love the taco bar in the B16 cafeteria.
     vec_score: 0.6644233465194702
       vs_rank: 3
combined_score: 0.005422

In the results, the row with id = 3 has the word "restaurant" in it, so it gets a boost from the full-text score. As a result, it becomes the top-ranked item, ranked higher than the comment with id = 1. In contrast, if only the vector score was considered, the comment with id = 1 would be ranked higher than the comment with id = 3.

Blog Post: Why Your Vector Database Should Not Be a Vector Database
Blog Post: Announcing SingleStore Indexed ANN Vector Search
Blog Post: Hybrid Search: Vector + Full-Text Search
Working with Vector Data
Working with Full-Text Search
Tuning Vector Indexes and Queries
Vector Type
Vector Indexing

Hybrid Search - Re-ranking and Blending Searches

On this page

Reciprocal Rank Fusion

Examples

Example 1 - Hybrid Search of Wikipedia Articles

Example 2 - Hybrid Search of Comments

Was this article helpful?

On this page

Was this article helpful?

Hybrid Search - Re-ranking and Blending Searches

On this page

Reciprocal Rank Fusion

Examples

Example 1 - Hybrid Search of Wikipedia Articles

Example 2 - Hybrid Search of Comments

Related Topics

Was this article helpful?

On this page

Was this article helpful?