Configuring Full Text and Vector Indexes
Configuring SingleStore correctly for vector and full-text search is important for achieving maximum performance from these searches.
This document is intended for readers with an understanding of SingleStore parallel query execution.
Configure for High Throughput / High QPS
For high-throughput workloads where the goal is to maximize Queries Per Second (QPS), the key configuration is to use fewer, larger segments and to keep segment sizes even.
Concept
Vector and full-text search performance is highly related to the number of index segments scanned.
Conceptually, to minimize the number of segments scanned by each query, the configuration should use one database partition on each leaf node, the minimum number of segments within that partition, and one thread on each leaf node scanning that partition.
To do so, the following configurations are needed:
- Use large segments: Allow columnstore segments to be as large as possible by setting internal_columnstore_max_uncompressed_blob_size and columnstore_segment_rows to their maximum possible values.
- Use few partitions and eliminate subpartitions: Use one partition per leaf node (or two partitions if using a small, single-leaf or two-leaf system) and eliminate subpartitions.
- Evenly distribute data among partitions: Choose a shard key that will evenly distribute data among partitions.
- Turn down flexible parallelism: Set flexible parallelism so that one thread scans each partition. This setting reduces per-query overhead and helps performance when the majority of the query execution work is index searches.
Recommended Configuration
Note: It is important to create indexes and set the configuration variables before loading data.
To obtain the behavior described above and minimize the number of segments scanned, the recommended configuration settings are as follows:
- Use few partitions and eliminate subpartitions:

  CREATE DATABASE <dbname> PARTITIONS <number of leaf nodes> SUB_PARTITIONS 0

- Use large segments:

  SET GLOBAL internal_columnstore_max_uncompressed_blob_size = 10737418240

  10,737,418,240 is the maximum possible value of internal_columnstore_max_uncompressed_blob_size.

  Caution: It can take up to 3 times the amount of memory specified in this variable to create a blob. If you get OOM errors, reduce this value or scale up your RAM (say with a 2x or 4x larger cloud instance).

  SET GLOBAL columnstore_segment_rows = 10000000

  10,000,000 is the maximum value of columnstore_segment_rows.

- Evenly distribute data: Select a shard key that will evenly distribute data across partitions.

- Turn down flexible parallelism:

  SET query_parallelism_per_leaf_core = 0.01

  With this configuration, the engine uses one thread per partition. This is a session variable and must be set for each connection.
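Taken together, the settings above can be collected into a single setup script. The following is a sketch, assuming an 8-leaf cluster and a placeholder database name:

```sql
-- Run before loading data; the database name and leaf count are placeholders.
CREATE DATABASE vectordb PARTITIONS 8 SUB_PARTITIONS 0;

-- Use the largest possible segments.
SET GLOBAL internal_columnstore_max_uncompressed_blob_size = 10737418240;
SET GLOBAL columnstore_segment_rows = 10000000;

-- Session variable: one scan thread per partition (repeat per connection).
SET query_parallelism_per_leaf_core = 0.01;
```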
Data Distribution Notes
When tuning for QPS, do not increase the partition count if doing so shrinks the segment size.
However, when the data is very large, that is, the number of rows is much greater than (number of partitions) * (segment size), there will be many segments in each partition. After running the OPTIMIZE TABLE FULL command, the segments will have roughly even sizes, except for the tail segment.
Thus, if the data size is large enough that there is more than one maximum-sized segment per partition, you can increase the partition count without affecting QPS.
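The sizing rule above can be sketched as a quick check. This is illustrative arithmetic, not part of SingleStore; the function name and inputs are hypothetical:

```python
# Maximum value of columnstore_segment_rows.
MAX_SEGMENT_ROWS = 10_000_000

def can_increase_partitions(total_rows: int, new_partition_count: int,
                            segment_rows: int = MAX_SEGMENT_ROWS) -> bool:
    """Return True if each partition would still hold more than one
    maximum-sized segment, so increasing the partition count does not
    shrink segments or hurt QPS."""
    return total_rows / new_partition_count > segment_rows

print(can_increase_partitions(400_000_000, 16))  # many full segments per partition
print(can_increase_partitions(40_000_000, 16))   # segments would shrink
```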
Verification and Optimization
- Run the OPTIMIZE TABLE FULL command after the index is built to flush all index segments to disk.
- After the initial configuration is set, verify that there are no or few small segments, using the command below. If there are small segments, consider a change to the shard key or the number of partitions.

  SELECT DATABASE_NAME, TABLE_NAME, ORDINAL AS PARTITION_ID, ROWS, MEMORY_USE
  FROM INFORMATION_SCHEMA.TABLE_STATISTICS
  WHERE TABLE_NAME = '<table_name>';

- Avoid compaction during data ingestion; run the OPTIMIZE TABLE FULL command after data has been loaded into the table.
Workload
The configuration above is designed for workloads in which:
- Multiple queries are run in parallel. When flexible parallelism is turned down so that there is one thread per partition (per leaf), a concern is that extra vCPUs will go unused; when multiple queries run in parallel, those extra vCPUs are used.
- Queries do not have highly selective predicates. When queries do not have highly selective predicates, segment elimination is not effective, so a small number of large segments is a reasonable configuration.
Additional Configurations and Workload Considerations
Segment Size Tradeoffs
The index configuration (segment size) needs to be matched to the workload.
In SingleStore, vector and full-text indexes are built per (columnstore) segment.
While large segment sizes are good for QPS and for search performance, smaller segments may be appropriate for workloads with updates or workloads with highly selective queries. Queries with highly selective WHERE filters may benefit from smaller segments because segment elimination is applied before vector or full-text index searches.
When indexes are updated regularly, when latency is important and high parallelism is useful, or when there are highly selective queries, appropriate segment size needs to be considered.
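As an illustration, a vector query with a selective filter only consults the vector index on segments that survive elimination. The table, column, and variable names below are hypothetical:

```sql
-- Hypothetical schema: if events is sharded/sorted so that event_date is
-- correlated with segments, this filter eliminates most segments before
-- the vector index search runs.
SELECT id, embedding <*> @query_vec AS score
FROM events
WHERE event_date = '2025-01-15'
ORDER BY score DESC
LIMIT 10;
```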
Segment Size Tradeoffs for Update Performance
There is a tradeoff between full-text and vector index search performance and index update rate.
To increase the index update rate, consider reducing the segment size or increasing the number of partitions.
Segment Size Tradeoffs for Highly Selective Queries
Similarly, there is a tradeoff between full-text and vector index search performance and segment elimination.
Segment elimination is the process whereby the system avoids scanning segments that do not match the filter in the WHERE clause of a query.
That is, queries with highly selective predicates may benefit from an increased number of partitions and segments.
Refer to Choosing a Columnstore Key for information on segment elimination.
Adjust the Number of Segments and Partitions
Segment size in SingleStore is managed through the variables:
- internal_columnstore_max_uncompressed_blob_size (maximum value 10,737,418,240)
- columnstore_segment_rows (maximum value 10,000,000)
As noted above, it is often best to set these to larger-than-default values, unless the workload has frequent updates or highly selective WHERE clause filters. The number of partitions is set in the CREATE DATABASE statement as recommended earlier.
Parallel Scan and Flexible Parallelism
SingleStore is a powerful parallel database which supports a spectrum of parallelism; different levels of parallelization are effective for different types of workloads.
Flexible Parallelism
Flexible parallelism allows multiple threads to scan a single partition, increasing the parallelism (number of cores) used for a query.
If you have multiple queries running concurrently, additional per-query parallelism may not help; parallelism is controlled by the query_parallelism_per_leaf_core variable.
Turning down parallelism by decreasing query_parallelism_per_leaf_core helps throughput by reducing the overhead of starting and synchronizing threads. It also avoids running ORDER BY ... LIMIT queries on more threads, which could increase the total number of rows decoded and transmitted.
Flexible Parallelism Configuration
When doing a parallel (non-indexed) scan:
- If (flexible) parallelism is set to one (query_parallelism_per_leaf_core = 1), there will be one thread per CPU (vCPU) core and each thread will scan different sub-partitions in parallel.
- If (flexible) parallelism is set lower (query_parallelism_per_leaf_core < 1), fewer threads will be used and each thread will scan multiple sub-partitions.
Setting query_parallelism_per_leaf_core = 0.01 will give one thread per partition, which is ideal for high-throughput search workloads.
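A rough model of the thread counts described above can be written down directly. This is a simplified assumption about the engine's behavior, not its actual scheduling logic:

```python
import math

def scan_threads_per_leaf(vcpu_cores: int,
                          parallelism_per_leaf_core: float,
                          partitions_per_leaf: int) -> int:
    """Assumed model: the engine targets roughly
    cores * query_parallelism_per_leaf_core scan threads per leaf,
    and never uses fewer than one thread per partition."""
    target = math.ceil(vcpu_cores * parallelism_per_leaf_core)
    return max(target, partitions_per_leaf)

print(scan_threads_per_leaf(32, 1.0, 1))   # default: one thread per core
print(scan_threads_per_leaf(32, 0.01, 1))  # tuned: one thread per partition
```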
Configure for Large Data
For huge data, you may want more than one thread per partition to balance response time versus throughput; to do so, increase query_parallelism_per_leaf_core accordingly.
Configure for Response Time
To obtain the best response times for individual queries, use multiple threads per query: leave query_parallelism_per_leaf_core at its default value of 1, or use some relatively high fractional value.
Benchmarking with Fixed Size Data
If your data is fixed size, for example, if you are benchmarking, follow this procedure to get data evenly distributed among partitions.
- Calculate an even distribution of data among partitions and set columnstore_segment_rows to enforce that distribution:

  SET GLOBAL columnstore_segment_rows = ((#rows in data set)/(#partitions)) + buffer

  A value of ~1000 is suggested for buffer.

- OR, if the dataset is larger, that is, ((#rows in data set)/(#partitions)) > 10,000,000:

  SET GLOBAL columnstore_segment_rows = 10000000

  10,000,000 is the maximum value of columnstore_segment_rows.
The buffer value will prevent inadvertent creation of a tiny segment containing the remainder of the data if data is not split perfectly evenly.
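The formula above can be sketched as a small helper. The function name and example values are illustrative only:

```python
# Maximum value of columnstore_segment_rows.
MAX_SEGMENT_ROWS = 10_000_000

def benchmark_segment_rows(total_rows: int, partitions: int,
                           buffer: int = 1000) -> int:
    """One segment's worth of rows per partition plus a small buffer,
    so a slightly uneven split does not spill a tiny tail segment;
    capped at the engine maximum."""
    return min(total_rows // partitions + buffer, MAX_SEGMENT_ROWS)

print(benchmark_segment_rows(8_000_000, 8))    # 1_001_000
print(benchmark_segment_rows(800_000_000, 8))  # capped at 10_000_000
```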
Full-Text Considerations and Optimizations
Below are considerations and optimizations that can improve full-text search performance.
Indexing Limited Full-Text Columns
For full-text indexes, the number of columns indexed can affect search performance, so index only the columns that need to be searched.
Full-Text ORDER BY … LIMIT Optimization
There is a significant optimization that pushes an ORDER BY ... LIMIT clause down into the Java Lucene process if a query is structured in a certain way.
For this optimization to work, the query must follow these rules:
- The WHERE clause must either be empty, or the expression in the WHERE clause must be the same full-text expression that is in the ORDER BY clause.
- The ORDER BY clause must use DESC sort order.
- There must be only one full-text search function (MATCH, BM25, or BM25_GLOBAL) in the query.
You can often "refactor" the FTS part out of a query to meet these rules and employ the optimization as shown in the following example.
Original query:
SELECT
  ucid, company_name, fc.company_description,
  MATCH (TABLE fc) AGAINST ('company_technologies:("font awesome" OR "open graph")') AS _match
FROM Company c
JOIN FTS_Company_new fc USING (ucid)
JOIN Person p USING (ucid)
ANTI_JOIN GlobalExclude_global ge ON ge.PersonId = p.upid
WHERE _match
GROUP BY ucid
ORDER BY _match DESC
LIMIT 25 OFFSET 0;
Rewritten query that benefits from the optimization:
WITH _fts AS (
  SELECT
    ucid, company_description,
    MATCH (TABLE fc) AGAINST ('company_technologies:("font awesome" OR "open graph")') AS _match
  FROM FTS_Company_new fc
  WHERE _match
  ORDER BY _match DESC
  LIMIT 25
)
SELECT ucid, company_name, fc.company_description, fc._match
FROM Company c
JOIN _fts fc USING (ucid)
JOIN Person p USING (ucid)
ANTI_JOIN GlobalExclude_global ge ON ge.PersonId = p.upid
WHERE fc._match
GROUP BY ucid
ORDER BY fc._match DESC
LIMIT 25 OFFSET 0;
Factoring the full-text search away from the joins into a CTE allows the optimization to work.
FTS V2 Architecture and Performance
In FTS v2, the Lucene engine is decoupled from the main database engine and runs in a co-process.
In queries with filters on multiple attributes, consider adding the attributes used in filters to the full-text index so that the full WHERE clause can be moved into the Lucene expression.
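For example, rather than pairing MATCH with a separate SQL predicate on an attribute, that attribute can be added to the full-text index and folded into the Lucene expression. The table and column names below are hypothetical:

```sql
-- The status filter runs inside the Lucene co-process instead of as a
-- separate SQL predicate (hypothetical table and columns).
SELECT id
FROM articles
WHERE MATCH (TABLE articles) AGAINST ('body:("vector search") AND status:published');
```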
Throttling Full-Text Search
If full-text search is taking too much CPU time, it can be throttled by reducing the value of the corresponding fts2_ engine variable. For example, if that variable is set to 6, approximately 6 CPUs on each leaf will be used.
Other Strategies
To get the best performance, be ready to get creative with changes to the application, and even to the data.
Resources
Last modified: January 28, 2025