Important

The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads, which can later be upgraded to SingleStore 9.1.

Configuring Vector Indexes

Configure for High Throughput

For high-throughput workloads where the goal is to maximize queries per second (QPS), it is important to have fewer, larger segments of even size.

In addition, to get high QPS, it helps to change the flexible parallelism settings to use only one thread per leaf node per query. With one partition per leaf node, this means one thread per partition.

Concept

Vector search performance depends strongly on the number of index segments scanned. Hence, to optimize performance, queries should scan as few index segments as possible.

The vector index merger runs automatically in the background to combine per-segment vector indexes and create cross-segment vector indexes. The merger helps minimize the number of segments within a partition, reducing the number of indexes that must be examined for a query and improving vector search performance.

In most cases, the vector index merger is sufficient; however, the maximum combined index size created by the vector index merger is limited by the following factors:

Combined indexes created by the vector index merger contain data from at most four segments.
Segment size is limited by internal_columnstore_max_uncompressed_blob_size and columnstore_segment_rows.

If the vector index merger does not produce sufficiently large indexes, minimize the number of segments scanned by each query by configuring one database partition on each leaf node, the minimum number of segments within that partition, and one thread on each leaf node scanning that partition.

Configurations for High-Performance Search

Note: It is important to create indexes and set the configuration variables before loading data.

To maximize QPS for vector search, use the following configurations in order. For discussions of tradeoffs with respect to segment size and flexible parallelism, refer to Workload Considerations.

Evenly distribute data:
- Select a shard key that will evenly distribute data across partitions.
Use few partitions and eliminate subpartitions:
- CREATE DATABASE <dbname> PARTITIONS <number of leaf nodes> SUB_PARTITIONS 0
Turn down flexible parallelism:
- SET query_parallelism_per_leaf_core = 0.01
  - With this configuration, the engine uses one thread per leaf node.
  - This is a session variable and will need to be set for each connection.
Use larger segments:
- Increase the value of internal_columnstore_max_uncompressed_blob_size.
  - The default value is 512 MB (536870912 bytes) and the maximum value is 10 GB (10737418240 bytes).
  - Caution: Creating a blob can take up to 3 times the amount of memory specified in this variable. If you get out-of-memory (OOM) errors, reduce this value or add RAM (for example, by scaling up to a 2x or 4x cloud instance).
- Increase the value of columnstore_segment_rows.
  - The default number of rows in a columnstore segment is 1,024,000; 10,000,000 is the maximum value.
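
The settings above can be combined into a setup script along the following lines. This is a minimal sketch rather than a definitive procedure: the database name vecdb, the table items, its columns, the four-leaf cluster size, the vector dimension, and the index options are all hypothetical, and it assumes that internal_columnstore_max_uncompressed_blob_size can be changed with SET GLOBAL in your deployment.

```sql
-- Hypothetical cluster with 4 leaf nodes: one partition per leaf, no sub-partitions.
CREATE DATABASE vecdb PARTITIONS 4 SUB_PARTITIONS 0;
USE vecdb;

-- Use larger segments (the maximum values listed above).
-- Assumption: both variables can be set with SET GLOBAL in your deployment.
SET GLOBAL internal_columnstore_max_uncompressed_blob_size = 10737418240;
SET GLOBAL columnstore_segment_rows = 10000000;

-- Create the table and its vector index before loading any data.
CREATE TABLE items (
   id BIGINT NOT NULL,
   embedding VECTOR(4) NOT NULL,
   SHARD KEY (id),   -- a high-cardinality shard key spreads rows evenly
   SORT KEY (id),
   VECTOR INDEX embedding_idx (embedding)
      INDEX_OPTIONS '{"index_type":"HNSW_FLAT", "metric_type":"DOT_PRODUCT"}'
);

-- Session variable: repeat on every connection that runs searches.
SET query_parallelism_per_leaf_core = 0.01;
```

The global variables are set before CREATE TABLE so that the table is created, and later loaded, under the larger segment limits.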

Note

Run the OPTIMIZE TABLE FLUSH command after the index is built to flush all index segments to disk.

Verification and Optimization

Use the query below to check how rows are distributed across partitions, which helps verify that there are few or no small segments. If there are small segments, consider changing the shard key or the number of partitions.

SQL

SELECT
   DATABASE_NAME,
   TABLE_NAME,
   ORDINAL AS PARTITION_ID,
   ROWS,
   MEMORY_USE
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
WHERE TABLE_NAME = '<table_name>';
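
To spot skew at a glance, the same view can be summarized per table. The query below is a sketch that uses only the columns shown in the query above. The alias names are illustrative, and if the view also reports replica partitions in your deployment, those rows are included in the aggregates.

```sql
-- A large gap between the minimum and maximum suggests an uneven shard key.
SELECT
   DATABASE_NAME,
   TABLE_NAME,
   COUNT(*) AS PARTITION_ROWS_REPORTED,
   MIN(ROWS) AS MIN_ROWS_PER_PARTITION,
   MAX(ROWS) AS MAX_ROWS_PER_PARTITION,
   AVG(ROWS) AS AVG_ROWS_PER_PARTITION
FROM INFORMATION_SCHEMA.TABLE_STATISTICS
WHERE TABLE_NAME = '<table_name>'
GROUP BY DATABASE_NAME, TABLE_NAME;
```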

Data Distribution Notes

When tuning for QPS, do not increase the partition count if doing so shrinks the segment size.

However, when the data is very large, that is, when the number of rows is much greater than (number of partitions) * (segment size), each partition contains many segments. In this case, running the OPTIMIZE TABLE FULL command produces segments of roughly even size, except for the tail segment. With many segments, one small segment is insignificant in percentage terms. Avoiding tiny stray segments matters more for small data sets with only one or a few segments per partition.

Thus, if the data is large enough that there is more than one maximum-sized segment per partition, you can increase the partition count without affecting QPS. Increasing the partition count in this case will not reduce the number of segments.

Avoid compaction during data ingestion by running the OPTIMIZE TABLE FULL command after data has been loaded into the table.
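
For example, after a bulk load into a hypothetical table named items, the OPTIMIZE TABLE FULL step could be run as follows. The table name is a placeholder.

```sql
-- Run once after the bulk load completes, not during ingestion.
OPTIMIZE TABLE items FULL;
```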

Workload

The configuration above is designed for workloads in which:

Multiple queries are run in parallel.
- When flexible parallelism is turned down so that there is one thread per leaf node, a concern is that the extra vCPUs will sit idle. When multiple queries run in parallel, these extra vCPUs are used.
Queries do not have highly selective predicates.
- When queries do not have highly selective predicates, segment elimination is not effective, so a small number of large segments is a reasonable configuration.

Workload Considerations

Segment Size Tradeoffs

The index configuration (segment size) needs to be matched to the workload.

In SingleStore, vector indexes are built per (columnstore) segment. Vector indexes span all the sub-partitions in a partition. Large segments typically improve search speed because the cost of searching a large segment is only slightly more than the cost of searching a small segment (search time increases logarithmically with segment size).

While large segment sizes are good for QPS and for search performance, smaller segments may be appropriate for workloads with updates or workloads with highly selective queries. Queries with highly selective WHERE filters may benefit from smaller segments because segment elimination is applied before vector index searches.

Choose the segment size carefully when indexes are updated regularly, when latency is important and high parallelism is useful, or when queries are highly selective.

Segment Size Tradeoffs for Update Performance

There is a tradeoff between vector index search performance and index update rate: large segments improve index search performance, while small segments improve the index update rate. This tradeoff can be managed by adjusting the segment size.

To increase the index update rate, consider reducing the segment size or increasing the number of partitions.

Segment Size Tradeoffs for Highly Selective Queries

Similarly, there is a tradeoff between vector index search performance and segment elimination. Large segments improve index search performance; small segments improve segment elimination.

Segment elimination is the process whereby the system avoids scanning segments that cannot match the filter in the query. With highly selective filters, segment elimination results in much less data being scanned and can significantly improve performance. In this case, smaller segments mean less data is scanned: small segments are more likely to be eliminated, and fewer rows are read from the segments that pass segment elimination. Thus, smaller segments improve performance for queries that combine vector search with highly selective non-vector filters. An example of such a filter is a date range filter in the WHERE clause of a query.

That is, queries with highly selective predicates may benefit from an increased number of partitions and segments. The increased number of smaller segments enables more segment elimination.

Refer to Choosing a Columnstore Key for information on segment elimination.
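
The following sketch illustrates a vector search combined with a selective date range filter. The table events, its columns, the vector dimension, the index options, and the date range are all hypothetical. It assumes the table is sorted on the filtered column so that segments outside the date range can be eliminated.

```sql
-- Hypothetical table: sorting on created_at lets date filters eliminate segments.
CREATE TABLE events (
   id BIGINT NOT NULL,
   created_at DATETIME NOT NULL,
   embedding VECTOR(4) NOT NULL,
   SHARD KEY (id),
   SORT KEY (created_at),
   VECTOR INDEX embedding_idx (embedding)
      INDEX_OPTIONS '{"index_type":"HNSW_FLAT", "metric_type":"DOT_PRODUCT"}'
);

-- Query vector cast to the column's VECTOR type.
SET @query_vec = '[0.1, 0.2, 0.3, 0.4]' :> VECTOR(4);

-- The date range is the selective, non-vector filter. Segments whose
-- created_at range falls entirely outside it do not need to be scanned.
SELECT id, embedding <*> @query_vec AS score
FROM events
WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'
ORDER BY score DESC
LIMIT 10;
```

With smaller segments, each segment covers a narrower created_at range, so a one-month filter is likely to exclude a larger share of the table.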

Adjust the Number of Segments and Partitions

Segment size in SingleStore is managed through the variables:

internal_columnstore_max_uncompressed_blob_size (max value 10,737,418,240)
columnstore_segment_rows (max value 10,000,000)

As noted above, it's often best to set these to larger-than-default values (e.g. the maximum values) to improve vector and text search performance, especially if you don't have selective standard WHERE clause filters. You can also choose a smaller-than-default number of partitions in your CREATE DATABASE statement as recommended earlier.

Parallel Scan and Flexible Parallelism

SingleStore is a powerful parallel database which supports a spectrum of parallelism; different levels of parallelization are effective for different types of workloads. In general, to optimize for throughput or high queries per second and for workloads with many relatively small queries, flexible parallelism should be turned down. To optimize for response time, flexible parallelism should be turned up.

Flexible Parallelism

Flexible parallelism allows multiple threads to scan a single partition, increasing the parallelism (number of cores) used for a query. It allows all cores to work on a particular query even if there are more cores than partitions, which decreases query latency. More specifically, flexible parallelism allows each core to scan a sub-partition, so a query can use multiple threads per leaf node. This is great for certain workloads, but not for all workloads.

If you have multiple queries running concurrently, parallelism may not help. Multiple queries can use multiple cores, so there is less need for parallelism. However, if you have fewer queries running, parallelism can help improve performance. At a high level, if you have multiple concurrent queries, turn down the parallelism, but if you have few large queries running, turn up the parallelism. Flexible parallelism is managed with the engine variable query_parallelism_per_leaf_core.

Turning down parallelism by decreasing query_parallelism_per_leaf_core helps throughput by reducing overhead for starting and synchronizing threads. More importantly, turning down parallelism can also reduce total work by avoiding having multiple threads process the same segment, potentially each doing an index search independently, or doing ORDER BY ... LIMIT queries on more threads which could increase the total number of rows decoded and transmitted. This extra decoding and transmission can drive up the total CPU cycles needed to run a query.

Flexible Parallelism Configuration

When doing a parallel (non-indexed) scan:

If (flexible) parallelism is set to one (query_parallelism_per_leaf_core = 1), then there will be one thread per CPU (vCPU) core and each thread will scan different sub-partitions in parallel.
If (flexible) parallelism is set lower (query_parallelism_per_leaf_core < 1), then fewer threads will be used and each thread will scan multiple sub-partitions. Setting query_parallelism_per_leaf_core = 0.01 will give one thread per leaf node, which is ideal for high-throughput search workloads.

Configure for Large Data

For huge data, you may want more than one thread per leaf node to balance response time versus throughput. If throughput is still your top concern, you can still limit the threads per leaf node by setting query_parallelism_per_leaf_core = 0.01.

Configure for Response Time

To obtain the best response times for individual queries, use multiple threads per query. Leave query_parallelism_per_leaf_core at its default value of 1.0 or set it to a relatively high value such as 0.5 or 0.25. Higher values give each query more threads at the cost of more thread overhead, and this tradeoff is managed through query_parallelism_per_leaf_core.
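
The throughput-oriented and response-time-oriented settings discussed in these sections can be sketched as the following session-level statements. The values are the ones named in the text. Which one suits a given workload is a judgment call that should be validated by measurement.

```sql
-- Inspect the current value for this session.
SELECT @@query_parallelism_per_leaf_core;

-- Throughput-oriented: one thread per leaf node per query.
SET query_parallelism_per_leaf_core = 0.01;

-- Response-time-oriented: the default of 1.0, or an intermediate value.
SET query_parallelism_per_leaf_core = 1.0;
SET query_parallelism_per_leaf_core = 0.5;
```

Because this is a session variable, each connection that runs searches needs to issue its own SET statement.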

Benchmarking with Fixed Size Data

If your data is fixed size, for example, if you are benchmarking, follow this procedure to get data evenly distributed among partitions.

Calculate an even distribution of data among partitions and set columnstore_segment_rows to enforce that distribution.
- SET GLOBAL columnstore_segment_rows = ((#rows in data set)/(#partitions)) + buffer
  - Suggest value of ~1000 for buffer.
- Or, if the data set is larger, that is, ((#rows in data set)/(#partitions)) > 10,000,000:
  - SET GLOBAL columnstore_segment_rows = 10000000
  - 10,000,000 is the maximum value of columnstore_segment_rows

The buffer value will prevent inadvertent creation of a tiny segment containing the remainder of the data if data is not split perfectly evenly. A tiny segment can hurt throughput since it will have to be searched independently by a thread, and that could take noticeable time for index search startup. In addition, index search time is logarithmic, so more segments (and thus more indexes) are virtually always worse for throughput.
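
As a worked example of the calculation above, assume a hypothetical data set of 10,000,000 rows, 4 partitions, and a buffer of 1,000. Then 10,000,000 / 4 + 1,000 = 2,501,000, which is below the 10,000,000 maximum. The sketch below computes the same value in SQL and caps it at the maximum. The input numbers are illustrative only.

```sql
-- Hypothetical inputs: 10,000,000 rows, 4 partitions, buffer of 1,000.
-- LEAST caps the result at the 10,000,000 maximum for columnstore_segment_rows.
SELECT LEAST(FLOOR(10000000 / 4) + 1000, 10000000) AS suggested_segment_rows;

-- Apply the computed value (2,501,000 for these inputs) before loading data.
SET GLOBAL columnstore_segment_rows = 2501000;
```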