Important
The SingleStore 9.1 release candidate (RC) gives you the opportunity to preview, evaluate, and provide feedback on new and upcoming features prior to their general availability. In the interim, SingleStore 9.0 is recommended for production workloads, which can later be upgraded to SingleStore 9.1.
Configuring Full-Text Indexes
On this page
This topic provides information on how to tune ingest and search on tables with full-text version 2 indexes.
Tune Ingest on Tables with Full-Text Indexes
This section focuses on tuning tables with full-text version 2 indexes for high‑throughput ingest.
Key Recommendations
-
Set the number of partitions in a database based on query-parallelism needs, not by ingest throughput concerns.
-
Use the full-text index threadpool to maintain ingest throughput at lower partition counts.
Partition Impact
-
Fewer partitions per leaf (more cores per partition) is generally recommended for better query performance with flexible parallelism.
-
More partitions create more background flusher threads, which may help ingest throughput.
However, SingleStore recommends choosing the number of partitions based on query needs, unless ingest performance is of primary importance.
Note
SingleStore recommends addressing ingest issues by using the full-text index threadpool and periodic calls to OPTIMIZE TABLE .
Full-Text Index Ingestion and Creation Process
Full-text indexes are built using the following process.
-
The rows are inserted into the table's rowstore segments.
-
The rowstore segments are flushed to disk to create columnstore segments.
-
Full-text indexes are created when the segments are flushed.
The flushes may occur due to:
Version 2 full-text indexes are built for each column with a full-text index defined on it when the associated row segment is flushed to disk and the columnstore segment is created.
Configure Index Build
Full-text index build is affected by several mechanisms including the full-text index threadpool, the full-text index merger, the flush threads, and the background merger.
-
Full-Text Index Threadpool: When the full-text index threadpool is enabled with
fts2_, each indexing request is decomposed into smaller tasks that are run by a threadpool; otherwise a single thread handles an entire segment’s indexing.use_ index_ threadpool -
Full-text index merger: The full-text index merger combines per-segment full-text indexes to create cross-segment full-text indexes.
The merged indexes can improve the performance of full-text search queries by reducing the number of indexes that must be examined for a query. The full-text index merger is enabled with enable_.fts2_ merger -
Concurrent Flush Threads: Multiple flush threads can send concurrent indexing requests to JLucene (SingleStore's Lucene‑based full-text service).
The concurrency of these requests is bounded by fts2_.max_ connections From the engine’s perspective, each indexing request (sent by a flush thread) is synchronous. -
Background Merger: The background merger re‑indexes full-text data when it merges columnstore segments.
This background merging uses compute resources for full-text indexing. -
If flushing is too frequent, which may be caused by overly frequent calls to
OPTIMIZE TABLE … FLUSH, or low settings ofcolumnstore_orflush_ bytes columnstore_, the engine creates many small segments, which increases background re‑indexing work.segment_ rows
-
Refer to Tune Ingest for information on setting these variables.
Key Engine Variables
This table describes the function of key engine variables that impact full-text index creation.
|
Engine Variable / Command |
Description of Function |
|---|---|
|
|
Enables parallelization of indexing tasks for a given flush request. |
|
|
Controls the size of the full-text indexing thread pool. |
|
|
Enables the full-text index merger. |
|
|
Specifies the upper limit on the size of an index that can be merged by the full-text index merger. |
|
|
Upper bound on concurrent JLucene indexing requests from flush threads; increase carefully to match CPU budget and avoid oversubscription. |
|
|
Threshold for flushing rowstore‑backed data to columnstore. |
|
|
Alternative threshold for flushing rowstore-backed data to columnstore that triggers flushing by row count; coordinate with flush bytes to control segment sizes and flush cadence. |
|
|
Manual flush to help background flusher keep up with ingest. |
Tune Ingest
-
Start with fewer partitions aligned to query performance (e.
g. , >1 core/partition), then turn on the full-text index threadpool. Verify ingest uniformity and determine whether throttling delays ingest as described in the Monitor Ingest section. -
If ingest still falls behind, schedule periodic calls to OPTIMIZE TABLE FLUSH during ingest to allocate more resources to flushing.
Tune the frequency to avoid producing many small segments. -
Increase
columnstore_to form larger segments and reduce flush overhead.flush_ bytes -
Calibrate
fts2_so multiple flush threads can index concurrently without CPU oversubscription.max_ connections Watch the Java (JLucene) CPU to ensure indexing is active and healthy. -
Keep the
fts2_fixed while testing different partition counts to determine whether the bottleneck is in flushing or indexing.index_ threadpool_ size If higher partitions help while the threadpool is turned off, the system is flushing‑bound, and indexing is not throttling ingest. SingleStore recommends increasing flush resources (by increasing the OPTIMIZE TABLE .cadence) instead of increasing the partition to node ratio, which permanently fractures data.. . FLUSH -
SingleStore recommends starting with
fts2_set to the number of cores on a leaf node.index_ threadpool_ size
-
-
Once ingest is stable, iterate towards fewer partitions by relying on the indexing threadpool and a measured
OPTIMIZE TABLE .schedule to preserve query performance while sustaining ingest throughput.. . FLUSH
Monitor Ingest and Obtain Diagnostics
-
Track pending flush pressure by observing columnstore ingest management metrics such as
columnstore_.ingest_ management_ estimated_ segments_ to_ flush A sharp rise in the value of this variable towards a threshold (e. g. , around cores × 10) typically precedes throttling. -
Compare full-text index sizes and counts per segment using the following query.
SELECT database_name, table_name, node_id, sizeFROM information_schema.MV_COLUMNSTORE_FILESWHERE file_type = 'FULLTEXT'; -
Watch the CPU usage of the Java process (JLucene) during flush; high utilization indicates indexing is active and can explain differences across runs.
Full-Text Index Management
In SingleStore, a full-text index is built for each columnstore segment.ALTER command.
Monitor Index Build Status
For large tables, there are often multiple segments per partition.
SHOW FULLTEXT INDEX STATUS FOR <table_name>;
SELECT *FROM information_schema.MV_FULLTEXT_INDEX_STATUSWHERE DATABASE_NAME = "<database_name>"AND TABLE_NAME = "<table_name>";
Note: While full-text indexes are initially created per-segment, those indexes may later be merged.
Broken Indexes
In rare situations, an index creation may fail due to issues such as insufficient disk space or memory.
If an index creation failure occurs, the index will be flagged as "broken" and an error message is returned when a user searches against the broken index.
To identify broken indexes, use the SHOW FULLTEXT INDEX STATUS command or query the MV_
SHOW FULLTEXT INDEX STATUS FOR <table_name>;
SELECT *FROM information_schema.MV_FULLTEXT_INDEX_STATUSWHERE DATABASE_NAME = "<database_name>"AND TABLE_NAME = "<table_name>"AND BROKEN_PERCENT > 0;
To attempt to fix a broken index, run the OPTIMIZE TABLE … FIX_ command.
OPTIMIZE TABLE <table_name> FIX_FULLTEXT;
However, full-text indexes cannot be repaired by the system in all cases, including:
-
There is an invalid path for the full-text JLucene service.
-
The Java runtime used by the JLucene service runs out of memory.
Consider increasing the value of the fts2_global variable if this occurs.max_ memory_ mb -
The size of the data being exchanged between the engine and the JLucene process is too large.
In this case, consider increasing the value of the fts2_global variable.max_ service_ buffer_ mb -
A term to be indexed is longer than the maximum length of
32766.This is a limitation of JLucene and may be addressed by truncating or breaking up tokens in the data itself.
Debug Full-Text Index Tokens
To debug full-text search query behavior, it can be helpful to understand how the input documents are broken into tokens.
The tokens that will be generated for a full-text index can be viewed using the ANALYZE FULLTEXT command.INDEX_ from the relevant index definition as the OPTIONS for the ANALYZE FULLTEXT command.
Memory Usage
The Alloc_ status variable is the number of allocated bytes (out of the total max_) that is attributable to the next-generation (VERSION 2) full-text subprocess.
View the value of Alloc_ on the leaf nodes with the SHOW LEAF STATUS EXTENDED command.
SHOW LEAF STATUS EXTENDED LIKE '%fts%';
The value of Alloc_ on the aggregator nodes can be viewed with the SHOW STATUS EXTENDED command.Alloc_ variable will have a value on the aggregator nodes if a search has been run using the BM25_ search function.BM25_ search has been used.
SHOW STATUS EXTENDED LIKE '%fts%';
Full-Text Index Merger
When a full-text index is built, a table-level index is created for the text column.
The cost of full-text search is driven by the number of indexes searched.
The full-text index merger:
-
Is enabled with the
enable_engine variable.fts2_ merger SET GLOBAL enable_fts2_merger = ON; -
Can merge any number of smaller indexes.
-
Has an upper limit on the size of an index that can be merged, controlled by
fts2_.merge_ soft_ limit_ mb This limit is not strictly enforced and may be exceeded in some cases. -
Is used when the query is able to take advantage of the
ORDER BY LIMIToptimization.
SingleStore recommends using the largest possible fts2_ and nodes big enough to handle that size of index.
The full-text index merger automatically runs in the background to continuously improve search performance.OPTIMIZE TABLE . command is run.
OPTIMIZE TABLE <table_name> FULL;
Full-Text Index Threadpool
The full-text index threadpool uses multiple threads to build full-text indexes.
Two engine variables are available to configure the threadpool.
|
Engine Variable |
Description of Function |
|---|---|
|
|
Enables parallelization of indexing tasks for a given flush request. |
|
|
Controls the size of the full-text indexing thread pool. |
SingleStore recommends setting fts2_ to the number of cores on a leaf node.
Full-Text Search Performance
Following are a series of considerations and optimizations to consider for full-text search performance.
Query Optimization
Indexing Limited Full-Text Columns
For full-text indexes, the number of columns that are indexed can affect search performance.
Full-Text ORDER BY … LIMIT Optimization
There is a significant optimization that pushes an ORDER BY…LIMIT clause down into the Java Lucene process if a query is structured in a certain way.
For this optimization to work, the query must follow these rules:
-
The
ORDER BYclause must useDESCsort order. -
There must be only one full-text search function (
MATCH,BM25, orBM25_) in the query.GLOBAL
The full-text portion of a query can often be refactored out of a query to meet these rules and employ the optimization as shown in the following example.
Original query:
SELECTucid, company_name, fc.company_description,MATCH (TABLE fc)AGAINST ('company_technologies:("font awesome" OR "open graph")') AS _matchFROM Company cJOIN FTS_Company_new fc USING (ucid)JOIN Person p USING(ucid)ANTI_JOIN GlobalExclude_global ge ON ge.PersonId = p.upidWHERE _matchGROUP BY ucidORDER BY _match DESCLIMIT 25 OFFSET 0;
Rewritten query that benefits from the optimization:
WITH _fts AS (SELECTucid, company_description,MATCH (TABLE fc)AGAINST ('company_technologies:("font awesome" OR "open graph")')AS _matchFROM FTS_Company_new fcWHERE _match ORDER BY _match desc limit 25)SELECT ucid, company_name, fc.company_description, fc._matchFROM Company cJOIN _fts fc USING (ucid)JOIN Person p USING(ucid)ANTI_JOIN GlobalExclude_global ge ON ge.PersonId = p.upidWHERE fc._matchGROUP BY ucidORDER BY fc._match DESCLIMIT 25 OFFSET 0;
Factoring the full-text search away from the joins into a CTE allows the optimization to work.
Add Attributes Used in Filters to Full-Text Index
In full-text search (FTS) v2, the Lucene engine is decoupled from the main database engine and runs in a co-process.
In queries with filters on multiple attributes, consider adding attributes that are used in filters to the full-text index so that the full WHERE clause can be moved into the Lucene expression.
Caching
Full-Text Index Cache
The full-text index cache stores full-text index data in memory to improve search performance.
Enable the full-text index cache with the fts2_ engine variable.
SET GLOBAL fts2_cache_enable = ON;
The fts2_ engine variable controls the size of the cached blobs.
Stop the full-text service using the FULLTEXT SERVICE STOP command after changing the value of fts2_ to ensure the change takes effect.
Statistics Cache in Full-Text Search Service
The statistics cache in the full-text V2 service is designed to store frequently accessed term and collection statistics used during full-text scoring computations, particularly for BM25 and BM25_
Enable the full-text statistics cache with the following command:
SET GLOBAL fts2_stats_cache_enable = TRUE;
The following engine variables are used to configure the full-text statistics cache:
-
fts2_: Specifies the maximum number of index blobs (segments) that can be stored in the statistics cache at any given time.stats_ cache_ max_ size Helps manage memory usage by limiting the number of cached statistics, ensuring the system does not overuse memory. -
fts2_: Specifies the maximum number of term statistics that can be cached for each index blob (segment).stats_ cache_ max_ terms Controls the granularity of cached data, balancing memory consumption against query performance for term-based scoring models like BM25.
Warm the Cache for Full-Text Search
The disk cache can be warmed for a full-text search by running a query such as the following which will cause the full-text indexes to be downloaded into the local disk cache.
SELECT COUNT(*)FROM table_nameWHERE MATCH(TABLE table_name) AGAINST ('valid_column:random_term');
Resource Management
Throttle Full-Text Search
If full-text search is taking too much CPU time, full-text search can be throttled by reducing the value of the fts2_ engine variable.fts2_ is set to 6, that means that approximately 6 CPUs on each leaf will be used.
Other Strategies
To get the best performance, get creative with application and even data changes.
Segment Size Impacts
Segment size impacts search and ingest.
-
Large segments typically improve search speed because the cost of searching a large segment is only slightly more than the cost of searching a small segment (search time increases logarithmically with segment size).
-
Large segment sizes are typically good for Queries Per Second (QPS) and for search performance.
-
Smaller segments may be appropriate for workloads with updates or workloads with highly selective queries.
-
Queries with highly selective
WHEREfilters may benefit from smaller segments because segment elimination is applied before full-text index searches.
-
-
Large segments improve index search performance; small segments improve segment elimination.
When indexes are updated regularly, when latency is important and high parallelism is useful, or when there are highly selective queries, appropriate segment size needs to be considered.
There is a tradeoff between search performance and index update rate.
Segment size in SingleStore is managed through the variables internal_ and columnstore_.
Note
Note: Before reducing segment size to increase the ingest rate, turn on the full-text index threadpool.
Segment Elimination
Segment elimination is the process whereby the system avoids scanning segments that do not meet the filter in the query.WHERE clause of a query.
That is, queries with highly selective predicates may benefit from an increased number of partitions and segments.
Refer to Choosing a Columnstore Key for information on segment elimination.
Parallel Scan and Flexible Parallelism
SingleStore is a powerful parallel database which supports a spectrum of parallelism; different levels of parallelization are effective for different types of workloads.
Flexible Parallelism
Flexible parallelism allows multiple threads to scan a single partition, increasing the parallelism (number of cores) used for a query.
If you have multiple queries running concurrently, parallelism may not help.query_.
Turning down parallelism by decreasing query_ helps throughput by reducing overhead for starting and synchronizing threads.ORDER BY . queries on more threads which could increase the total number of rows decoded and transmitted.
Flexible Parallelism Configuration
When doing a parallel (non-indexed) scan:
-
If (flexible) parallelism is set to one (
query_), then there will be one thread per CPU (vCPU) core and each thread will scan different sub-partitions in parallel.parallelism_ per_ leaf_ core = 1 -
If (flexible) parallelism is set lower (
query_), then fewer threads will be used and each thread will scan multiple sub-partitions.parallelism_ per_ leaf_ core < 1 Setting query_will give one thread per leaf node, which is ideal for high-throughput search workloads.parallelism_ per_ leaf_ core = . 01
Configure for Large Data
For huge data, you may want more than one thread per leaf node to balance response time versus throughput.query_.
Configure for Response Time
To obtain the best response times for individual queries, use multiple threads per query.query_ at its default value of 1. or some relatively high value such as 0. or 0..
Last modified: