Rebalance Failure Impact and Cleanup
Rebalancing failures affect a SingleStore cluster and may require cleanup procedures for each of the following phases:
- COPY PARTITION
- PROMOTE PARTITION (WITH or WITHOUT DROP)
- DROP PARTITION
The details below are for recovery from rebalance failures during typical planned maintenance scenarios (scale‑out, node replacement, etc.).
Rebalance Phases: Conceptual Overview
Rebalance reorganizes database partitions to balance the number of partitions on leaf nodes and to ensure redundancy.
The copy, promote, and drop phases are performed for all partitions in a database:

- Copy all partitions of the database that need to be copied.
- Promote all partitions of the database that need promotion.
- Drop all partitions of the database that are no longer needed.
The rebalance plan can be seen with the EXPLAIN REBALANCE PARTITIONS command.
A full rebalance (REBALANCE ALL DATABASES / REBALANCE PARTITIONS) conceptually proceeds in three high‑level steps for each affected partition:
- COPY PARTITION
  - Create an additional copy of a partition on a target leaf.
  - Source and target both exist; metadata is prepared, but primary/replica roles are not yet switched.
- PROMOTE PARTITION (WITH or WITHOUT DROP)
  - Switch roles; for example, make the new copy the master and adjust replication.
  - If a partition that was a master before the rebalance is being dropped by the rebalance, PROMOTE PARTITION is called using WITH DROP, thereby dropping the old master immediately instead of in step 3.
- DROP PARTITION
  - If the partition was a slave before the rebalance, it is dropped in this step by removing the old, now‑redundant copy once the new placement is fully active and in sync.
  - Dropping a partition reduces extra storage usage and finalizes the new topology.
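The three phases above can be sketched as role transitions on a partition's placements. This is an illustrative model only (the function and leaf names are hypothetical, not SingleStore internals), assuming a partition moving from one leaf to another:

```python
# Hypothetical sketch of the three rebalance phases for one partition.
# A placement map {leaf: role} stands in for the real cluster metadata.

def copy_partition(placements, target_leaf):
    """Phase 1: create an extra replica on the target leaf."""
    placements[target_leaf] = "replica"
    return placements

def promote_partition(placements, new_master, with_drop=False):
    """Phase 2: switch roles; WITH DROP removes the old master immediately."""
    old_master = next(l for l, role in placements.items() if role == "master")
    placements[new_master] = "master"
    if with_drop:
        del placements[old_master]      # old master dropped in this phase
    else:
        placements[old_master] = "replica"
    return placements

def drop_partition(placements, leaf):
    """Phase 3: remove a now-redundant copy."""
    del placements[leaf]
    return placements

# One partition moving from leaf1 to leaf2:
p = {"leaf1": "master"}
p = copy_partition(p, "leaf2")     # both copies exist, roles unchanged
p = promote_partition(p, "leaf2")  # leaf2 becomes master, leaf1 demoted
p = drop_partition(p, "leaf1")     # leftover copy removed
print(p)  # {'leaf2': 'master'}
```

Note that with `with_drop=True` the third phase is unnecessary for that partition, matching the WITH DROP behavior described above.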
Failures can occur in any of these steps.
Failure during COPY PARTITION
Typical causes include:

- External interruption (e.g., KILL CONNECTION/KILL QUERY, kill script).
- Resource exhaustion (e.g., out of memory, I/O saturation).
- Network or storage errors during data transfer.
- Lock timeouts from concurrent DDL statements (such as OPTIMIZE TABLE).
Impact on the cluster:

- The source (original) partition remains intact as the authoritative copy.
- The target copy may be incomplete or discarded. REBALANCE may report failure, or in some edge cases the high‑level command may have failed while some internal actions (such as creating new copies) succeeded.
- Metadata may reflect no new copy (copy discarded), or more replicas than expected for a given partition.
Critically, user data remains safe on the original master; the risk is mainly wasted time and temporary extra storage, not corruption.
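One quick sanity check after a failed copy phase is to count placements per partition and flag any partition with more copies than the configured redundancy. A minimal sketch over hypothetical metadata rows (the tuple shape and redundancy value are illustrative, not the actual information_schema layout):

```python
from collections import Counter

# Hypothetical metadata rows: (partition_id, role).
rows = [
    (0, "Master"), (0, "Slave"),
    (1, "Master"), (1, "Slave"), (1, "Slave"),  # one extra copy left behind
    (2, "Master"), (2, "Slave"),
]

REDUNDANCY = 2  # assumed: one master + one replica per partition

copies = Counter(pid for pid, _ in rows)
extra = [pid for pid, n in copies.items() if n > REDUNDANCY]
print(extra)  # [1]
```

Partitions listed in `extra` have leftover copies; they waste disk space but, as noted above, do not threaten the data on the original master.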
Failure during PROMOTE PARTITION (WITH/WITHOUT DROP)
Typical causes include:

- Timeouts in internal synchronization (for example, SYNC_PARTITIONS, WaitForLSNWithTimeout).
- Locks or internal dependencies that prevent fast role transition.
- Resource pressure (CPU, memory, or merger activity) slowing replication or state updates.
Impact on the cluster:

- The copy phase may have succeeded, so both old and new copies exist.
- The old master remains the authoritative partition; the new copy remains a secondary/unpromoted replica or is left unused.
- Data integrity is preserved; the cluster continues using the old placement.
- In some cases there may be extra replicas or unused copies, but never fewer than what is required for redundancy.
Failure during DROP PARTITION
Typical causes include:

- Errors when cleaning up old copies (for example, filesystem issues, transient node problems).
- The operation is interrupted after the new placement is live but before old copies are fully dropped.
Impact on the cluster:

- The new partition placement is already active and serving queries.
- Old copies remain on disk (and may still be visible as extra replicas or orphans).
- This primarily affects storage usage (more disk consumed than necessary) and operational clarity (extra copies may appear in low‑level views).
- Database availability and data correctness are not impacted.
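Leftover copies from a failed drop phase can be identified by comparing the actual placements against the intended post-rebalance plan. A sketch under assumed inputs (both sets are hypothetical; in practice the desired plan would come from EXPLAIN REBALANCE output and the actual placements from cluster metadata):

```python
# Hypothetical (partition_id, leaf) placement sets.
actual = {
    (0, "leaf1"), (0, "leaf2"),
    (1, "leaf1"), (1, "leaf2"),
    (1, "leaf3"),  # old copy the drop phase failed to remove
}
desired = {
    (0, "leaf1"), (0, "leaf2"),
    (1, "leaf1"), (1, "leaf2"),
}

# Anything present in actual but not in the desired plan is a leftover copy.
leftovers = sorted(actual - desired)
print(leftovers)  # [(1, 'leaf3')]
```

Since the new placement is already live, such leftovers only cost disk space until they are cleaned up.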
General Post‑Failure Recovery Flow
When a rebalance fails (regardless of the phase), the following standard flow is recommended:
- Record the failure time and error:
  - From the client, use SHOW REBALANCE STATUS, and check memsql.log.
- Verify data safety and redundancy:
  SELECT * FROM information_schema.distributed_partitions_on_leaves
  WHERE database_name = 'your_db';
  Ensure each partition has at least one master.
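Given the rows returned by that query, the "at least one master per partition" check can be sketched as follows (the tuple shape and role values are illustrative; the real column names may differ):

```python
# Hypothetical result rows: (partition_id, host, role).
rows = [
    (0, "leaf1", "Master"), (0, "leaf2", "Slave"),
    (1, "leaf2", "Master"), (1, "leaf1", "Slave"),
    (2, "leaf1", "Slave"),  # no master: redundancy problem
]

all_partitions = {pid for pid, _, _ in rows}
with_master = {pid for pid, _, role in rows if role == "Master"}
missing_master = sorted(all_partitions - with_master)

if missing_master:
    print("Partitions without a master:", missing_master)
```

Any partition reported here needs attention before retrying the rebalance.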
- Plan a safe retry:
  - Address underlying issues:
    - Resource limits (CPU, memory, I/O).
    - Heavy merger load.
    - Backup/maintenance overlap.
  - Preferably use a maintenance window with no user workload.
- Retry the rebalance:
  EXPLAIN REBALANCE ALL DATABASES;
  REBALANCE ALL DATABASES;
- Final cleanup (if needed):
  - Once satisfied that the cluster is balanced and redundancy is correct:
    EXPLAIN CLEAR ORPHAN DATABASES;
    CLEAR ORPHAN DATABASES;
Last modified: March 13, 2026