Rebalance Failure Impact and Cleanup

Rebalance failures affect a SingleStore cluster differently, and may require different cleanup procedures, depending on which of the following phases the failure occurs in:

  • COPY PARTITION

  • PROMOTE PARTITION (WITH or WITHOUT DROP)

  • DROP PARTITION

The details below cover recovery from rebalance failures during typical planned maintenance scenarios (scale‑out, node replacement, and so on).

Rebalance Phases: Conceptual Overview

Rebalance reorganizes database partitions to balance the number of partitions on leaf nodes and to ensure redundancy. To rebalance, the system needs to "move" some or all partitions to new nodes, promote some partitions to master, and drop some partitions.

The copy, promote, and drop phases are performed for all partitions in a database. Thus, for each database being rebalanced, the actions are:

  • copy all partitions of the database that need to be copied.

  • promote all partitions of the database that need promotion.

  • drop all the no longer needed partitions of that database.

The rebalance plan can be seen with the EXPLAIN REBALANCE PARTITIONS command.
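For example, to preview the planned actions for a single database without executing them (the database name your_db is a placeholder):

    EXPLAIN REBALANCE PARTITIONS ON your_db;

The output lists the COPY PARTITION, PROMOTE PARTITION, and DROP PARTITION actions the rebalance would perform, so the plan can be reviewed before committing to the operation.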

A full rebalance (REBALANCE ALL DATABASES / REBALANCE PARTITIONS) conceptually proceeds in three high‑level steps for each affected partition:

  1. COPY PARTITION

    • Create an additional copy of a partition on a target leaf.

    • Source and target both exist; metadata is prepared, but primary/replica roles are not yet switched.

  2. PROMOTE PARTITION (WITH or WITHOUT DROP)

    • Switch roles: for example, make the new copy the master and adjust replication accordingly.

    • If a partition that was a master before the rebalance is to be dropped by the rebalance, PROMOTE PARTITION is called WITH DROP, which drops the old master immediately instead of in step 3.

  3. DROP PARTITION

    • If the partition was a slave before the rebalance, it is dropped in this step: the old, now‑redundant copy is removed once the new placement is fully active and in sync.

    • Dropping a partition reduces extra storage usage and finalizes the new topology.

Failures can occur in any of these steps. The engine’s primary design goal is to avoid data loss and keep the cluster in a consistent state, even if the operation does not complete.

Failure during COPY PARTITION

Typical causes include:

  • External interruption (e.g., KILL CONNECTION / KILL QUERY, or a kill script).

  • Resource exhaustion (e.g., out of memory, I/O saturation).

  • Network or storage errors during data transfer.

  • Lock timeouts from concurrent DDLs (such as OPTIMIZE TABLE).

Impact on the cluster:

  • The source (original) partition remains intact as the authoritative copy.

  • The target copy may be incomplete or discarded. REBALANCE may report failure; in some edge cases, the high‑level command fails even though some internal actions (such as creating new copies) succeeded.

  • Metadata may reflect no new copy (copy discarded), or more replicas than expected for a given partition.

Critically, user data remains safe on the original master; the risk is mainly wasted time and temporary extra storage, not corruption.
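To check whether a failed copy left extra replicas behind, the partition map can be inspected. A minimal check, with your_db as a placeholder:

    SHOW PARTITIONS ON your_db;

Each row shows a partition instance with its host, port, and role; an ordinal that appears on more nodes than the configured redundancy level indicates a leftover copy from the interrupted rebalance.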

Failure during PROMOTE PARTITION (WITH/WITHOUT DROP)

Typical causes include:

  • Timeouts in internal synchronization (for example, _SYNC_PARTITIONS, WaitForLSNWithTimeout).

  • Locks or internal dependencies that prevent fast role transition.

  • Resource pressure (CPU, memory, or merger activity) slowing replication or state updates.

Impact on the cluster:

  • Copy phase may have succeeded and now both old and new copies exist.

  • The old master remains the authoritative partition and the new copy remains a secondary/unpromoted replica or is left unused.

  • Data integrity is preserved; the cluster continues using the old placement.

  • In some cases, there may be extra replicas or unused copies, but never fewer than required for redundancy.
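After a failed promotion, it is worth confirming that every partition still has exactly one master. A sketch of such a check, assuming the information_schema.DISTRIBUTED_PARTITIONS view (column and role-value names may vary by version; your_db is a placeholder):

    SELECT ordinal, COUNT(*) AS masters
    FROM information_schema.DISTRIBUTED_PARTITIONS
    WHERE database_name = 'your_db' AND role = 'Master'
    GROUP BY ordinal
    HAVING COUNT(*) != 1;

An empty result means every partition has exactly one master and the cluster can safely continue on the old placement until the rebalance is retried.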

Failure during DROP PARTITION

Typical causes include:

  • Errors when cleaning up old copies (for example, filesystem issues, transient node problems).

  • The operation is interrupted after the new placement is live but before the old copies are fully dropped.

Impact on the cluster:

  • New partition placement is already active and serving queries.

  • Old copies remain on disk (and may still be visible as extra replicas or orphans).

  • This primarily affects storage usage (more disk consumed than necessary) and operational clarity (extra copies may appear in low‑level views).

  • Database availability and data correctness are not impacted.
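Orphaned copies left behind by an interrupted drop can be previewed before any cleanup:

    EXPLAIN CLEAR ORPHAN DATABASES;

This reports the orphan copies that CLEAR ORPHAN DATABASES would remove, without dropping anything, which helps confirm that only stale copies from the interrupted rebalance are affected.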

General Post‑Failure Recovery Flow

When a rebalance fails (regardless of the phase), the following standard flow is recommended:

  1. Record the failure time and error:

    • From the client, use SHOW REBALANCE ... STATUS, and check memsql.log on the affected nodes.

  2. Verify data safety and redundancy:

    SELECT * FROM information_schema.distributed_partitions_on_leaves
    WHERE database_name = 'your_db';

    Ensure each partition has at least one master.

  3. Plan a safe retry:

    • Address underlying issues:

      • Resource limits (CPU, memory, I/O).

      • Heavy merger load.

      • Backup/maintenance overlap.

    • Preferably use a maintenance window with no user workload.

  4. Retry rebalance:

    EXPLAIN REBALANCE ALL DATABASES;
    REBALANCE ALL DATABASES;

  5. Final cleanup (if needed):

    • Once satisfied that the cluster is balanced and redundancy is correct:

      EXPLAIN CLEAR ORPHAN DATABASES;
      CLEAR ORPHAN DATABASES;
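After cleanup, a status check can confirm that no rebalance actions remain pending for a database (your_db is a placeholder):

    SHOW REBALANCE STATUS ON your_db;

If the last rebalance completed, this shows its finished actions; an interrupted or running rebalance shows per-action progress for the database.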

Last modified: March 13, 2026
