Configure Table Slicing

Slicing divides the source table into multiple independent chunks of data that can be transferred to the destination in parallel. Instead of copying the entire table in a single thread, XL Ingest splits the work across multiple threads, with each thread responsible for a portion of the data. When configuring slicing, select a column to slice on, typically the primary key or an indexed column with evenly-distributed values. XL Ingest divides the data into slices based on value ranges in the slice column. Each range becomes a slice that transfers independently.

XL Ingest provides two approaches to slicing: manual and automatic.

Manual Slicing

Manual slicing provides full control over how the data is divided by allowing the specification of exact values for slice boundaries as a space-separated list. Choose slice boundaries such that each slice is under 10 GB to prevent long-running slices and ensure consistent forward progress.

Understanding Slice Boundaries

In manual slicing, N boundary values create N-1 slices. For example, boundaries 100 500 1000 create two slices. The first slice includes rows with slice column values >= 100 and < 500 and the second slice includes rows with values >= 500 and < 1000. To capture all records, set the first boundary less than or equal to the minimum value of the slice column, and the last boundary to a value greater than any slice column value that may occur during extraction.

Required inputs for manual slicing:

Slice Column: The column to slice on.
Slice Boundaries: Space-separated values that define how to split the data into slices.

To configure and run manual slicing:

Navigate to the Table Setup tab in XL Ingest and select the table from the list on the left.
In the Table Settings panel, select a column to use for slicing.
In the Slices field, enter space-separated boundary values to define slice ranges.
Optionally, enter a Where clause to filter rows during extraction.
Click Save to apply the configuration.
Navigate to the Table Status tab, select the table, and click Sync to begin extraction.

Manual Slicing Example

Consider a table named customers with 10,000 records, where customer_id ranges from 1 to 10,000. Select customer_id as the slice column. In the Slices field, enter:

1 2500 5000 7500 99999

This configuration creates 4 slices: >= 1 and < 2500, >= 2500 and < 5000, >= 5000 and < 7500, >= 7500 and < 99999.

The final boundary is set to a high value to ensure that all records are covered.

Automatic Slicing

Automatic slicing analyzes data distribution and calculates optimal boundaries.

The following inputs are required:

Slice Column: The column to slice on (e.g., customer_id, order_date, last_name).
Number of Slices: The desired number of slices (e.g., 10).
Prefix(chars): Number of leading digits or characters used to calculate slice boundaries.

To configure and run automatic slicing:

Navigate to the Table Setup tab in XL Ingest and select the table from the list on the left.
In the Table Settings panel, select a column to use for slicing.
Click Auto-Slice in the Define Slices section to open the Auto-Slice dialog.
Click Schedule Count to retrieve table statistics from the source (count, max, min, size, time).
Enter the desired No. of Slices (total number of partitions to create).

Note: The actual number of slices created in automatic slicing may vary based on your data distribution.
Enter Prefix (chars) to control slice boundary granularity. This value determines how many leading characters or digits are used to group values into slices.
Review the calculated values including records per slice, slice size, and time per slice.
Click Schedule Slice to generate the slices.
Navigate to the Table Status tab, select the table, and click Sync to begin extraction.

Automatic Slicing Example

Consider a table named orders with 10,000 records. Select order_id as the slice column. Click Schedule Count to see statistics. Enter 4 for No. of Slices and 2 for Prefix(chars), then click Schedule Slice.

The system automatically calculates slices based on data distribution and prefix size.

Using SQL on the Source to Compute Slice Boundaries

When using manual slicing, you can calculate boundaries with assistance from the source SQL database. You can create boundaries for slices of roughly equal size (an approximation of an equi-depth histogram, a type of histogram where the bars are of roughly equal length, and the boundaries of each bar vary in width) using SQL with the NTILE window function as follows:

Set number_of_buckets to the (size of your data / 10GB ) + 1 to create slices of size just under 10GB for large tables.

SQL

WITH RankedData AS (
	SELECT
	    your_column,
	    NTILE(number_of_buckets) OVER (ORDER BY your_column)
	    AS bucket_number
	FROM your_table
),
BucketBoundaries AS (
	SELECT
            bucket_number,
            MIN(your_column) AS lower_bound,
            MAX(your_column) AS upper_bound,
            COUNT(*) AS row_count
	FROM
	    RankedData
	GROUP BY
	    bucket_number
)
SELECT
	bb.bucket_number,
	bb.lower_bound,
	bb.upper_bound,
	bb.row_count
FROM
	BucketBoundaries bb
ORDER BY
	bb.bucket_number;

Note: If your slice column uses UUID or other data types that do not support MIN/MAX functions directly, cast the column to a string type.

Consider the following table:

SQL

CREATE TABLE orderdetails (
    orderNumber      int(11)        NOT NULL DEFAULT '0',
    productCode      varchar(15)    CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '',
    quantityOrdered  int(11)        DEFAULT NULL,
    priceEach        decimal(10,2)  DEFAULT NULL,
    orderLineNumber  smallint(6)    DEFAULT NULL,
    PRIMARY KEY (orderNumber, productCode)
);

To calculate 10 slices on orderNumber column, you can run:

SQL

WITH RankedData AS (
   SELECT
       orderNumber,
       NTILE(10) OVER (ORDER BY orderNumber) AS bucket_number
   FROM
       orderdetails
),
BucketBoundaries AS (
   SELECT
       bucket_number,
       MIN(orderNumber) AS lower_bound,
       MAX(orderNumber) AS upper_bound
   FROM
       RankedData
   GROUP BY
       bucket_number
)
SELECT
   CONCAT(
       GROUP_CONCAT(lower_bound ORDER BY bucket_number SEPARATOR ' '),
       ' ',
       MAX(upper_bound) + 1
   ) AS boundaries
FROM
   BucketBoundaries;

Note: This example uses MySQL syntax. If using PostgreSQL, replace GROUP_CONCAT with STRING_AGG. For other databases, refer to your database documentation for the equivalent aggregation function.

This query returns a space-separated list of boundaries:

10100 10135 10167 10194 10226 10263 10293 10322 10355 10389 10426

The orderNumber column in this example has a minimum value of 10100 and a maximum value of 10425. If new records may be inserted during extraction, configure the last boundary to a higher value to ensure all records are included.

Sampling to Speed Up Boundary Calculation

For extremely large tables, you can use sampling on the source database table in SQL to speed up the scan. For example, sampling at one-hundredth of a percent on a 50 million row table can produce results with slice boundaries nearly identical to a full scan while reducing scan time to a small fraction of the full scan time.

Do not use sampling to calculate the minimum and maximum values of the slice column, as this can miss the actual minimum and maximum. Instead, calculate these values directly with a separate query or use domain knowledge to determine appropriate bounds for your data range.

Consult your source database documentation for details on sampling. For example, PostgreSQL supports the TABLESAMPLE clause for this purpose.