Replicate MongoDB® Collections using SQL

Ensure that the Prerequisites are met.
(Optional) Create a link to the MongoDB® instance. Refer to CREATE LINK for more information. You can also specify the link configuration and credentials in the CONFIG/CREDENTIALS clause of the CREATE {TABLES | TABLE} AS INFER PIPELINE SQL statement instead of creating a link.
Create the required table(s), stored procedure(s), and pipeline(s) using the CREATE {TABLES | TABLE} AS INFER PIPELINE SQL statement. Refer to Syntax for more information. You can either replicate the MongoDB® collections as is or apply custom transformations.
Note
Before restarting the INFER PIPELINE operation, delete all the related artifacts.
SQL
DROP TABLE <target_table_name>;
DROP PIPELINE <pipeline_name>;
DROP PROCEDURE <procedure_name>;
Once all the components are configured, start the pipelines.
- To start all the pipelines, run the START ALL PIPELINES SQL statement.
- To start a specific pipeline, run the START PIPELINE <pipeline_name> SQL statement. By default, the pipeline is named <source_db_name>.<table_name>.
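As a minimal sketch of the second option, the following starts one pipeline by its default name and then lists the pipelines in the current database. The source database name dbTest and the table name orders are hypothetical, and the backtick quoting is an assumption made because the default pipeline name contains a dot.

```sql
-- Hypothetical names: source database dbTest, table orders.
-- Backticks are assumed to be needed because the default name contains a dot.
START PIPELINE `dbTest.orders`;

-- List the pipelines in the current database along with their states.
SHOW PIPELINES;
```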

To ingest data in the JSON format instead of BSON, you need to manually create the required table structures, pipelines, and stored procedures for mapping the BSON data type to JSON.

For more information, refer to the relevant section on this page:

Replication Strategies
CDC Snapshot Strategies
Configure Ingestion Speed Limit using Engine Variables

Replicate MongoDB® Collections Example

The following example shows how to replicate MongoDB® collections without any custom transformations. This example uses the LINK clause to specify the MongoDB® Atlas endpoint connection configuration.

Create a link to the MongoDB® endpoint, for example, the primary node of a MongoDB® Atlas cluster.

SQL

CREATE LINK <linkname> AS MONGODB
  CONFIG '{"mongodb.hosts":"<Hostname>",
    "collection.include.list": "<Collection list>",
    "mongodb.ssl.enabled":"true",
    "mongodb.authsource":"admin"}'
  CREDENTIALS '{"mongodb.user":"<username>",
    "mongodb.password":"<password>"}';

Create tables, pipelines, and stored procedures in SingleStore Helios based on the inference from the source collections.
SQL
```
CREATE TABLES AS INFER PIPELINE AS LOAD DATA 
  LINK <linkname> '*' FORMAT AVRO;
```
Once the link and the tables are created, run the following command to start all the pipelines and begin the data replication process:
SQL
```
-- Start pipelines
START ALL PIPELINES;
```
To view the ingested BSON data, SingleStore recommends the following:
- Use the Kai Shell or other supported MongoDB® tools, such as MongoDB® Compass.
- Cast the columns to JSON using the following SQL command:
  SQL
```
SELECT _id :> JSON , _more :> JSON FROM <table_name>;
```

Syntax

SQL

CREATE TABLE [IF NOT EXISTS] <table_name> 
  AS INFER PIPELINE 
  AS LOAD DATA <mongodb_configuration>
  FORMAT AVRO;

-- Use either the LINK or MONGODB clause, they are mutually exclusive -- 
<mongodb_configuration>:
  LINK <link_name> '<source_db>.<source_collection>' 
  | MONGODB '<source_db>.<source_collection>' CONFIG <config_json> CREDENTIALS <credentials_json>

SQL

CREATE TABLES [IF NOT EXISTS] 
  AS INFER PIPELINE 
  AS LOAD DATA <mongodb_configuration>  
  FORMAT AVRO;

-- Use either the LINK or MONGODB clause, they are mutually exclusive --
<mongodb_configuration>:
  LINK <link_name> '*' 
  | MONGODB '*' CONFIG <config_json> CREDENTIALS <credentials_json>
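
To illustrate the MONGODB clause, the following sketch passes the connection details inline instead of through a link. The CONFIG and CREDENTIALS keys are the ones used in the CREATE LINK example on this page, and every value in angle brackets is a placeholder to replace.

```sql
-- Sketch only: replicate a single collection without creating a link first.
CREATE TABLE IF NOT EXISTS <table_name>
  AS INFER PIPELINE
  AS LOAD DATA MONGODB '<source_db>.<source_collection>'
  CONFIG '{"mongodb.hosts": "<Hostname>",
    "mongodb.ssl.enabled": "true",
    "mongodb.authsource": "admin"}'
  CREDENTIALS '{"mongodb.user": "<username>",
    "mongodb.password": "<password>"}'
  FORMAT AVRO;
```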

`CREATE TABLE ... AS INFER PIPELINE` Behavior

The CREATE TABLE [IF NOT EXISTS] <table_name> AS INFER PIPELINE statement performs the following operations:

Connects to the MongoDB® servers using the specified LINK <link_name> or MONGODB <collection> CONFIG <config_json> CREDENTIALS <credentials_json> clause.
Discovers the available databases and collections filtered by collection.include.list.
Infers the schema of the collection and then creates a table (named <table_name>) in SingleStore using the inferred schema. You can also specify a table name that differs from the name of the source MongoDB® collection. If the specified table already exists, a new table is not created and the existing table is used instead.
Creates a pipeline (named <source_db_name>.<table_name>) and stored procedure (named <source_db_name>.<table_name>) that maps the AVRO data structure to the SingleStore data structure. The [IF NOT EXISTS] clause is ignored for pipelines and stored procedures. If a pipeline or stored procedure with the same name already exists, the CREATE TABLE ... AS INFER PIPELINE statement returns an error.
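
After the statement completes, one way to confirm that the table, pipeline, and stored procedure exist is to list them. This is a sketch that assumes the commands are run in the target SingleStore database.

```sql
-- Each command lists one kind of object in the current database.
SHOW TABLES;
SHOW PIPELINES;
SHOW PROCEDURES;
```

If a pipeline or stored procedure with the expected name is already listed before you run the statement, drop it first, because the IF NOT EXISTS clause does not apply to those objects.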

`CREATE TABLES AS INFER PIPELINE` Behavior

The CREATE TABLES [IF NOT EXISTS] AS INFER PIPELINE statement creates a table for each collection in the source database using the same set of operations as the CREATE TABLE [IF NOT EXISTS] <table_name> AS INFER PIPELINE statement (specified above).

Arguments

<table_name>: Name of the table to create in the SingleStore Helios database. You can also specify a table name that differs from the name of the source MongoDB® collection.
<link_name>: Name of the link to the MongoDB® endpoint. Refer to CREATE LINK for more information.
<collection>: Name of the source MongoDB® collection.
<config_json>: Configuration parameters, including the source MongoDB® configuration, in the JSON format. Refer to Parameters for supported parameters.
<credentials_json>: Credentials to use to access the MongoDB® database, in JSON format. For example:
```
CREDENTIALS '{"mongodb.password": "<password>", "mongodb.user": "<user>"}'
```
- mongodb.user: The name of the database user to use when connecting to MongoDB® servers.
- mongodb.password: The password to use when connecting to MongoDB® servers.

Parameters

The CREATE {TABLE | TABLES} AS INFER PIPELINE, CREATE LINK, and CREATE AGGREGATOR PIPELINE statements support the following parameters in the CONFIG clause:

mongodb.hosts: A comma-separated list of MongoDB® servers (nodes) in the replica set, in 'hostname:[port]' format.

The mongodb.connection.string and mongodb.hosts parameters are mutually exclusive, i.e., they cannot be used in the same CREATE TABLE ... AS INFER PIPELINE statement.
- If mongodb.members.auto.discover is set to FALSE, you must prefix the 'hostname:[port]' with the name of the replica set in mongodb.hosts, e.g., rset0/svchost-xxx:27017. The first node specified in mongodb.hosts is always selected as the primary node.
- If mongodb.members.auto.discover is set to TRUE, you must specify both the primary and secondary nodes in the replica set in mongodb.hosts.
mongodb.connection.string: Specifies the URI of the remote MongoDB® instance. This parameter supports both the standard and SRV connection string formats. The mongodb.connection.string and mongodb.hosts parameters are mutually exclusive, i.e., they cannot be used in the same CREATE TABLE ... AS INFER PIPELINE statement.
mongodb.members.auto.discover: Specifies whether the MongoDB® servers defined in mongodb.hosts should be used to discover all the members of the replica set. If disabled, the servers are used as is.
mongodb.ssl.enabled: Enables the connector to use SSL when connecting to MongoDB® servers.
mongodb.authsource: Specifies the database containing MongoDB® credentials to use as an authentication source. This parameter is only required when the MongoDB® instance is configured to use authentication with an authentication database other than admin.
mongodb.socket.timeout.ms: Specifies the socket timeout (in milliseconds) for connections to a MongoDB® instance. The pipeline returns an error if there is no response from the MongoDB® server within the specified timeout. By default, mongodb.socket.timeout.ms is set to 120000 (2 minutes).

Note: SingleStore does not recommend updating this parameter unless troubleshooting unusual behavior.
collection.include.list: A comma-separated list of regular expressions that match fully-qualified namespaces (in databaseName.collectionName format) for MongoDB® collections to monitor. By default, all the collections are monitored, except for those in the local and admin databases. When this option is specified, collections excluded from the list are not monitored. The collection.include.list and collection.exclude.list parameters are mutually exclusive, i.e., they cannot be used in the same CREATE TABLE ... AS INFER PIPELINE statement. This parameter is only supported in CREATE TABLE ... AS INFER PIPELINE statements.
collection.exclude.list: A comma-separated list of regular expressions that match fully-qualified namespaces (in databaseName.collectionName format) for MongoDB® collections to exclude from the monitoring list. By default, this list is empty. The collection.include.list and collection.exclude.list parameters are mutually exclusive, i.e., they cannot be used in the same CREATE TABLE ... AS INFER PIPELINE statement. This parameter is only supported in CREATE TABLE ... AS INFER PIPELINE statements.
database.include.list (Optional): A comma-separated list of regular expressions that match the names of databases to monitor. By default, all the databases are monitored. When this option is specified, databases excluded from the list are not monitored. The database.include.list and database.exclude.list parameters are mutually exclusive, i.e., they cannot be used in the same CREATE TABLE ... AS INFER PIPELINE statement. If this option is used with the collection.include.list or collection.exclude.list option, it returns the intersection of the matches. This parameter is only supported in CREATE TABLE ... AS INFER PIPELINE statements.
database.exclude.list (Optional): A comma-separated list of regular expressions that match the names of databases to exclude from monitoring. By default, this list is empty. The database.include.list and database.exclude.list parameters are mutually exclusive, i.e., they cannot be used in the same CREATE TABLE ... AS INFER PIPELINE statement. If this option is used with the collection.include.list or collection.exclude.list option, it returns the intersection of the matches. This parameter is only supported in CREATE TABLE ... AS INFER PIPELINE statements.
signal.data.collection (Optional): A collection in the remote source that is used by SingleStore to generate special markings for snapshotting and synchronization. By default, this parameter is set to singlestore.signals_xxxxxx, where xxxxxx is an automatically generated character sequence. The default signal collection is in the database named singlestore. The MongoDB® user must have write permissions to this collection. Once the pipelines are started, any change to the value of this parameter is ignored, and the pipelines use the latest value specified before the pipelines started.
max.queue.size: Specifies the size of the queue inside the extractor process for records that are ready for ingestion. The default queue size is 1024. This variable also specifies the number of rows for each partition. Increasing the queue size increases the memory consumption of the replication process, and you may need to increase the pipelines_cdc_java_heap_size engine variable.
max.batch.size: Specifies the maximum number of rows of data fetched from the remote source in a single iteration (batch). The default batch size is 512. max.batch.size must be lower than max.queue.size.
poll.interval.ms: Specifies the interval for polling of remote sources if there were no new records in the previous iteration in the replication process. The default interval is 500 milliseconds.
snapshot.mode: Specifies the snapshot mode for the pipeline. It can have one of the following values:
- "initial" (Default): Perform a full snapshot first and replay CDC events created during the snapshot. Then, continue ingestion using CDC.
- "incremental": Start the snapshot operation and CDC simultaneously.
- "never": Skip the snapshot, and ingest changes using CDC.
Refer to CDC Snapshot Strategies for more information.
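
The following sketch combines several of the parameters above in one CREATE LINK statement. The replica set name and host reuse the rset0/svchost-xxx:27017 example from the mongodb.hosts description, the queue and batch sizes are illustrative values only, and writing boolean and numeric values as JSON strings is an assumption based on how "mongodb.ssl.enabled" is written elsewhere on this page.

```sql
-- Sketch only: explicit replica set member, larger queue and batch, incremental snapshot.
CREATE LINK <linkname> AS MONGODB
  CONFIG '{"mongodb.hosts": "rset0/svchost-xxx:27017",
    "mongodb.members.auto.discover": "false",
    "collection.include.list": "<Collection list>",
    "max.queue.size": "2048",
    "max.batch.size": "1024",
    "snapshot.mode": "incremental"}'
  CREDENTIALS '{"mongodb.user": "<username>",
    "mongodb.password": "<password>"}';
```

Here max.batch.size stays lower than max.queue.size, as required. Because the queue is larger than the default, the pipelines_cdc_java_heap_size engine variable may also need to be raised.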

Replication Strategies

Use one of the following methods to create the required components for data ingestion.

Replicate MongoDB® Collections As Is

To replicate or migrate MongoDB® collections as is, use the CREATE {TABLES | TABLE} AS INFER PIPELINE SQL statement. This method automatically creates the required tables, pipelines, and stored procedures. Refer to Syntax for more information.

Apply Transformations or Ingest a Subset of Columns

To apply transformations or ingest only a subset of columns, manually create the required tables, stored procedures, and pipelines:

Run the CREATE {TABLES | TABLE} AS INFER PIPELINE SQL statement to infer the schema of the MongoDB® collection(s) and automatically generate templates for the relevant table(s), stored procedure(s), and aggregator pipeline(s).
Use the automatically-generated templates as a base to create new table(s), stored procedure(s), and pipeline(s) for custom transformations. To inspect the generated table(s), stored procedure(s), and pipeline(s), use the SHOW CREATE TABLE, SHOW CREATE PROCEDURE, and SHOW CREATE PIPELINE commands, respectively. After running the SHOW commands, you can drop the templates and then recreate the same components with custom transformations.

Using the automatically-generated templates:
1. Create table(s) in SingleStore with a structure that can store the ingested MongoDB® collection. Refer to CREATE TABLE for more information.
2. Create stored procedure(s) to map the MongoDB® collection to the SingleStore table and implement any other required transformations. Refer to CREATE PROCEDURE for information on creating stored procedures.
3. Create pipeline(s) to ingest the MongoDB® collection(s) using the CREATE AGGREGATOR PIPELINE SQL statement.
  
  Refer to Parameters for a list of supported parameters. Refer to CREATE PIPELINE for the complete syntax and related information.
  
  Note: The CDC feature only supports AGGREGATOR pipelines.

Refer to Syntax for information on the CREATE {TABLES | TABLE} AS INFER PIPELINE SQL statement.
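
As a sketch of the inspection step described above, the following commands print the definitions of the generated templates so that they can be copied and edited. The pipeline and stored procedure names follow the default <source_db_name>.<table_name> naming, and the backtick quoting is an assumption made because those names contain a dot.

```sql
-- Print the generated definitions to use as a starting point.
SHOW CREATE TABLE <table_name>;
SHOW CREATE PROCEDURE `<source_db_name>.<table_name>`;
SHOW CREATE PIPELINE `<source_db_name>.<table_name>`;

-- After saving the output, drop the generated templates and recreate
-- the components with the custom transformations applied.
```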

CDC Snapshot Strategies

SingleStore supports the following strategies for creating snapshots:

Perform a full snapshot before CDC ("snapshot.mode":"initial"):

The pipeline captures the position in the oplog and then performs a full snapshot of the data. Once the snapshot is complete, the pipeline continues ingestion using CDC. This strategy is enabled by default. If the pipeline is restarted while the snapshot is in progress, the snapshot is restarted from the beginning.

If the initial snapshot is large in size and the deployment is prone to restarts or connection issues from the source, SingleStore recommends using the incremental snapshot mode. Note that the incremental snapshot mode is slower than the initial mode. For faster data ingestion when the initial historical data is large in size, manually perform the snapshot and capture changes using the never mode.

To use this strategy, set "snapshot.mode":"initial" in the CONFIG JSON.

Requirement: The oplog retention period must be long enough to maintain the records while the snapshot is in progress. Otherwise, the pipeline will fail and the process will have to be started over.

CDC only ("snapshot.mode":"never"):

The pipeline will not ingest existing data, and only the changes are captured using CDC.

To use this strategy, set "snapshot.mode":"never" in the CONFIG JSON.

Perform a snapshot in parallel to CDC ("snapshot.mode":"incremental"):

The pipeline captures the position in the oplog and starts capturing the changes using CDC. In parallel, the pipeline performs incremental snapshots of the existing data and merges it with the CDC records. Although this strategy is slower than performing a full snapshot and then ingesting changes using CDC, it is more resilient to pipeline restarts.

To use this strategy, set "snapshot.mode":"incremental" in the CONFIG JSON.

Requirement: The oplog retention period must be long enough to compensate for unexpected pipeline downtime.

Manually perform the snapshot and capture changes using the CDC pipeline:

To use this strategy, set "snapshot.mode":"never" in the CONFIG JSON.
1. Create a pipeline and then wait for at least one batch of ingestion to capture the oplog position.
2. Stop the pipeline.
3. Snapshot the data using any of the suitable methods, for example, mongodump.
4. Restore the snapshot in SingleStore using any of the supported tools, for example, mongorestore.
5. Start the CDC pipeline.
This strategy provides faster data ingestion when the initial historical data is very large in size.

Requirement: The oplog retention period must be long enough to maintain the records while the snapshot is in progress. Otherwise, the pipeline will fail and the process will have to be started over.
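
The manual strategy above can be sketched in SQL as follows. The sketch assumes that the link was created with "snapshot.mode":"never" in its CONFIG JSON, that the pipeline uses the default <source_db_name>.<table_name> name, and that backticks are needed because the name contains a dot. The mongodump and mongorestore steps run outside SQL and appear only as comments.

```sql
-- Step 1: create the table, stored procedure, and pipeline, then start the pipeline.
CREATE TABLE <table_name> AS INFER PIPELINE AS LOAD DATA
  LINK <link_name> '<source_db>.<source_collection>' FORMAT AVRO;
START PIPELINE `<source_db_name>.<table_name>`;

-- Wait for at least one batch so that the oplog position is captured.

-- Step 2: stop the pipeline.
STOP PIPELINE `<source_db_name>.<table_name>`;

-- Steps 3 and 4: take the snapshot (for example, with mongodump) and
-- restore it in SingleStore (for example, with mongorestore).

-- Step 5: resume change capture from the recorded oplog position.
START PIPELINE `<source_db_name>.<table_name>`;
```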

Configure Ingestion Speed Limit using Engine Variables

Use the following engine variables to configure ingestion speed:

| Variable Name | Description | Default Value |
| --- | --- | --- |
| `pipelines_cdc_row_emit_delay_us` | Specifies a forced delay in row emission while migrating/replicating your tables (or collections) to your SingleStore Helios databases. It can have a maximum value of `1000000`. | `1` |
| `pipelines_cdc_java_heap_size` | Specifies the JVM heap size limit (in MBs) for CDC-in pipelines. | `128` |
| `pipelines_cdc_max_extractors` | Specifies the maximum number of CDC-in extractor instances that can run concurrently. | `16` |
| `pipelines_cdc_min_extractor_lifetime_s` | Specifies the minimum duration (in seconds) that the extractor allocates to a single pipeline for ingesting data and listening to CDC events. | `60` |

In-Depth Variable Definitions

Use the pipelines_cdc_row_emit_delay_us engine variable to limit the impact of CDC pipelines on the master aggregator node. It specifies a forced delay in row emission during ingest. This variable can be set to a maximum value of 1000000.

Use the max.queue.size parameter in the CONFIG JSON to control the ingestion speed. SingleStore recommends setting max.batch.size to half of max.queue.size. Increasing the queue size requires a larger Java heap limit, so adjust the pipelines_cdc_java_heap_size engine variable accordingly. Query the INFORMATION_SCHEMA.PIPELINES_BATCHES_SUMMARY table for information on pipeline batch performance.

Use the pipelines_cdc_max_extractors engine variable to limit the number of CDC-in extractor instances that can run concurrently. If the number of CDC-in pipelines is greater than pipelines_cdc_max_extractors, some pipelines will have to wait in the queue until an extractor can be acquired to fetch data. This variable can be set to a maximum value of 1024.

Use the pipelines_cdc_min_extractor_lifetime_s variable to specify the minimum duration (in seconds) that the extractor allocates to a single pipeline for ingesting data and listening to CDC events. This variable can be set to a maximum value of 3600.
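
The following sketch shows one way to inspect and adjust these variables and then review batch performance. The values are illustrative, not recommendations. It assumes that the variables can be changed with SET GLOBAL in your deployment and that the PIPELINE_NAME and BATCH_ID columns are available in PIPELINES_BATCHES_SUMMARY.

```sql
-- Show the current values of the CDC-related engine variables.
SHOW VARIABLES LIKE 'pipelines_cdc%';

-- Illustrative values: slow down row emission and raise the JVM heap limit (in MB).
SET GLOBAL pipelines_cdc_row_emit_delay_us = 100;
SET GLOBAL pipelines_cdc_java_heap_size = 256;

-- Review the most recent batches of one pipeline.
SELECT *
FROM information_schema.PIPELINES_BATCHES_SUMMARY
WHERE PIPELINE_NAME = '<pipeline_name>'
ORDER BY BATCH_ID DESC
LIMIT 10;
```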

Optimize CDC-in Pipelines

Pipelines and Extractors

CDC-in pipelines ("pipelines") are aggregator pipelines that run on the Master Aggregator (MA). Each pipeline extracts data from a single source table and loads data into a single SingleStore table. Extractors are subprocesses that extract data from the source and provide the data to the pipelines. The extractors are shared between the CDC-in pipelines.

Because the MA has limited resources, it can run only a limited number of CDC-in extractors. If the number of active pipelines exceeds the maximum number of extractors (pipelines_cdc_max_extractors), the extractors serve pipelines_cdc_max_extractors pipelines at a time for pipelines_cdc_min_extractor_lifetime_s seconds, and then move on to the next set of pipelines in the queue. During this time, the pipelines waiting for their turn may create cancelled batches.

For example, if the total number of pipelines is 50, pipelines_cdc_max_extractors is set to 10, and pipelines_cdc_min_extractor_lifetime_s is set to 60, then the extractors allocate resources to the first 10 pipelines for 60 seconds, then to the next 10 pipelines in the queue for the next 60 seconds, and so on.
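
The arithmetic behind this example can be checked with a short query. This is only a sketch of the calculation, using the numbers from the example above. It treats 60 seconds as the exact time spent on each group, although pipelines_cdc_min_extractor_lifetime_s is a minimum, so a real rotation may take longer.

```sql
-- 50 pipelines served 10 at a time for 60 seconds per group.
SELECT CEIL(50 / 10) AS pipeline_groups,               -- 5 groups
       CEIL(50 / 10) * 60 AS seconds_per_full_rotation; -- at least 300 seconds
```

In other words, each pipeline in this scenario may wait roughly four minutes between its turns, which is one reason to keep the number of CDC-in pipelines small.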

SingleStore recommends ingesting a limited number of tables using CDC-in pipelines.

Memory and Resource Consumption

Each extractor consumes a persistent amount of resources (approximately pipelines_cdc_java_heap_size per extractor). The maximum memory consumption by the pipelines on the MA is approximately pipelines_cdc_max_extractors * pipelines_cdc_java_heap_size MB. This memory is used to store the queue of rows for subsequent batches of ingestion.

Note: Total memory consumption may be higher because it also includes static system memory, shared libraries, the JVM heap, etc.

To reduce the resource consumption on the MA, reduce either the maximum number of extractors (pipelines_cdc_max_extractors) or the JVM heap size (pipelines_cdc_java_heap_size). However, reducing either of these limits may also reduce the ingestion speed.

Note: If JVM heap size is reduced, you may also need to reduce the pipeline batch and queue size.
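
As a worked example of the estimate above, the following sketch applies the formula to the default values listed on this page (16 extractors and a 128 MB heap) and to a hypothetical reduced setting of 8 extractors.

```sql
-- Approximate upper bound with the defaults: 16 extractors * 128 MB.
SELECT 16 * 128 AS approx_max_memory_mb;   -- 2048

-- Hypothetical reduced setting: halving the extractors halves the bound.
SELECT 8 * 128 AS approx_max_memory_mb;    -- 1024
```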

Troubleshooting

If the CREATE {TABLES | TABLE} AS INFER PIPELINE SQL statement returns an error, run the SHOW WARNINGS command to view the reason behind the failure.
To view the status of the pipelines, query the information_schema.PIPELINES_CURSORS table. Run the following SQL statement to display the status of the replication task:
SQL
```
SELECT SOURCE_PARTITION_ID, 
 EARLIEST_OFFSET,
 LATEST_OFFSET,
 LATEST_EXPECTED_OFFSET-LATEST_OFFSET as STATUS,
 UPDATED_UNIX_TIMESTAMP 
FROM information_schema.PIPELINES_CURSORS;
```
The value in the STATUS column indicates the following:
- 1: Indicates that the snapshot is in progress and the pipeline is expecting more data.
- 0: Indicates that the initial snapshot is complete and the pipeline is capturing changes using CDC.
Note: If snapshot.mode is set to incremental, the pipeline performs incremental snapshots in parallel to capturing changes via CDC. In this case, the value 1 in the STATUS column indicates that the pipeline is performing incremental snapshots and capturing changes in parallel.
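As a variation on the query above, the following sketch labels each row with the phase that the STATUS value implies. The filter on PIPELINE_NAME is an assumption about the columns available in PIPELINES_CURSORS, and the label text is illustrative.

```sql
-- Translate the offset difference into a readable phase.
SELECT SOURCE_PARTITION_ID,
  CASE LATEST_EXPECTED_OFFSET - LATEST_OFFSET
    WHEN 0 THEN 'snapshot complete, capturing changes using CDC'
    WHEN 1 THEN 'snapshot in progress'
    ELSE 'other'
  END AS PHASE,
  UPDATED_UNIX_TIMESTAMP
FROM information_schema.PIPELINES_CURSORS
WHERE PIPELINE_NAME = '<pipeline_name>';
```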
To view pipeline errors, run the following SQL statement:
SQL
```
SELECT * FROM information_schema.PIPELINES_ERRORS;
```
If a pipeline fails with an out of memory error in Java, either increase the heap size using the pipelines_cdc_java_heap_size engine variable or reduce the max.queue.size and max.batch.size parameters in the CONFIG clause. The heap size is limited by the memory available on the MA. SingleStore recommends setting the queue size to double the batch size.

Example

Example 1 - Create Tables with Different Names from the Source Collection

To create tables in SingleStore with names that differ from the name of the source MongoDB® collection, use the following syntax:

SQL

CREATE TABLE <new_table_name> AS INFER PIPELINE AS LOAD DATA 
LINK <link_name> '<source_db.source_collection>' FORMAT AVRO;

You can also use this command to import collections if a table with the same name already exists in SingleStore. Additionally, you can use this syntax to reimport a collection with a distinct table name.

Example 2 - Use Regular Expressions to Include or Exclude Collections or Databases

The names of the databases or collections to include or exclude are specified using regular expressions. For example, to include or exclude all the collections in a database named dbTest, use the following regular expression:

dbTest[.].* 
-- OR -- 
dbTest\..*

where,

dbTest matches the exact string dbTest.
[.] and \. match the character dot (.), which represents the dot (.) in <database_name>.<collection_name> notation.

Note that . is a special character in regular expressions, and to match the character . it must either be escaped (\.) or specified as a literal character ([.]).
.* matches any sequence of characters, because the dot (.) matches any single character and the asterisk (*) matches zero or more occurrences of the preceding element.
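
To see why the dot must be escaped, the pattern can be tried against sample namespaces with the RLIKE operator. This is a sketch: the namespaces are hypothetical, and the ^ and $ anchors are added here on the assumption that the include and exclude lists match the whole namespace.

```sql
-- Literal dot: matches a collection in dbTest.
SELECT 'dbTest.foo' RLIKE '^dbTest[.].*$';   -- expected to return 1

-- Literal dot: does not match a name that lacks the separator.
SELECT 'dbTestXfoo' RLIKE '^dbTest[.].*$';   -- expected to return 0

-- Unescaped dot: matches any character, so the same name is accepted.
SELECT 'dbTestXfoo' RLIKE '^dbTest..*$';     -- expected to return 1
```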

Here's a sample CREATE LINK statement to include all the collections in the dbTest database, for example, dbTest.foo, dbTest.bar, dbTest.exampleCollection, etc.:

SQL

CREATE LINK mongoRepl AS MONGODB 
CONFIG '{ 
    "mongodb.connection.string": "mongodb+srv://cluster0.mongodb.net/", 
    "collection.include.list": "dbTest\..*", 
    "mongodb.ssl.enabled": "true", 
    "mongodb.authsource": "admin" 
}' 
CREDENTIALS '{ 
    "mongodb.user": "<username>", 
    "mongodb.password": "<password>" 
}';

To specify the collections or databases to include or exclude from the replication task, use either (or a combination) of the following parameters: collection.include.list, collection.exclude.list, database.include.list, or database.exclude.list. Refer to Parameters for more information.
