Augmenting Your Data Warehouse to Accelerate BI

Welcome to SingleStore Helios! We put together a guide to help you augment your data warehouse by migrating some of your BI workloads to SingleStore. Use this guide to follow a top-down approach to:

  • Preparing your data from your existing data warehouse

  • Interacting with SingleStore

  • Designing your SingleStore tables to accelerate your dashboards

  • Incorporating streaming data with SingleStore

  • Optimizing your queries for your dashboards

  • Connecting your dashboard to SingleStore

Tutorial Outline

This tutorial is divided into the following sections:

Getting Started

  • Benefits of leveraging SingleStore Helios for your BI dashboard

    • Why users leverage SingleStore to augment their existing data warehouse technologies

  • What to consider when bringing your data from:

    • Snowflake, Redshift, BigQuery, Synapse

    • AWS S3, Google CS, Azure, Kafka

  • What to consider when designing your database coming from:

    • MySQL, MariaDB, Postgres

Sample Architecture

Preparing Your Data for SingleStore

Database Administration

  • Cloud Portal

  • Drivers

  • MySQL Workbench

  • MySQL Command Line

Designing Your Tables

  • Data types

  • Designing a schema for an application

    • Sharding for application workloads

    • Sort keys for application workloads

    • Reference Tables

Data Ingest

  • How this may be different from ingesting directly to the data warehouse

  • Using SingleStore Pipelines

  • Using INSERT Statements

Testing Your Queries and Performance

  • Run queries

  • Visual Explain

  • Concurrency test your queries using dbench based on app expectations

Writing Back to Your Application

Connecting Your BI Tools

  • Connectivity samples for Tableau, Power BI, Looker

Getting Started

We are excited for you to get started accelerating your BI dashboards by augmenting your cloud data warehouse with SingleStore Helios (SSMS). We have many users that have started their BI projects with data warehouses like Redshift, Snowflake and BigQuery, and realized that due to the need for fast ingest, performance, and high concurrency, they needed to augment their architecture with a highly performant, scalable cloud-native database.

Some terminology used in this guide:

  • App/Application: Web- or mobile-based, customer or internal facing application.

  • Dashboard: Visual analytics displayed either through commercial business intelligence tools or custom-built solutions.

  • Object storage: Cloud-based data repository.

Data warehouse users often find improved query latency and concurrency support using SingleStore Helios as the data store directly fueling their dashboards, while retaining their data warehouse for long term storage. Not only that, but data ingested directly into SingleStore is immediately able to be queried, rather than waiting minutes or hours for batch ingest into the data warehouse to complete.

Let’s get started by discussing some of the basic things to consider when bringing your data over from one of these data warehouses.

Things to Consider

For users starting with a database already hosted in a cloud provider like AWS, Azure, or GCP, you may already have your data sitting in object storage (if not, we walk through how to do this later). SingleStore has a feature called Pipelines, which allows you to bring in data from any of these places very quickly. We’ll go through how to do this in a bit.

Now, of course, every database is different. Let’s talk a bit about database design differences.

Things to know, regardless of your current platform:

  • SingleStore Helios can be deployed in any of the three major clouds, in any region – so you don’t have to worry about moving regions.

  • Cloud Enterprise Data Warehouses (EDWs) typically leverage ANSI SQL, and SingleStore does too. However, SingleStore leverages MySQL syntax so there may be some modifications required. SingleStore is also ACID compliant.

Things to know for specific databases:

Snowflake

Users often use SingleStore Helios in conjunction with Snowflake to achieve improve concurrency and ingest SLAs at a lower overall TCO.

Redshift

  • Users often use SingleStore Helios in conjunction with Redshift to improve overall query latency for workloads with transactions and analytics.

  • Many users come to SingleStore Helios after they have reached the Redshift column limitation as well.

  • Redshift’s "leader nodes" and "compute nodes" correspond to SingleStore Helios’s aggregator nodes and leaf nodes. Aggregators handle client communication and query orchestration, and leaf nodes manage data and compute power. You don’t need to know too much here given that this is a managed service, but you’ll see the node types denoted within the "Nodes" section of the Cloud Portal.

BigQuery

Users typically find SingleStore Helios to be faster in overall ingest speeds and query latency with concurrency, but may find BigQuery simpler to implement with the overall GCP stack.

Azure Synapse

Synapse’s "control nodes" and "compute nodes" correspond to SingleStore Helios’s "aggregator nodes" and "leaf nodes". Aggregators handle client communication and query orchestration, and leaf nodes manage data and compute power. You don’t need to know too much here given that this is a managed service, but you’ll see the node types denoted within the "Nodes" section of the Cloud Portal.

Sample Architectures

Many of our users experience slow dashboard response times when overwhelming their data warehouses with various different speeds and sizes of data.

Keeping those data warehouses in place and augmenting them with a high performance database built for fast ingest and concurrency helps our users not only save money, but deliver a stellar user experience.

Preparing Your Data for SingleStore

SingleStore has simple, powerful methods of bringing data in from object storage. Here are some options for getting your existing databases exported to object storage in places like S3, GCS and Azure Blob Storage.

Is your data already in cloud storage? Feel free to move on to the next section.

Existing Managed Cloud Databases:

For other databases that you’re looking to migrate, you can still export that data to CSV, JSON, etc. and then upload it to object storage. Typically this would involve something like a SELECT ... INTO OUTFILE ... from your existing database. From there, you can upload those files to object storage.

If you are importing data from MySQL or MariaDB, you can perform a simple mysqldump as listed in Transition from MySQL to SingleStore Helios.

Don’t want to do a bulk import? That’s fine too! After we discuss schema design, we’ll walk you through connecting directly from your application and writing data.

Database Administration

Now that we have identified our data source, let’s talk a bit about how we’ll interact with SingleStore.

When you signed up in Cloud Portal, you got access to our SQL Editor and Visual Explain tools. The Cloud Portal is the best place to interact with your SingleStore data, build new data pipelines, and test out queries. Make sure you explore the features in the Cloud Portal, as well as the Tutorials available through the Help button.

Ensure that you have the appropriate client drivers prior to using other database administration tools.

MySQL Workbench

If you’re coming from MariaDB or MySQL, you may already be comfortable with Workbench. You can download MySQL Workbench here.

Note: When defining your connection, you will need to go to the Advanced tab and insert defaultAuth=mysql_native_password in the "Others:" field to ensure proper authentication.

MySQL Command Line

Within your Cloud Portal, your workspace details will include a MySQL Command section. This gives you an easy way to use the command line to connect to SingleStore. You can find the download for the MySQL shell here.

Note: When defining your connection, you will need to enter this at the MySQL command line: --defaultAuth=mysql_native_password to ensure proper authentication.

Designing Your Tables

At this point, you should be using the SQL Editor or some other MySQL client to work with your cloud database. Before checking out how fast SingleStore can bring in your data, let’s make sure your tables are designed optimally for your application workload.

By default within SingleStore Helios, database tables are created using our Universal Storage format (i.e., disk-based columnstore). There are a few important things to consider when designing tables to house your data:

Data Types

If you’re familiar with relational databases, you may not need too much guidance on data types.

In addition to data types traditionally supported in relational databases, SingleStore also supports JSON and geospatial data types. Read about Data Types in SingleStore.

Shard Key

This key determines how data is distributed across the database workspace, and is critical to ensure that your data isn’t skewed. Skewed data can lead to longer query times. Data contained within unique values of your shard key will reside in individual partitions of the database. Well distributed data will ensure scalability and parallelism when fulfilling diverse requests concurrently.

  • Snowflake users may be familiar with micro-partitioning and workspaceing to distribute data. If you’ve perhaps found these difficult to maintain on ever-changing data, you are in the right place.

  • Redshift users would be familiar with the various mechanisms listed here for data distribution across a workspace.

  • BigQuery users will be familiar with the concept of sharding from partitioned tables.

  • Synapse users will be familiar with the concept of sharding, though SingleStore Helios gives you a bit more flexibility here on data distribution.

So how do I pick a shard key best for my application workload?

  • If you have a primary key, make sure the shard key is a subset of it (because cardinality matters!).

  • If your BI queries include frequent joins or filters on a specific set of columns, make sure the shard key is a subset of those (in both tables, for joins).

  • For highly concurrent workloads, make sure your shard key allows your queries to be single partition (i.e., join or filter columns within shard key).

In this example, we use user_id as our shard key, which works nicely given its high cardinality as a part of this dataset. All records with the same user_id will be maintained together, which will improve query response time.

CREATE TABLE clicks (
click_id BIGINT AUTO_INCREMENT,
user_id INT,
page_id INT,
ts TIMESTAMP,
SHARD KEY (user_id),
SORT KEY (click_id, user_id)
);

Columnstore Key

In addition to identifying your shard key, it’s important to tell SingleStore how you would like to sort your data within each data segment. This helps SingleStore enable segment elimination, which ensures a minimal amount of data needs to be read for each query. This also helps SingleStore presort data for your queries.

If you’re familiar with BigQuery, you may note that this is similar to the concept of workspaceing.

So how do I pick a columnstore key best for my BI workload?

  • BI dashboards often require lots of filtering. If you have common filter columns, make sure those are in the columnstore key.

  • If you’re inserting in order by some column, it’s best to put that column first in the columnstore key.

  • Lower cardinality columns should be first in the columnstore key.

In this example, we use price as our sort key, so items are sorted in order of that column when queried.

CREATE TABLE products (
ProductId INT,
Color VARCHAR(10),
Price INT,
Qty INT,
SORT KEY (Price),
SHARD KEY (ProductId)
);

Reference Tables

If you have small, infrequently changing table(s) that are required for joins, consider making them Reference Tables.

  • Reference tables are a convenient way to recreate dimension tables that you may use in MySQL, MariaDB, or PostgreSQL.

  • Reference tables are replicated to each leaf in the workspace ensuring data does not need to go across the network between partitions to join data.

  • Reference table commands need to be run from the endpoint listed in the Cloud Portal.

Data Ingest

Now for the fun part, ingesting data! This is where things may look a bit different to you compared to other databases, because SingleStore has this unique ingest capability called Pipelines that supports high-frequency, parallel ingest of data from sources like S3, Azure Blob, GCS, Kafka, etc. Skip-list indexes and concurrent versioning technologies allow these writes to not lock tables, allowing reads to continue unimpacted during ingest.

Previously, we discussed how to bring your data from Snowflake, BigQuery, Redshift, etc. into object storage. Now it’s time to go fetch that data from your object storage and import it into SingleStore.

SingleStore Pipelines

To use a Pipeline to import data into SingleStore, write a CREATE PIPELINE statement using our SQL Editor or a MySQL client.

A few things to consider:

  • Make sure that your security settings in your blob storage will allow for access from SingleStore. For example, AWS S3 security settings can be found here.

  • Make sure your buckets are not public, but you should be able to obtain an access and secret key using the AWS doc here.

  • You can use wildcard notation when identifying your files from within the bucket.

Here’s an example of a CREATE PIPELINE statement:

CREATE PIPELINE library
AS LOAD DATA S3 'my-bucket-name'
CONFIG '{"region": "us-west-1", "suffixes": ["csv"]}'
CREDENTIALS '{"aws_access_key_id": "your_access_key_id", "aws_secret_access_key": "your_secret_access_key"}'
INTO TABLE `classic_books`
FIELDS TERMINATED BY ',';

GCP Documentation:

Pipeline Statements

Bucket Security

Azure Documentation:

Pipeline Statements

Bucket Security

Cloud Data Pipelines

Perhaps you’re interested in hooking up SingleStore directly to your cloud data pipelines. Below, we have a few examples of how to do that using commonly used frameworks. We are always adding more, so feel free to share feedback on which you’d like to see next.

AWS Glue

Check out our native integration with AWS Glue. This example will walk you through the value of using Glue with SingleStore, as well as step by step configuration, security and connectivity settings to use.

Google Cloud Dataflow

Users can easily integrate SingleStore into their Dataflow pipelines using the JDBCIO.Write class.

INSERT Statements

Perhaps you already have an application that is writing to your existing database, and you simply want to redirect the writes from that application to SingleStore. That’s great! Many of our users do this with the MySQL JDBC driver. You can find examples on writing your insert statements here.

Testing Your Queries

Here are some helpful hints for testing your queries.

Running Queries

Hopefully at this point you have your data in SingleStore. You can check this by running some basic SELECT statements within the SQL Editor. Next, you may want to try out some queries that you were running with your last database. Generally, you will want to run queries twice to get a true understanding of the runtime as the first run must create and cache the query plan.

Visual Explain

One great feature of SingleStore Helios is Visual Explain. If you encounter a situation in which your query is taking longer than expected, highlight the desired query in the SQL Editor and click the Visual Explain icon to the left of the Run button. The icon resembles a tree. After clicking, you may then choose between EXPLAIN and PROFILE.

Once you identify a bottleneck, you should be able to make changes either to your schema or to your query itself in order to improve speed.

You can manually (non-visually) run EXPLAIN or PROFILE from any client; see the links above for details on the commands.

Benchmarking

At SingleStore, we've developed an easy-to-use tool for benchmarking called dbbench. You can check that out here.

Once you've installed the packages to your host machine, you can walk through this tutorial. All you will have to do is change the host from 127.0.0.1 to the endpoint listed for your workspace in the Cloud Portal (it should look something like: svc-xxx-dml.aws-virginia-1.db.memsql.com). The port will remain 3306, as listed. You can then move onto testing based on your application workload.

Writing Back to Your Application

In the Sample Architectures section, we shared some general practices that our users have employed when augmenting their data warehouse environment with SingleStore. Given the nature of data warehouses and how they may serve other applications, you may want to write your data back to them even after putting them in SingleStore. We understand that, and have methods to do so.

The best way to write to a cloud data warehouse from SingleStore is again via object storage (i.e., S3, GCS, Azure Blob). SingleStore Helios enables users to SELECT… INTO… in order to take the output of a given query and drop it into object storage. This can also be bundled within a stored procedure for continuous updates. See SELECT for more information.

Connecting Your BI Tools

You’ve likely already selected an analytics partner, so we won’t go too deep into the similarities and differences between them. Based on our experience with BI and analytics technologies, we’ve linked to detailed guides below on how to get started on what to look out for when connecting them to SingleStore.

In these examples, you’ll be using your endpoint here provided in the Cloud Portal, along with the credentials you used when spinning up your workspace. You will want to make sure that the inbound IP whitelist within the Firewall section of the Cloud Portal allows communication between your BI and SingleStore Helios environments.

Tableau

Documentation: Connect with Tableau Desktop

Notes: Tableau has a limitation of auto updates at most one per hour, so you may not see your real-time data updating within your dashboard by the second. However, you can manually refresh in Tableau to see freshly ingested data.

Power BI

Documentation: Connect with Power BI

Notes: Make sure you are using DirectQuery mode to truly harness the power of SingleStore and all of the data you have stored. Otherwise, you will be using Import mode to bring only a subset of data into PowerBI. This will also help you harness the power of your real-time data.

Looker

Documentation: Connect with Looker

Notes: Looker allows dashboards to update up to the second, so be sure to watch the video in (b) to see how to configure that. This is very beneficial to showcase the speed of data ingest with SingleStore.

Additional Analytics and BI Tools

Connect to Analytics and BI Tools

Last modified: October 28, 2024

Was this article helpful?