Analyzing Time Series Data
On this page
Time-series data consists of sequences of events, where each event includes a timestamp.
Storing Time Series Data
SingleStore stores time-series data in rowstore or columnstore tables.DATETIME(6) to define the time attribute.DATETIME.TIMESTAMP because timestamp values are limited to dates through 2038.DATETIME(6) or DATETIME avoids future application changes related to this limitation.NOW().
The following example creates a table that stores time-series events generated by a wind turbine.
CREATE TABLE turbine_reading(tid int NOT NULL, -- turbine IDts datetime(6) NOT NULL,rpm double,temperature double,vibration double,output double,wind_direction double,wind_speed double,SHARD(tid),SORT KEY(ts));
This table uses a columnstore table by default.
If a time-series table contains a large number of attributes and queries frequently retrieve individual rows, consider adding a column group index to the columnstore table.
Consider using a rowstore table if either of the following conditions applies:
-
Your workload has a high update rate and the entire data set fits in available memory.
-
You need low-latency point-in-time lookups using
AS OFqueries.Refer to Finding a Row current AS OF a Point in Time for more information. Rowstore query processing can efficiently locate a small number of rows within a narrow time range and return rows with many attributes more efficiently than columnstore processing for this access pattern.
SingleStore recommends creating a SORT KEY on the ts column because time-series queries filter on ranges of timestamp values.SORT KEY stores the data in timestamp order, which allows the query engine to process range filters efficiently through segment elimination.KEY on ts creates an index that enables efficient processing of range filters.
For time-series data sets with a large number of attributes, where applications retrieve all table attributes and the data fits within the available table RAM, use a rowstore table to store the data.
Descriptive Data
Store descriptive properties that remain unchanged across time-series events in a separate table.
CREATE REFERENCE TABLE turbine(tid int,name varchar(60),model varchar(60),max_output double,latitude double,longitude double,PRIMARY KEY(tid));
Use a reference table for small collections of descriptive properties.
The descriptive data table described in the previous section serves as a dimension table that links to the fact table containing the time-series events.
The following example uses the sample turbine data in the table turbine:
INSERT INTO turbine VALUES(1, 'Hood River A', 'Volkswind Mega 5', 5.0, 47.130, 113.187),(2, 'Hood River B', 'Volkswind Mega 5+', 5.3, 47.141, 113.199);
Ingesting Time Series Data
SingleStore supports multiple methods for ingesting time-series data.
-
Pipelines: Use pipelines for continuous ingestion from files or Kafka topics and for high-volume data loads.
-
LOAD DATA: UseLOAD DATAfor bulk loading historical time-series data from files. -
INSERT: UseINSERTfor ingesting individual rows or small batches directly from an application.
You can use any combination of these methods.
Time Series Functions
SingleStore provides a set of built-in functions designed to work together for time series analysis:
|
Function |
Type |
Purpose |
|---|---|---|
|
Aggregate |
Returns the value associated with the earliest timestamp in a group | |
|
Aggregate |
Returns the value associated with the latest timestamp in a group | |
|
Scalar |
Groups timestamps into fixed-width intervals |
In addition to time series functions, you can use SERIES TIMESTAMP clause, used in CREATE TABLE command to designate a table column as the default timestamp for time series functions.
The following example demonstrates the use of FIRST, LAST, TIME_, and SERIES TIMESTAMP in a single query that downsamples 5-minute candlestick bars from raw trade data:
CREATE TABLE tick (ts DATETIME(6) SERIES TIMESTAMP,symbol VARCHAR(5),price NUMERIC(18,4),volume INT,SORT KEY (ts),SHARD KEY (symbol));INSERT INTO tick VALUES('2024-01-15 09:30:15.000000', 'AAPL', 185.20, 1500),('2024-01-15 09:31:45.000000', 'AAPL', 185.45, 800),('2024-01-15 09:33:20.000000', 'AAPL', 185.10, 1200),('2024-01-15 09:35:50.000000', 'AAPL', 185.80, 2000),('2024-01-15 09:37:10.000000', 'AAPL', 186.00, 900),('2024-01-15 09:38:30.000000', 'AAPL', 185.90, 1100),('2024-01-15 09:40:05.000000', 'AAPL', 186.20, 1800),('2024-01-15 09:42:45.000000', 'AAPL', 186.50, 700),('2024-01-15 09:44:15.000000', 'AAPL', 186.10, 1300);SELECT TIME_BUCKET('5m') AS bucket,symbol,FIRST(price) AS open,LAST(price) AS close,MAX(price) AS high,MIN(price) AS low,SUM(volume) AS total_volumeFROM tickWHERE symbol = 'AAPL'AND ts >= '2024-01-15 09:30:00'AND ts < '2024-01-15 16:00:00'GROUP BY bucket, symbolORDER BY bucket;
+----------------------------+--------+----------+----------+----------+----------+--------------+
| bucket | symbol | open | close | high | low | total_volume |
+----------------------------+--------+----------+----------+----------+----------+--------------+
| 2024-01-15 09:30:00.000000 | AAPL | 185.2000 | 185.1000 | 185.4500 | 185.1000 | 3500 |
| 2024-01-15 09:35:00.000000 | AAPL | 185.8000 | 185.9000 | 186.0000 | 185.8000 | 4000 |
| 2024-01-15 09:40:00.000000 | AAPL | 186.2000 | 186.1000 | 186.5000 | 186.1000 | 3800 |
+----------------------------+--------+----------+----------+----------+----------+--------------+Querying Time Series Data
Continuing the wind turbine example from the previous section, assume the following data is inserted into the turbine_ table:
INSERT INTO turbine_reading VALUES(1, '2020-03-14 13:00:33', 10, 33, 100, 1000000, 90, 15),(1, '2020-03-14 13:00:34', 10, 33, 100, 1000000, 90, 15),(1, '2020-03-14 13:00:35', 11, 33, 105, 1050000, 91, 16),(1, '2020-03-14 13:00:36', 11, 33.1, 104, 1000000, 90, 16),(2, '2020-03-14 13:00:33', 18, 30, 170, 2000000, 0, 23),(2, '2020-03-14 13:00:34', 18, 30, 170, 2000000, 0, 23),(2, '2020-03-14 13:00:35', 18.5, 30, 176, 2050000, 0, 23.5),(2, '2020-03-14 13:00:36', 19, 30.1, 174, 2070000, 1, 23.6),(1, '2020-03-15 13:00:33', 11, 32, 99, 1010000, 45, 15.1),(1, '2020-03-15 13:00:34', 11, 32, 99, 1020000, 45, 15.2),(1, '2020-03-15 13:00:35', 12, 32.1, 101, 1030000, 45, 15.2),(1, '2020-03-15 13:00:36', 13, 32.15, 102, 1030000, 46, 15.2);
The following query demonstrates how to calculate the average of all time-series values in the table.
-- average RPM by turbineSELECT tid, AVG(rpm)FROM turbine_readingGROUP BY tid;
+-----+----------+
| tid | AVG(rpm) |
+-----+----------+
| 2 | 18.375 |
| 1 | 11.125 |
+-----+----------+Time Bucketing
Use the TIME_ function to group timestamps into fixed intervals and return the start time of each bucket.
The following example uses TIME_ to group time-series data into 5-day intervals and calculate the average value for each interval:
SELECT tid, TIME_BUCKET("5d", ts), AVG(output) FROM turbine_reading GROUP BY 1, 2 ORDER BY 1, 2;
+-----+----------------------------+-------------+
| tid | TIME_BUCKET("5d", ts) | AVG(output) |
+-----+----------------------------+-------------+
| 1 | 2020-03-13 00:00:00.000000 | 1017500 |
| 2 | 2020-03-13 00:00:00.000000 | 2030000 |
+-----+----------------------------+-------------+To bucket data by day, one approach is to cast a high-resolution DATETIME(6) value to the DATE data type.N seconds, convert the timestamp to a unix_, divide the result by N using the DIV operator, multiply the result by N, and then convert it back to a timestamp value.N and then multiplying by N removes the remainder and produces a value that is evenly divisible by N.
-- Find high, low, and average output for each turbine, bucketed by day,-- sorted by day.SELECT tid, ts :> date, MIN(output), MAX(output), AVG(output)FROM turbine_readingGROUP by 1, 2ORDER BY 1, 2;
+-----+------------+-------------+-------------+-------------+
| tid | ts :> date | MIN(output) | MAX(output) | AVG(output) |
+-----+------------+-------------+-------------+-------------+
| 1 | 2020-03-14 | 1000000 | 1050000 | 1012500 |
| 1 | 2020-03-15 | 1010000 | 1030000 | 1022500 |
| 2 | 2020-03-14 | 2000000 | 2070000 | 2030000 |
+-----+------------+-------------+-------------+-------------+-- Find high, low, and average output for each turbine,-- bucketed by three second intervals, sorted by interval start time.SELECT tid,from_unixtime(unix_timestamp(ts) DIV 3 * 3) as ts,MIN(output), MAX(output), AVG(output)FROM turbine_readingGROUP by 1, 2ORDER BY 1, 2;
+-----+---------------------+-------------+-------------+--------------------+
| tid | ts | MIN(output) | MAX(output) | AVG(output) |
+-----+---------------------+-------------+-------------+--------------------+
| 1 | 2020-03-14 13:00:33 | 1000000 | 1050000 | 1016666.6666666666 |
| 1 | 2020-03-14 13:00:36 | 1000000 | 1000000 | 1000000 |
| 1 | 2020-03-15 13:00:33 | 1010000 | 1030000 | 1020000 |
| 1 | 2020-03-15 13:00:36 | 1030000 | 1030000 | 1030000 |
| 2 | 2020-03-14 13:00:33 | 2000000 | 2050000 | 2016666.6666666667 |
| 2 | 2020-03-14 13:00:36 | 2070000 | 2070000 | 2070000 |
+-----+---------------------+-------------+-------------+--------------------+Gap Filling
When you aggregate time-series data into buckets, some intervals may not contain data points.
For example, if you bucket turbine readings into 1-second intervals over a 10-second window, but data exists only for seconds 33 through 36, the result contains 4 rows instead of 10.
Gap filling ensures that every expected time bucket appears in the result set.
-
A reference table of time buckets: A table that contains one row per time interval and covers the full range of interest without gaps.
-
A
LEFT JOIN: A left outer join between the reference table and the aggregated data that preserves all time buckets in the output. -
NVLfor default values: A function that replacesNULLvalues (where no data exists) with a default value, such as0.
Create a table that contains one row per time interval for the required time range.time_ as a reference table to avoid any need to shuffle or broadcast during the LEFT JOIN.
CREATE REFERENCE TABLE time_days (ts DATETIME(6) NOT NULL,PRIMARY KEY (ts));
Populate the table with rows that cover the desired time range.
INSERT INTO time_days (ts) VALUES
('2020-03-14 00:00:00'),
('2020-03-15 00:00:00'),
('2020-03-16 00:00:00');Use the turbine_ table defined in Storing Time Series Data without gap filling, a standard bucketed aggregation returns only intervals where data exists:
SELECTTIME_BUCKET('1d', ts) AS bucket,tid,SUM(output) AS total_output,AVG(output) AS avg_outputFROM turbine_readingWHERE ts >= '2020-03-14 00:00:00'AND ts < '2020-03-17 00:00:00'GROUP BY bucket, tidORDER BY tid, bucket;
+----------------------------+-----+--------------+------------+
| bucket | tid | total_output | avg_output |
+----------------------------+-----+--------------+------------+
| 2020-03-14 00:00:00.000000 | 1 | 4050000 | 1012500 |
| 2020-03-15 00:00:00.000000 | 1 | 4090000 | 1022500 |
| 2020-03-14 00:00:00.000000 | 2 | 8120000 | 2030000 |
+----------------------------+-----+--------------+------------+Note that turbine 2 is missing a row for 2020-03-15, and neither turbine has a row for 2020-03-16.
The following query fills the gaps using a LEFT JOIN to the reference table:
WITH filtered AS (SELECT *FROM turbine_readingWHERE ts >= '2020-03-14 00:00:00'AND ts < '2020-03-17 00:00:00'),turbines AS (SELECT DISTINCT tidFROM filtered),agg AS (SELECTTIME_BUCKET('1d', ts) AS bucket,tid,SUM(output) AS total_output,AVG(output) AS avg_outputFROM filteredGROUP BY 1, 2),ref AS (SELECTTIME_BUCKET('1d', td.ts) AS bucket,t.tidFROM time_days tdCROSS JOIN turbines tWHERE td.ts >= '2020-03-14 00:00:00'AND td.ts < '2020-03-17 00:00:00')SELECTref.bucket,ref.tid,NVL(agg.total_output, 0) AS total_output,NVL(agg.avg_output, 0) AS avg_outputFROM refLEFT JOIN aggON agg.bucket = ref.bucketAND agg.tid = ref.tidORDER BY ref.tid, ref.bucket;
+----------------------------+-----+--------------+------------+
| bucket | tid | total_output | avg_output |
+----------------------------+-----+--------------+------------+
| 2020-03-14 00:00:00.000000 | 1 | 4050000 | 1012500 |
| 2020-03-15 00:00:00.000000 | 1 | 4090000 | 1022500 |
| 2020-03-16 00:00:00.000000 | 1 | 0 | 0 |
| 2020-03-14 00:00:00.000000 | 2 | 8120000 | 2030000 |
| 2020-03-15 00:00:00.000000 | 2 | 0 | 0 |
| 2020-03-16 00:00:00.000000 | 2 | 0 | 0 |
+----------------------------+-----+--------------+------------+Every turbine now has a row for every day in the range.
This query works as follows:
-
The
refsubquery generates every combination of time bucket and turbine ID by cross-joining the reference table (bucketed usingTIME_) with the distinct turbine IDs in the data range.BUCKET This ensures a row exists for every turbine in every time bucket in the interval. -
The
aggsubquery computes the actual aggregates (SUM,AVG) grouped by bucket and turbine. -
The
LEFT JOINpreserves all rows from ref, including the intervals without readings.The NVLfunction replaces the resultingNULLvalues with0.
Gap filling is useful in the following scenarios:
-
When visualizing time series in charts where gaps would break line continuity or mislead viewers.
-
When feeding data to monitoring or alerting systems that expect values at regular intervals.
-
When performing calculations (such as moving averages or cumulative sums) that require a complete sequence of equally-spaced data points.
-
When exporting data to external systems that do not handle missing intervals gracefully.
The gap-filling approach uses NVL to replace missing intervals with a default value of 0.
Interpolation
A time series may contain gaps that you want to fill so that a data point exists at every interval for a chosen time granularity.
You can interpolate missing data points by using a stored procedure.
DROP TABLE IF EXISTS tick_interpolation;CREATE TABLE tick_interpolation(ts datetime(6), symbol varchar(5),price numeric(18,4));INSERT INTO tick_interpolation VALUES('2019-02-18 10:55:36.000000', 'ABC', 100.00),('2019-02-18 10:55:37.000000', 'ABC', 102.00),('2019-02-18 10:55:40.000000', 'ABC', 103.00),('2019-02-18 10:55:42.000000', 'ABC', 104.00);DELIMITER //CREATE OR REPLACE PROCEDURE driver() ASDECLAREq query(ts datetime(6), symbol varchar(5), price numeric(18,4));BEGINq = SELECT ts, symbol, price FROM tick_interpolation ORDER BY ts;ECHO SELECT 'Input time series' AS message;ECHO SELECT * FROM q ORDER BY ts;ECHO SELECT 'Interpolated time series' AS message;CALL interpolate_ts(q);END //DELIMITER ;DELIMITER //CREATE OR REPLACE PROCEDURE interpolate_ts(q query(ts datetime(6), symbol varchar(5), price numeric(18,4)))-- Important: q must produce sorted output by tsASDECLAREc array(record(ts datetime(6), symbol varchar(5), price numeric(18,4)));r record(ts datetime(6), symbol varchar(5), price numeric(18,4));r_next record(ts datetime(6), symbol varchar(5), price numeric(18,4));n int;i int;_ts datetime(6); _symbol varchar(5); _price numeric(18,4);time_diff int;delta numeric(18,4);BEGINDROP TABLE IF EXISTS tmp;CREATE TEMPORARY TABLE tmp LIKE tick_interpolation;c = collect(q);n = length(c);IF n < 2 THENECHO SELECT * FROM q ORDER BY ts;return;END IF;i = 0;r = c[i];r_next = c[i + 1];WHILE (i < n) LOOP-- IF at last row THEN output it and exitIF i = n - 1 THEN_ts = r.ts; _symbol = r.symbol; _price = r.price;INSERT INTO tmp VALUES(_ts, _symbol, _price);i += 1;CONTINUE;END IF;time_diff = unix_timestamp(r_next.ts) - unix_timestamp(r.ts);IF time_diff <= 0 THENRAISE user_exception("time series not sorted or has duplicate timestamps");END IF;-- output r_ts = r.ts; _symbol = r.symbol; _price = r.price;INSERT INTO tmp VALUES(_ts, _symbol, _price);IF time_diff = 1 THENr = r_next; -- advance to next rowELSIF time_diff > 1 THEN-- output time_diff-1 rows by extending current row and interpolating pricedelta = (r_next.price - r.price) / time_diff;FOR j in 1..time_diff-1 LOOP_ts += 1; _price += delta;INSERT INTO tmp VALUES(_ts, _symbol, _price);END LOOP;r = r_next; -- advance to next rowELSERAISE user_exception("time series not sorted");END IF;i += 1;IF i < n - 1 THEN r_next = c[i + 1]; END IF;END LOOP;ECHO SELECT * FROM tmp ORDER BY ts;DROP TABLE tmp;END //DELIMITER ;
The driver() procedure produces the following output:
CALL driver();
+-------------------+
| message |
+-------------------+
| Input time series |
+-------------------+
1 row in set (0.02 sec)
+----------------------------+--------+----------+
| ts | symbol | price |
+----------------------------+--------+----------+
| 2019-02-18 10:55:36.000000 | ABC | 100.0000 |
| 2019-02-18 10:55:37.000000 | ABC | 102.0000 |
| 2019-02-18 10:55:40.000000 | ABC | 103.0000 |
| 2019-02-18 10:55:42.000000 | ABC | 104.0000 |
+----------------------------+--------+----------+
4 rows in set (0.06 sec)
+--------------------------+
| message |
+--------------------------+
| Interpolated time series |
+--------------------------+
1 row in set (0.16 sec)
+----------------------------+--------+----------+
| ts | symbol | price |
+----------------------------+--------+----------+
| 2019-02-18 10:55:36.000000 | ABC | 100.0000 |
| 2019-02-18 10:55:37.000000 | ABC | 102.0000 |
| 2019-02-18 10:55:38.000000 | ABC | 102.3333 |
| 2019-02-18 10:55:39.000000 | ABC | 102.6666 |
| 2019-02-18 10:55:40.000000 | ABC | 103.0000 |
| 2019-02-18 10:55:41.000000 | ABC | 103.5000 |
| 2019-02-18 10:55:42.000000 | ABC | 104.0000 |
+----------------------------+--------+----------+
7 rows in set (0.16 sec)The output fills the gaps between 37 and 40 seconds and between 40 and 42 seconds with linearly interpolated values.
Smoothing
Smoothing reduces noise in volatile signals so trends are easier to see.AVG as a window function to smooth time-series data.
SELECT tid, ts, output, AVG(output) OVER wFROM turbine_readingWHERE DATE(ts) = '2020-03-14'WINDOW w as (PARTITION BY tid ORDER BY tsROWS BETWEEN 1 PRECEDING AND CURRENT ROW)ORDER BY 1, 2;
+-----+----------------------------+---------+--------------------+
| tid | ts | output | AVG(output) OVER w |
+-----+----------------------------+---------+--------------------+
| 1 | 2020-03-14 13:00:33.000000 | 1000000 | 1000000 |
| 1 | 2020-03-14 13:00:34.000000 | 1000000 | 1000000 |
| 1 | 2020-03-14 13:00:35.000000 | 1050000 | 1025000 |
| 1 | 2020-03-14 13:00:36.000000 | 1000000 | 1025000 |
| 2 | 2020-03-14 13:00:33.000000 | 2000000 | 2000000 |
| 2 | 2020-03-14 13:00:34.000000 | 2000000 | 2000000 |
| 2 | 2020-03-14 13:00:35.000000 | 2050000 | 2025000 |
| 2 | 2020-03-14 13:00:36.000000 | 2070000 | 2060000 |
+-----+----------------------------+---------+--------------------+Finding a Row Current AS OF a Point in Time
A common time-series operation is to retrieve the row that is current AS OF a specific point in time.ORDER BY and LIMIT to identify that row:
-- find turbine reading for tid 1 that is current-- AS OF 2020-03-14 13:00:35.5SELECT *FROM turbine_readingWHERE ts <= '2020-03-14 13:00:35.5'AND tid = 1ORDER BY ts DESCLIMIT 1;
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| tid | ts | rpm | temperature | vibration | output | wind_direction | wind_speed |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| 1 | 2020-03-14 13:00:35.000000 | 11 | 33 | 105 | 1050000 | 91 | 16 |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+You can use EXPLAIN to view the query plan for the preceding query.ts to seek directly to the target timestamp and scan matching rows in reverse order, returning the row current as of a specified timestamp in O(log n) time.
This optimization requires a rowstore table because columnstore tables do not support index-based top-1 seeks.AS OF lookups, consider using a rowstore table instead of the default columnstore table.
You can implement equivalent AS OF queries on columnstore tables by using MAX(ts) with a window function or a correlated subquery.
To find the current row for each turbine as of a specific point in time, use the stored procedure shown in the following example.
DELIMITER //CREATE OR REPLACE PROCEDURE get_turbine_readings_as_of(_ts datetime(6))ASDECLAREq_turbines QUERY(tid int) = SELECT tid FROM turbine;a ARRAY(RECORD(tid int));_tid int;BEGINDROP TABLE IF EXISTS r;CREATE TEMPORARY TABLE r LIKE turbine_reading;a = COLLECT(q_turbines);FOR x IN a LOOP_tid = x.tid;INSERT INTO rSELECT *FROM turbine_reading tWHERE t.tid = _tidAND ts <= _tsORDER BY ts DESCLIMIT 1;END LOOP;ECHO SELECT * FROM r ORDER BY tid;DROP TABLE r;END //DELIMITER ;CALL get_turbine_readings_as_of('2020-03-14 13:00:35.5');
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| tid | ts | rpm | temperature | vibration | output | wind_direction | wind_speed |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| 1 | 2020-03-14 13:00:35.000000 | 11 | 33 | 105 | 1050000 | 91 | 16 |
| 2 | 2020-03-14 13:00:35.000000 | 18.5 | 30 | 176 | 2050000 | 0 | 23.5 |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+Managing the Life Cycle of Time Series Data
Manage the lifecycle of time-series data by moving older data from a rowstore table to a columnstore table when the data grows beyond available memory.DELETE statement.DELETE of large sets of rows is extremely fast in SingleStore, unlike legacy B-tree based database systems.
Related Topics
-
Training: Time Series Data Capture and Analysis
-
SingleStore blog on time series: It’s About Time: Getting More from Your Time-Series Data With SingleStoreDB Self-Managed 7.
0.
Last modified: