Analyzing Time Series Data
On this page
Time series data describe sequences of events, with each event labeled with a timestamp.
Ingesting Time Series Data
For bulk loading of historical collections of time series data, use the LOAD DATA command.
Managing the Life Cycle of Time Series Data
You can manage the life cycle of time series data by first moving it from a row store table to a column store table as it ages if the data becomes larger than available memory, and then ultimately removing data that is no longer needed using the DELETE
statement.
Interpolation
You may have a time series with gaps that you wish to fill, so that there is a data point at every point in time using your chosen time granularity.
You can interpolate missing points using a stored procedure.
DROP TABLE IF EXISTS tick;CREATE TABLE tick(ts datetime(6), symbol varchar(5),price numeric(18,4));INSERT INTO tick VALUES('2019-02-18 10:55:36.000000', 'ABC', 100.00),('2019-02-18 10:55:37.000000', 'ABC', 102.00),('2019-02-18 10:55:40.000000', 'ABC', 103.00),('2019-02-18 10:55:42.000000', 'ABC', 104.00);DELIMITER //CREATE OR REPLACE PROCEDURE driver() ASDECLAREq query(ts datetime(6), symbol varchar(5), price numeric(18,4));BEGINq = SELECT ts, symbol, price FROM tick ORDER BY ts;ECHO SELECT 'Input time series' AS message;ECHO SELECT * FROM q ORDER BY ts;ECHO SELECT 'Interpolated time series' AS message;CALL interpolate_ts(q);END //DELIMITER ;DELIMITER //CREATE OR REPLACE PROCEDURE interpolate_ts(q query(ts datetime(6), symbol varchar(5), price numeric(18,4)))-- Important: q must produce sorted output by tsASDECLAREc array(record(ts datetime(6), symbol varchar(5), price numeric(18,4)));r record(ts datetime(6), symbol varchar(5), price numeric(18,4));r_next record(ts datetime(6), symbol varchar(5), price numeric(18,4));n int;i int;_ts datetime(6); _symbol varchar(5); _price numeric(18,4);time_diff int;delta numeric(18,4);BEGINDROP TABLE IF EXISTS tmp;CREATE TEMPORARY TABLE tmp LIKE tick;c = collect(q);n = length(c);IF n < 2 THENECHO SELECT * FROM q ORDER BY ts;return;END IF;i = 0;r = c[i];r_next = c[i + 1];WHILE (i < n) LOOP-- IF at last row THEN output it and exitIF i = n - 1 THEN_ts = r.ts; _symbol = r.symbol; _price = r.price;INSERT INTO tmp VALUES(_ts, _symbol, _price);i += 1;CONTINUE;END IF;time_diff = unix_timestamp(r_next.ts) - unix_timestamp(r.ts);IF time_diff <= 0 THENRAISE user_exception("time series not sorted or has duplicate timestamps");END IF;-- output r_ts = r.ts; _symbol = r.symbol; _price = r.price;INSERT INTO tmp VALUES(_ts, _symbol, _price);IF time_diff = 1 THENr = r_next; -- advance to next rowELSIF time_diff > 1 THEN-- output time_diff-1 rows by extending current row and interpolating pricedelta = (r_next.price - r.price) / time_diff;FOR j in 1..time_diff-1 LOOP_ts += 1; _price += delta;INSERT INTO tmp VALUES(_ts, _symbol, _price);END LOOP;r = r_next; -- advance to next rowELSERAISE user_exception("time series not sorted");END IF;i += 1;IF i < n - 1 THEN r_next = c[i + 1]; END IF;END LOOP;ECHO SELECT * FROM tmp ORDER BY ts;DROP TABLE tmp;END //DELIMITER ;
The output of the driver()
procedure is as follows:
CALL driver();
+-------------------+
| message |
+-------------------+
| Input time series |
+-------------------+
1 row in set (0.02 sec)
+----------------------------+--------+----------+
| ts | symbol | price |
+----------------------------+--------+----------+
| 2019-02-18 10:55:36.000000 | ABC | 100.0000 |
| 2019-02-18 10:55:37.000000 | ABC | 102.0000 |
| 2019-02-18 10:55:40.000000 | ABC | 103.0000 |
| 2019-02-18 10:55:42.000000 | ABC | 104.0000 |
+----------------------------+--------+----------+
4 rows in set (0.06 sec)
+--------------------------+
| message |
+--------------------------+
| Interpolated time series |
+--------------------------+
1 row in set (0.16 sec)
+----------------------------+--------+----------+
| ts | symbol | price |
+----------------------------+--------+----------+
| 2019-02-18 10:55:36.000000 | ABC | 100.0000 |
| 2019-02-18 10:55:37.000000 | ABC | 102.0000 |
| 2019-02-18 10:55:38.000000 | ABC | 102.3333 |
| 2019-02-18 10:55:39.000000 | ABC | 102.6666 |
| 2019-02-18 10:55:40.000000 | ABC | 103.0000 |
| 2019-02-18 10:55:41.000000 | ABC | 103.5000 |
| 2019-02-18 10:55:42.000000 | ABC | 104.0000 |
+----------------------------+--------+----------+
7 rows in set (0.16 sec)
The gaps between 37 and 40 seconds and 40 and 42 seconds have been filled in with data points that are linearly interpolated.
Storing Time Series Data
Time series data can be stored in SingleStore using rowstore or columnstore tables.datetime(6)
.datetime
attribute if resolution of fractional seconds for timestamps is not needed.timestamp
data types for time series information because they are automatically updated by the system, and typically you will want to have your application provide the timestamp directly.ts
for the time attribute.
Here’s an example of a table created to hold a time series for events coming from a wind turbine.
CREATE ROWSTORE TABLE turbine_reading(tid int NOT NULL, -- turbine IDts datetime(6) NOT NULL,rpm double,temperature double,vibration double,output double,wind_direction double,wind_speed double,SHARD(tid),KEY(ts));
This table is a rowstore.
SingleStore recommends to create a KEY on the timestamp column ts
since it is common to query time series data by filtering on ranges of values.ts
so that range filters on ts
can be processed efficiently.ts
for a columnstore table will cause the table to be kept in order by ts
so range filters on ts
can also be processed efficiently using segment elimination.
For time series data sets with large numbers of attributes, where it is very common to retrieve all attributes of a table in the application, and the data is not bigger than the available table RAM you can provide, use a rowstore table to store time series data.
The tables used to store time series events are very similar to fact tables used in data warehouses and data marts.
Descriptive Data
For descriptive property information about time series elements that is static from one element to the next, it’s recommended to normalize this information into another table.
CREATE REFERENCE TABLE turbine(tid int,name varchar(60),model varchar(60),max_output double,lattitude double,longitude double,PRIMARY KEY(tid));
For small collections of descriptive properties, use a reference table.
A descriptive data table like the one described above can be thought of as a dimension table that is linked to the fact table containing the time series events.
In examples below, we’ll use the following data in the table turbine
:
INSERT INTO turbine VALUES(1, 'Hood River A', 'Volkswind Mega 5', 5.0, 47.130, 113.187),(2, 'Hood River B', 'Volkswind Mega 5+', 5.3, 47.141, 113.199);
Querying Time Series Data
Continuing the wind turbine example from above, suppose the following data is added to the turbine_
table:
INSERT turbine_reading VALUES(1, '2020-03-14 13:00:33', 10, 33, 100, 1000000, 90, 15),(1, '2020-03-14 13:00:34', 10, 33, 100, 1000000, 90, 15),(1, '2020-03-14 13:00:35', 11, 33, 105, 1050000, 91, 16),(1, '2020-03-14 13:00:36', 11, 33.1, 104, 1000000, 90, 16),(2, '2020-03-14 13:00:33', 18, 30, 170, 2000000, 0, 23),(2, '2020-03-14 13:00:34', 18, 30, 170, 2000000, 0, 23),(2, '2020-03-14 13:00:35', 18.5, 30, 176, 2050000, 0, 23.5),(2, '2020-03-14 13:00:36', 19, 30.1, 174, 2070000, 1, 23.6),(1, '2020-03-15 13:00:33', 11, 32, 99, 1010000, 45, 15.1),(1, '2020-03-15 13:00:34', 11, 32, 99, 1020000, 45, 15.2),(1, '2020-03-15 13:00:35', 12, 32.1, 101, 1030000, 45, 15.2),(1, '2020-03-15 13:00:36', 13, 32.15, 102, 1030000, 46, 15.2);
The following query illustrates how to compute a simple average aggregate over all time series values in the table.
-- average RPM by turbineSELECT tid, AVG(rpm)FROM turbine_readingGROUP BY tid;
+-----+----------+
| tid | AVG(rpm) |
+-----+----------+
| 2 | 18.375 |
| 1 | 11.125 |
+-----+----------+
Time Bucketing
The following queries illustrate how to perform time bucketing
to aggregate and group data for different time series by a fixed time interval.datetime(6)
value to a date
type.epoch
), dividing the result by N with the integer division operator DIV
, then multiplying again by N, and converting back to a timestamp value.DIV
by N and then multiplying by N returns a number divisible by N; the remainder is eliminated.
-- Find high, low, and average output for each turbine, bucketed by day,-- sorted by day.SELECT tid, ts :> date, MIN(output), MAX(output), AVG(output)FROM turbine_readingGROUP by 1, 2ORDER BY 1, 2;
+-----+------------+-------------+-------------+-------------+
| tid | ts :> date | MIN(output) | MAX(output) | AVG(output) |
+-----+------------+-------------+-------------+-------------+
| 1 | 2020-03-14 | 1000000 | 1050000 | 1012500 |
| 1 | 2020-03-15 | 1010000 | 1030000 | 1022500 |
| 2 | 2020-03-14 | 2000000 | 2070000 | 2030000 |
+-----+------------+-------------+-------------+-------------+
-- Find high, low, and average output for each turbine,-- bucketed by three second intervals, sorted by interval start time.SELECT tid,from_unixtime(unix_timestamp(ts) DIV 3 * 3) as ts,MIN(output), MAX(output), AVG(output)FROM turbine_readingGROUP by 1, 2ORDER BY 1, 2;
+-----+---------------------+-------------+-------------+--------------------+
| tid | ts | MIN(output) | MAX(output) | AVG(output) |
+-----+---------------------+-------------+-------------+--------------------+
| 1 | 2020-03-14 13:00:33 | 1000000 | 1050000 | 1016666.6666666666 |
| 1 | 2020-03-14 13:00:36 | 1000000 | 1000000 | 1000000 |
| 1 | 2020-03-15 13:00:33 | 1010000 | 1030000 | 1020000 |
| 1 | 2020-03-15 13:00:36 | 1030000 | 1030000 | 1030000 |
| 2 | 2020-03-14 13:00:33 | 2000000 | 2050000 | 2016666.6666666667 |
| 2 | 2020-03-14 13:00:36 | 2070000 | 2070000 | 2070000 |
+-----+---------------------+-------------+-------------+--------------------+
You can use the TIME_
The following example uses TIME_
to find the average time series value grouped by 5 day intervals:
SELECT tid, TIME_BUCKET("5d", ts), AVG(output) FROM turbine_reading GROUP BY 1, 2 ORDER BY 1, 2;
+-----+----------------------------+-------------+
| tid | TIME_BUCKET("5d", ts) | AVG(output) |
+-----+----------------------------+-------------+
| 1 | 2020-03-13 00:00:00.000000 | 1017500 |
| 2 | 2020-03-13 00:00:00.000000 | 2030000 |
+-----+----------------------------+-------------+
Smoothing
Time series can be smoothed using AVG
as a windowed aggregate.output
and the moving average of output
over a two-element window, on a specified date.
SELECT tid, ts, output, AVG(output) OVER wFROM turbine_readingWHERE DATE(ts) = '2020-03-14'WINDOW w as (PARTITION BY tid ORDER BY tsROWS BETWEEN 1 PRECEDING AND CURRENT ROW)ORDER BY 1, 2;
+-----+----------------------------+---------+--------------------+
| tid | ts | output | AVG(output) OVER w |
+-----+----------------------------+---------+--------------------+
| 1 | 2020-03-14 13:00:33.000000 | 1000000 | 1000000 |
| 1 | 2020-03-14 13:00:34.000000 | 1000000 | 1000000 |
| 1 | 2020-03-14 13:00:35.000000 | 1050000 | 1025000 |
| 1 | 2020-03-14 13:00:36.000000 | 1000000 | 1025000 |
| 2 | 2020-03-14 13:00:33.000000 | 2000000 | 2000000 |
| 2 | 2020-03-14 13:00:34.000000 | 2000000 | 2000000 |
| 2 | 2020-03-14 13:00:35.000000 | 2050000 | 2025000 |
| 2 | 2020-03-14 13:00:36.000000 | 2070000 | 2060000 |
+-----+----------------------------+---------+--------------------+
Finding a Row Current AS OF a Point in Time
A common operation on time series data is to find the row that is current AS OF a point in time.ORDER BY
and LIMIT
as follows.
-- find turbine reading for tid 1 that is current-- AS OF 2020-03-14 13:00:35.5SELECT *FROM turbine_readingWHERE ts <= '2020-03-14 13:00:35.5'AND tid = 1ORDER BY ts DESCLIMIT 1;
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| tid | ts | rpm | temperature | vibration | output | wind_direction | wind_speed |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| 1 | 2020-03-14 13:00:35.000000 | 11 | 33 | 105 | 1050000 | 91 | 16 |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
You can use EXPLAIN
to see the query plan for the query above.ts
and scans in reverse order.
To find the current row for each turbine as of a specific point in time, you can use a stored procedure, as shown below.
DELIMITER //CREATE OR REPLACE PROCEDURE get_turbine_readings_as_of(_ts datetime(6))ASDECLAREq_turbines QUERY(tid int) = SELECT tid FROM turbine;a ARRAY(RECORD(tid int));_tid int;BEGINDROP TABLE IF EXISTS r;CREATE TEMPORARY TABLE r LIKE turbine_reading;a = COLLECT(q_turbines);FOR x IN a LOOP_tid = x.tid;INSERT INTO rSELECT *FROM turbine_reading tWHERE t.tid = _tidAND ts <= _tsORDER BY ts DESCLIMIT 1;END LOOP;ECHO SELECT * FROM r ORDER BY tid;DROP TABLE r;END //DELIMITER ;CALL get_turbine_readings_as_of('2020-03-14 13:00:35.5');
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| tid | ts | rpm | temperature | vibration | output | wind_direction | wind_speed |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
| 1 | 2020-03-14 13:00:35.000000 | 11 | 33 | 105 | 1050000 | 91 | 16 |
| 2 | 2020-03-14 13:00:35.000000 | 18.5 | 30 | 176 | 2050000 | 0 | 23.5 |
+-----+----------------------------+------+-------------+-----------+---------+----------------+------------+
Supplemental Material
-
Training: Time Series Data Capture and Analysis
-
Additional time series examples are given in the following SingleStore blog on time series: What SingleStore Can Do For Time Series Applications.
These examples include a method for creating candlestick charts with window functions, a general function for convenient time bucketing, and FIRST and LAST
LAST user-defined aggregate functions that can be used as regular aggregates, not just window functions.
Last modified: March 8, 2024