Skip to main content

Step 3: Create data generator functionality

The following are several functions used to generate data. The utility function, marketcap_to_DECIMAL, is used by the seed_trades stored procedure to convert a string value for the market capitalization of a stock into a decimal value.

DELIMITER //
CREATE OR REPLACE FUNCTION marketcap_to_DECIMAL(s VARCHAR(15))
  RETURNS DECIMAL(18,2) AS
DECLARE
  m CHAR(1) = SUBSTR(s, LENGTH(s), 1); -- M or B
  raw_v DECIMAL(18,2) = SUBSTR(s, 2, LENGTH(s) - 1);
  v DECIMAL(18,2) = NULL;
BEGIN
  IF m = "B" THEN
    v = raw_v * 1000;
  ELSE
    v = raw_v;
  END IF;
  RETURN v;
END //
DELIMITER ;

The seed_trades stored procedure generates rows of seed trade data based on an integer input value. This provides a query to rank the companies by market capitalization limited to the top 200.

The stored procedure generates approximately the same number of trades for each company.

DELIMITER //
CREATE OR REPLACE PROCEDURE seed_trades(num_trades INT) RETURNS INT AS
DECLARE
  ranked_companies ARRAY(RECORD(symbol CHAR(5), _rank INT));
DECLARE
  q QUERY(symbol CHAR(5), _rank INT) =
    SELECT symbol, rank() OVER (ORDER BY marketcap_to_DECIMAL(market_cap)) AS _rank
    FROM company
    WHERE LENGTH(symbol) < 5
    ORDER BY _rank DESC LIMIT 200;
  i INT = 0;
  rank_num INT;
  next_id INT = 1;
  sym CHAR(5);
  price_base DECIMAL(18,4);
  current_prices ARRAY(INT);
  l ARRAY(RECORD(symbol CHAR(5), _rank INT));
BEGIN
  l = collect(q);
  FOR r IN l LOOP
    i += 1;
    rank_num = r._rank;
    sym = r.symbol;
    price_base = FLOOR(rand() * 50) + 50;
    FOR j IN 1..((rank_num / 10) + RAND() * 10) LOOP
      INSERT trade VALUES(
        next_id,
        sym,
        FLOOR(1 + RAND() * 10) * 100, -- shares
        price_base,      -- share_price
        DATE_ADD(NOW(), INTERVAL RAND() * 6 HOUR)); -- random time during trading day, roughly
      next_id += 1;
      IF next_id > num_trades THEN RETURN(next_id); END IF;
    END LOOP;
  END LOOP;
  RETURN(next_id);
END //
DELIMITER ;

This last stored procedure, iter_stocks, generates additional trade events using a random walk to determine the next price for each company. It will store the current price for each company, then add a random number between -1 and 1 to set the new price for each iteration.

Each iteration inserts new records equal to num_shares. When iter_stocks completes, the total number of records in the trade table will be equal to num_shares * iterations + num_shares.

DELIMITER //
CREATE OR REPLACE PROCEDURE iter_stocks(iterations INT) as
DECLARE
    tickers ARRAY(CHAR(5));
    prices ARRAY(DECIMAL(18,4));
    last_ids ARRAY(bigINT);
    counts ARRAY(INT);
    next_id bigINT = 1;
    ticker CHAR(5);
    price DECIMAL(18,4);
    c INT;
    rand DECIMAL(18,4);
    tickers_q QUERY(t CHAR(5), p DECIMAL(18,4), lid BIGINT, c INT) = SELECT stock_symbol, share_price, MIN(id), COUNT(*) FROM trade GROUP BY stock_symbol;
    q ARRAY(RECORD(t CHAR(5), p DECIMAL(18,4), lid bigINT, c INT));
    q_count QUERY(c INT) = SELECT COUNT(*) FROM trade;
    total_c INT;
BEGIN
    q = COLLECT(tickers_q);
    tickers = CREATE_ARRAY(LENGTH(q));
    prices = CREATE_ARRAY(LENGTH(q));
    last_ids = CREATE_ARRAY(LENGTH(q));
    counts = CREATE_ARRAY(LENGTH(q));
    total_c = SCALAR(q_count);
    FOR r IN 0..LENGTH(q)-1 LOOP
        tickers[r] = q[r].t;
        prices[r] = q[r].p;
        last_ids[r] = q[r].lid;
        counts[r] = q[r].c;
    END LOOP;
    FOR j IN 0..(iterations-1) LOOP
        FOR i IN 0..LENGTH(tickers)-1 LOOP
            ticker = tickers[i];
            price = prices[i];
            next_id = last_ids[i];
            c = counts[i];
            rand = POW(-1, FLOOR(RAND()*2)) * RAND();
            INSERT INTO trade
            SELECT id + total_c, stock_symbol, shares, share_price + rand, trade_time FROM trade WHERE stock_symbol = ticker AND id >= next_id;
            prices[i] = price + rand;
            last_ids[i] = next_id + total_c;
        END LOOP;
    END LOOP;
END //
DELIMITER ;