SingleStore DB

Working with Geospatial Features

SingleStore has added support for geospatial queries. The term geospatial covers a very broad range of features. There is a formal standardization for some of those features, known as SQL-MM or OpenGIS. The features in SingleStore are not a complete implementation of that standard, but follow the same structure.

At a basic level, geospatial support requires three things: datatypes, topological functions, and measurement functions.

Geospatial Types
Figure 9. image
image

The three main geospatial object types are points, paths, and polygons. We support a subset of the standard Well-Known Text (WKT) syntax to describe them. For example, a POINT is a simple longitude / latitude pair. Note the lack of a comma between the numbers.

"POINT(-74.044514 40.689244)"

A linestring or path object is an ordered list of points. The points (vertices) in a linestring (or polygon) object are connected by edges.

"LINESTRING(0 0, 0 1, 1 1)"

A polygon is similar to a linestring, except that it is closed. That is, it must have at least three unique points, and the first and last point-pairs must be equal:

"POLYGON((0 0, 0 1, 1 1, 0 0))"

You can also describe multiple shapes (or rings) inside a polygon object, i.e., a polygon object can also be represented as a set of rings. For example, a set of islands, or a lake inside a country.

Each ring is represented as a set of points. The first ring (external ring) in the set represents the shape of the polygon. The subsequent rings in the set (internal rings) represent the holes in the shape. The rings within the boundary of a polygon (internal rings) can never cross. They may however intersect at a point, but only as a tangent. Each respective shape embedded within a polygon object is represented using the following notation.

"POLYGON(Ring 0, Ring 1, ... Ring N)"

Note the parenthesis around pairs of numbers, which are separated by commas.

"POLYGON((0 0, 0 10, 10 10, 10 0, 0 0), (5 5, 6 5, 5 6, 5 5))"

In the above example, rings (0 0, 0 10, 10 10, 10 0, 0 0) and (5 5, 6 5, 5 6, 5 5) represent Ring 0 and Ring 1, respectively.

All geospatial functions can accept either a geospatial object or the equivalent WKT string as arguments. SingleStore does not support the MULTIPOLYGON, GEOMETRY, CURVE, or other keywords.



Creating a Geospatial Table

Geospatial objects are just another data type in SingleStore and can be used right alongside numbers, strings, and JSON. To create a column that holds point data, use the GeographyPoint type. For path and polygon data, use the Geography type.

Important

It is possible to store POINTs in the Geography type, but is not recommended. The GeographyPoint type was designed specifically for point data, and is highly optimized for memory efficiency and speed.

In this example, we’ll create a table of neighborhoods. The centroid field is the point that is roughly in the middle of the borders of the neighborhood. The shape field is the polygon describing those borders. All geospatial data can be indexed for very high query performance.

CREATE TABLE neighborhoods (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  name VARCHAR(64) NOT NULL,
  population INT UNSIGNED NOT NULL,
  shape GEOGRAPHY NOT NULL,
  centroid GEOGRAPHYPOINT NOT NULL,

  index (shape) WITH (resolution = 8),
  index (centroid)
  );

  INSERT INTO neighborhoods VALUES
  (1, "Hamilton", 12346, "POLYGON((1 1,2 1,2 2, 1 2, 1 1))","POINT(1.5 1.5)"),
  (2, "Probability Bay", 263951, "POLYGON((5 1,6 1,6 2,5 2,5 1))", "POINT(5.5 1.5)"),
  (3, "Decentralized Park", 29265, "POLYGON((5 5,6 5,6 6,5 6,5 5))", "POINT(5.5 5.5)"),
  (4, "Axiom Township", 845696, "POLYGON((1 5,2 5,2 6,1 6,1 5))", "POINT(1.5 5.5)"),
  (5, "Elegant Island ", 987654, "POLYGON((3 3,4 3,4 4,3 4,3 3))", "POINT(3.5 3.5)");
Querying Geospatial Data
Figure 10. image
image


SingleStore provides a SQL function for each of these types of queries:

You can run queries over the neighborhood data to see how they relate to each other spatially. For example, you can join the table to itself and measure distances between neighborhoods:

SELECT b.name AS town,
  ROUND(GEOGRAPHY_DISTANCE(a.centroid, b.centroid), 0) AS distance_from_center,
  ROUND(GEOGRAPHY_DISTANCE(a.shape, b.shape), 0) AS distance_from_border
  FROM neighborhoods a, neighborhoods b
  WHERE a.id = 2
  ORDER BY 2;

  +--------------------+----------------------+----------------------+
  | town               | distance_from_center | distance_from_border |
  +--------------------+----------------------+----------------------+
  | Probability Bay    |                    0 |                    0 |
  | Elegant Island     |               314174 |               157090 |
  | Hamilton           |               444379 |               333195 |
  | Decentralized Park |               444531 |               333399 |
  | Axiom Township     |               628012 |               471030 |
  +--------------------+----------------------+----------------------+
  5 rows in set (0.00 sec)

Important

Distance is calculated using the standard metric for distance on a sphere.

You can also find out where you are with GEOGRAPHY_INTERSECTS:

SELECT id, name FROM neighborhoods WHERE
  GEOGRAPHY_INTERSECTS("POINT(2 1)", shape);

  +----+----------+
  | id | name     |
  +----+----------+
  |  1 | Hamilton |
  +----+----------+
  1 row in set (0.00 sec)
Spatial Joins

Geospatial objects are first-class datatypes in SingleStore, which means that you can use spatial relationships to join tables. For example, to find all of the businesses in your neighborhood, you join the businesses table with the neighborhoods table on whether the business location (a point) is contained by the neighborhood’s shape (a polygon).

Create the table:

CREATE TABLE businesses (
    id INT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(64) NOT NULL,
    revenue INT UNSIGNED NOT NULL,
    location GEOGRAPHYPOINT NOT NULL,
    INDEX (location)
);

INSERT INTO businesses VALUES
   (1,"Touring Machines Bike Repair", 10000, "POINT(1.25 1.25)"),
   (2,"Manhattan Boat Club", 20000, "POINT(5.75 5.25)"),
   (3,"The Philospher's Diner", 12345, "POINT(1.25 5.25)"),
   (4,"Push & Pop Cafe", 23456, "POINT(3.25 3.75)"),
   (5,"Lil Mac's Subs", 76543, "POINT(5.75 5.75)"),
   (6,"Touring Machines Bike Repair", 21012, "POINT(5.25 1.75)"),
   (7,"Random Slice Pizza", 11111, "POINT(1.75 1.75)")
;

Find businesses contained within a specific neighborhood:

SELECT b.id, b.name FROM businesses b, neighborhoods n
  WHERE n.name="Elegant Island" AND
  GEOGRAPHY_CONTAINS(n.shape, b.location);

  +----+-----------------+
  | id | name            |
  +----+-----------------+
  |  4 | Push & Pop Cafe |
  +----+-----------------+
  1 row in set (0.00 sec)
SELECT b.id, b.name FROM businesses b, neighborhoods n
  WHERE n.name="Hamilton" AND
  GEOGRAPHY_CONTAINS(n.shape, b.location)
  ORDER BY 1;

  +----+------------------------------+
  | id | name                         |
  +----+------------------------------+
  |  1 | Touring Machines Bike Repair |
  |  7 | Random Slice Pizza           |
  +----+------------------------------+
  2 rows in set (0.03 sec)

These functions and many more are documented in Geospatial Functions.

Geospatial Model

SingleStore uses a spherical model similar to that used in Google Earth. It assumes a perfectly spherical Earth with a radius of 6,367,444.66 meters. The real Earth is not a sphere. It resembles an ellipsoid slightly wider than it is tall, with many local variations. In a spherical model, measurements of distance and area may vary from reality. For example, a 10,000 meter line drawn vertically over San Francisco would measure about 10,020 meters in this model. Topological functions, which describe the relationships between shapes and points, less affected by this difference.

Columnstore Geospatial

In SingleStore, geospatial data and functions are supported in columnstore (i.e. on-disk) tables. Columnstore tables do not support indexes on geo data. The native GeographyPoint type is supported but the Geography type is not. However, you can store path and polygon data in a plain text field, and run geospatial functions over them just as you do with rowstore (in-memory) tables.

CREATE TABLE neighborhoods_colstore(
  id INT UNSIGNED NOT NULL,
  name VARCHAR(64) NOT NULL,
  population INT UNSIGNED NOT NULL,
  shape TEXT NOT NULL,
  centroid GEOGRAPHYPOINT NOT NULL,
  key (name) USING clustered columnstore,shard key (id));

INSERT INTO neighborhoods_colstore VALUES
  (1, "Hamilton", 12346, "POLYGON((1 1,2 1,2 2, 1 2, 1 1))","POINT(1.5 1.5)"),
  (2, "Probability Bay", 263951, "POLYGON((5 1,6 1,6 2,5 2,5 1))", "POINT(5.5 1.5)"),
  (3, "Decentralized Park", 29265, "POLYGON((5 5,6 5,6 6,5 6,5 5))", "POINT(5.5 5.5)"),
  (4, "Axiom Township", 845696, "POLYGON((1 5,2 5,2 6,1 6,1 5))", "POINT(1.5 5.5)"),
  (5, "Elegant Island ", 987654, "POLYGON((3 3,4 3,4 4,3 4,3 3))", "POINT(3.5 3.5)");

SELECT id, name FROM neighborhoods_colstore WHERE
  GEOGRAPHY_INTERSECTS("POINT(2 1 )", shape);
Special Spatial Tuning Parameters

Spatial indices on Geography columns may be given an optional RESOLUTION parameter. You can choose a resolution between 6 and 32. This controls how many pieces the polygons and linestrings are broken up into for indexing. As a general rule, a lower number means less memory consumption, faster inserts and updates, but slower query times. A higher number may increase query performance at a cost of memory and insert performance.

Important

These tuning parameters should only be adjusted if you have a measurable need for more performance and have worked through the tradeoffs between ingest, computational load, and memory consumption.

For example, it may be useful to compare the percentage difference in the number of rows returned by GEOGRAPHY_INTERSECTS and APPROX_GEOGRAPHY_INTERSECTS.This difference is thefalse positive" rate of the index for a given resolution. For example, at resolution 16, suppose that 10% more records are returned by the approximate function. At resolution 8, the rate only grows to 11%. In that case you should be able to save memory and increase ingest throughput by using the lower number, without significantly harming query performance. Again, this kind of tuning is highly dependent on your particular dataset and workload. The parameters obtained for one application may not apply to another. Your mileage may vary.

You can also change the resolution of polygons passed into geospatial functions at query time. In this example, the resolution of the shape column is unchanged, but the POLYGON in the second argument is broken into 16 parts. Each part is queried separately. A lower number will result in fewer index lookup operations, but may return more false positive candidates from the index which are then filtered out of the result set in the second phase of query execution.

SELECT *
  FROM neighborhoods WITH (index = shape, resolution = 16)
  WHERE geography_intersects(shape,"POLYGON(...)");
Using GeoJSON Data

SingleStore does not have native GeoJSON support. However, we have generic JSON support as well as computed columns. Combining these features, you can import a subset of GeoJSON data using SingleStore’s built-in JSON type and GEOGRAPHYPOINT type.

For example, given a set of GeoJSON points like this:

{
    "type": "Feature",
    "geometry": {
      "type": "Point",
      "coordinates": [123.456, 78.901]
    },
    "properties": {
      "id": 123456,
      "name": "Probability Bay Naval Observatory"
    }
  }

You can construct a table like this:

CREATE TABLE geostuff (
  id AS dataz::properties::%id persisted BIGINT unsigned,
  name AS dataz::properties::$name persisted VARCHAR(128),
  location AS geography_point(
    dataz::geometry::coordinates::%`0`,
    dataz::geometry::coordinates::%`1`
  ) persisted geographypoint,
  dataz JSON NOT NULL,
  index (location)
  );

And then insert the JSON string:

INSERT INTO geostuff (dataz) VALUES ('{"type": "Feature", "geometry": {"type": "Point","coordinates": [123.456, 78.901]},"properties": {"id": 123456, "name": "Probability Bay Naval Observatory" }}');

SELECT id, name, location FROM geostuff;

  +--------+-----------------------------------+---------------------------------+
  | id     | name                              | location                        |
  +--------+-----------------------------------+---------------------------------+
  | 123456 | Probability Bay Naval Observatory | POINT(123.45600012 78.90100000) |
  +--------+-----------------------------------+---------------------------------+
Working with Window Functions

Window functions perform ranking and aggregate calculations across sets of rows that bear some relation to the current row. Unlike aggregates, which also act on multiple rows, window functions return a result for every row, instead of aggregating values from all rows into a single value.

Syntax

Window functions in SingleStore DB use the following syntax:

function ([expr]) OVER {window_name | (window_definition)}

window_definition:
    [partition_by_clause] [order_by_clause] [frame_clause]

partition_by_clause:
    PARTITION BY {col | expr}

order_by_clause:
    ORDER BY {col | expr}

frame_clause:
    {ROWS | RANGE}
        {UNBOUNDED PRECEDING | int PRECEDING | CURRENT ROW} |
        {BETWEEN
            {UNBOUNDED PRECEDING | int {PRECEDING | FOLLOWING} | CURRENT ROW}
         AND
            {UNBOUNDED FOLLOWING | int {PRECEDING | FOLLOWING} | CURRENT ROW}
        }
Syntax Remarks
window_definition

Window definition clauses are similar to standard GROUP BY and ORDER BY clauses with the following exception: you cannot refer to the projections list using aliases or position, because literal integers are treated as integers instead of positional references.

frame_clause

The frame_clause is used with an aggregate window function, and it specifies a subset of rows in a partition to evaluate. Ranking, value, and percentile window functions are not affected by the frame_clause because they are intended to operate on the entire set of rows in each partition.

In the frame_clause, int PRECEDING and int FOLLOWING are referred to as frame offsets. RANGE cannot be used with frame offsets. Additionally, frame offsets must be non-negative integers. Any non-integer is automatically rounded.

If BETWEEN is not specified, the default upper bound is CURRENT ROW. For example, ROWS UNBOUNDED PRECEDING is equivalent to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

If no frame is specified, the default is RANGE UNBOUNDED PRECEDING, i.e. RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

RANGE includes all rows whose elements in the order_by_clause are equal. If the order_by_clause is empty, RANGE will include all rows in the partition. Consider the following example:

CREATE TABLE t (a INT);

INSERT INTO t VALUES (1),(2),(2),(3),(3),(3),(4);

SELECT COUNT(1) OVER (ORDER BY a ROWS UNBOUNDED PRECEDING) as 'col1',
          COUNT(1) OVER (ORDER BY a RANGE UNBOUNDED PRECEDING) as 'col2'  FROM t;
+------+------+
| col1 | col2 |
+------+------+
|    1 |    1 |
|    2 |    3 |
|    3 |    3 |
|    4 |    6 |
|    5 |    6 |
|    6 |    6 |
|    7 |    7 |
+------+------+
Supported Functions

The following built-in functions are supported in a window function:

Ranking functions:

Value functions:

Aggregate functions:

Percentile functions:

Named Window Definition

window_name is a reference to a named window definition given in a WINDOW clause, such as:

SELECT RANK() OVER window_name FROM table WINDOW window_name AS (window_definition)

If using multiple window functions on the same window frame, you may see improved performance by setting and using a window_name. Using the same window_name across multiple functions allows SingleStore DB code generation to make execution more efficient.

Consider the following examples. The statement below does not use a window_name for two identical window definitions.

SELECT website, timestamp,
AVG(clicks) OVER (PARTITION BY website
                  ORDER BY timestamp
                  ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING),
SUM(visits) OVER (PARTITION BY website
                  ORDER BY timestamp
                  ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
FROM t;

The statement above can be improved by using window_name syntax, which will improve performance:

SELECT website, timestamp,
AVG(clicks) OVER w,
SUM(visits) OVER w
FROM t
WINDOW w AS (PARTITION BY website
              ORDER BY timestamp
              ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING);

The second version of the query has fewer parameters, so the generated code can be smaller and more efficient.