Load Dataframes From Notebooks

The simplest and fastest way to load data is with a pandas dataframe; however, the inferred schema is not always optimal. Define the table and schema in advance to avoid having to make schema alterations later.

Create a Dataset with Geospatial and Embeddings

Create a dataset with complex data types, such as integers, datetimes, geospatial data, and vectors.

Run the following command to install Shapely:

!pip3 install shapely

Create the dataset:

from shapely import wkt
import pandas as pd
import numpy as np

data = {
    'Integer': [1, 2, 3, 4],
    'Datetime': [pd.Timestamp(x) for x in ['2023-06-01', '2023-06-02', '2023-06-03', '2023-06-04']],
    # Parse WKT strings into shapely geometry objects
    'Geospatial Data': [wkt.loads(x) for x in [
        'POLYGON((1 1,2 1,2 2, 1 2, 1 1))',
        'POLYGON((5 1,6 1,6 2,5 2,5 1))',
        'POLYGON((5 5,6 5,6 6,5 6,5 5))',
        'POLYGON((1 5,2 5,2 6,1 6,1 5))']],
    # Embedding-style vectors as numpy arrays
    'Vectors': [np.array(x) for x in [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]]
}
df = pd.DataFrame(data)
display(df)

Connect to the Database

Select the Connection and the Database to use from the dropdown boxes at the top of the notebook.

from sqlalchemy import create_engine

# connection_url is populated from the notebook's connection selections
db_connection = create_engine(connection_url)
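
Outside a SingleStore notebook, where connection_url is not injected automatically, the URL can be built by hand. A minimal sketch, assuming the singlestoredb SQLAlchemy dialect and placeholder credentials:

# Hypothetical credentials and host -- replace with your own
connection_url = 'singlestoredb://user:password@svc-example-host:3306/MyDatabase'
db_connection = create_engine(connection_url)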

Check Table Schema

Check the schema for the table. It is not optimal: the geospatial and embedding data were interpreted as text by the engine.
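
The table_schema_inference table queried below is assumed to have been created by loading the dataframe directly and letting the engine infer the column types; a minimal sketch of that load:

df.to_sql('table_schema_inference', con=db_connection, if_exists='append',
          index=False, chunksize=1000)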

%%sql
SELECT
    TABLE_SCHEMA,
    TABLE_NAME,
    COLUMN_NAME,
    COLUMN_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'table_schema_inference' AND TABLE_SCHEMA = 'MyDatabase';

COLUMN_NAME        COLUMN_TYPE
Integer            bigint(20)
Datetime           datetime
Geospatial Data    text
Vectors            text

Create a Table with the Correct Schema

Geospatial indexes are only supported on rowstore tables, so create the table as a rowstore with the correct column types:

%%sql
CREATE ROWSTORE TABLE `table_schema_defined` (
    `Integer` bigint(20) DEFAULT NULL,
    `Datetime` datetime DEFAULT NULL,
    `Geospatial Data` GEOGRAPHY NOT NULL,
    `Vectors` blob,
    INDEX (`Geospatial Data`)
);

With the changes made to the SingleStore Python client library, no extra steps are required to convert the data before inserting it into the database. All NumPy arrays and Shapely or PyGEOS geometry objects in a dataframe are converted into the correct format automatically.
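
As an illustration of what the client library handles for you, a hypothetical manual conversion would look roughly like this sketch; both calls use standard Shapely and NumPy APIs:

# Hypothetical manual conversion -- the SingleStore Python client
# library performs the equivalent automatically on insert.
df_manual = df.copy()
df_manual['Geospatial Data'] = df_manual['Geospatial Data'].apply(lambda g: g.wkt)  # shapely geometry -> WKT text
df_manual['Vectors'] = df_manual['Vectors'].apply(lambda v: v.tobytes())  # numpy array -> raw bytes for blob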

Load the Dataframe Into the Predefined Table Schema

This method remains fast, and because the schema is already defined, no inference takes place: the data lands in the correct column types.

df.to_sql('table_schema_defined', con=db_connection, if_exists='append',
          index=False, chunksize=1000)

Verify the Data

Check that the geospatial and vector data have been correctly loaded.

%%sql
SELECT * FROM table_schema_defined;
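
To continue working with the results in Python, the stored values can be decoded back into their original types. A minimal sketch, assuming the vectors were stored as 64-bit integer bytes and that the GEOGRAPHY column is returned as WKT text:

from shapely import wkt
import pandas as pd
import numpy as np

result = pd.read_sql('SELECT * FROM table_schema_defined', con=db_connection)
# Raw blob bytes -> numpy array (assumes the source arrays were int64)
result['Vectors'] = result['Vectors'].apply(lambda b: np.frombuffer(b, dtype=np.int64))
# WKT text -> shapely geometry
result['Geospatial Data'] = result['Geospatial Data'].apply(wkt.loads)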
