Load Dataframes From Notebooks

The simplest and fastest way to load data is using a pandas dataframe. However, the schema inference is not always optimal. Define the table and schema in advance to avoid having to make future schema alterations.

Create a Dataset with Geospatial and Embeddings

Create a dataset with complex data types, such as, integer, date time, geospatial, and vectors.

Run the following command to install Shapely:

!pip3 install shapely

Create the dataset:

from shapely import wkt
import pandas as pd
import numpy as np
data = {
'Integer': [1, 2, 3, 4],
'Datetime': [pd.Timestamp(x) for x in ['2023-06-01', '2023-06-02', '2023-06-03', '2023-06-04']],
'Geospatial Data': [wkt.loads(x) for x in ['POLYGON((1 1,2 1,2 2, 1 2, 1 1))', 'POLYGON((5 1,6 1,6 2,5 2,5 1))', 'POLYGON((5 5,6 5,6 6,5 6,5 5))', 'POLYGON((1 5,2 5,2 6,1 6,1 5))']],
'Vectors': [np.array(x) for x in [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]]
}
df = pd.DataFrame(data)
display(df)

Connecting to the Database

Select the Connection and the Database to use from the dropdown boxes at the top of the notebook.

from sqlalchemy import *
db_connection = create_engine(connection_url)

Check Table Schema

Check the schema for the table; it is not optimal since geospatial and embeddings data were interpreted as text by the engine.

%%sql
SELECT
TABLE_SCHEMA,
TABLE_NAME,
COLUMN_NAME,
COLUMN_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'table_schema_inference' AND TABLE_SCHEMA = 'MyDatabase';

COLUMN_NAME

COLUMN_TYPE

Integer

bigint(20)

Datetime

datetime

Geospatial Data

text

Vectors

text

Create a Table with the Correct Schema

Geospatial indexes are only supported with a rowstore table. Ensure the schema is correct and rowstore in the creation process:

%%sql
CREATE ROWSTORE TABLE `table_schema_defined` (
`Integer` bigint(20) DEFAULT NULL,
`Datetime` datetime DEFAULT NULL,
`Geospatial Data` GEOGRAPHY NOT NULL,
`Vectors` blob,
index (`Geospatial Data`)
)

With the changes made to the SingleStore Python client library, no extra steps are required to convert the data before inserting it into a database. All numpy arrays, shapely and pygeos geometry objects in a dataframe will be converted into the correct format automatically.

Load the Dataframe Into the Predefined Table Schema

This method is fast but will not provide the best schema inference.

df.to_sql('table_schema_defined', con = db_connection, if_exists =
'append', index = False, chunksize = 1000)

Verify the Data

Check that the geospatial and vector data have been correctly loaded.

%%sql
select * from table_schema_defined

Last modified: August 14, 2024

Was this article helpful?