
pandas.to_sql fails when writing more than 256 cells to a SQL Warehouse with method="multi" #24

@paulstaab

Description


This is a duplicate of databricks/databricks-sql-python#300. I am reposting it in this repo, as it still occurs with the current versions of databricks-sqlalchemy~=1.0 (1.0.5) and databricks-sql-connector (4.0.2), together with the DBR 15.4 LTS ML versions of pandas (1.5.3) and sqlalchemy (1.4.39).

Problem

When using sqlalchemy's multi-row insert mode, each value currently gets its own named parameter in the generated query:

INSERT INTO default.test_table (numeric_col, string_col) 
VALUES 
   (:numeric_col_m0, :string_col_m0), 
   (:numeric_col_m1, :string_col_m1), 
   (:numeric_col_m2, :string_col_m2), 
   (:numeric_col_m3, :string_col_m3),
   [...]
   (:numeric_col_m998, :string_col_m998), 
   (:numeric_col_m999, :string_col_m999)

with parameters:

{
'numeric_col_m0': 0, 
'string_col_m0': 'AAA', 
'numeric_col_m1': 1,
'string_col_m1': 'AAA', 
'numeric_col_m2': 2,
[...], 
'numeric_col_m999': 999, 
'string_col_m999': 'AAA'
}

Consequently, even very small tables exceed the limit of 256 named parameters:

sqlalchemy.exc.OperationalError: (databricks.sql.exc.RequestError) Error during request to server: BAD_REQUEST: Parameterized query has too many parameters: 2000 parameters were given but the limit is 256.. BAD_REQUEST: Parameterized query has too many parameters: 2000 parameters were given but the limit is 256.
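
A rough back-of-the-envelope check (an illustration only, not connector code) shows why even small frames hit the limit, assuming one named parameter per cell as in the generated INSERT above:

n_rows, n_cols = 1_000, 2                   # shape of the example DataFrame below
param_limit = 256                           # limit reported by the SQL Warehouse

params_needed = n_rows * n_cols             # 2000 -> matches the error message above
max_rows_per_stmt = param_limit // n_cols   # only 128 two-column rows fit into one statement
print(params_needed, max_rows_per_stmt)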

Example

import pandas as pd 
from sqlalchemy import create_engine

sqlalchemy_connection_string = f"databricks://token:{token}@{host}?http_path={http_path}&catalog={catalog}"
engine = create_engine(sqlalchemy_connection_string)
test_data = pd.DataFrame({"numeric_col": range(1_000), "string_col": ["AAA"] * 1_000})
test_data.to_sql("test_table", engine, if_exists="replace", index=False, method="multi")

Workarounds

Inline Parameters

I tried to avoid this issue by using legacy inline parameters. However, I then get a different error:

sqlalchemy_connection_string = f"databricks://token:{token}@{host}?http_path={http_path}&catalog={catalog}"
engine = create_engine(sqlalchemy_connection_string, connect_args={"use_inline_params": "silent"})
test_data = pd.DataFrame({"numeric_col": range(1_000), "string_col": ["AAA"] * 1_000})
test_data.to_sql("test_table", engine, if_exists="replace", index=False, method="multi")
DatabaseError: (databricks.sql.exc.ServerOperationError) [UNBOUND_SQL_PARAMETER] Found the unbound parameter: numeric_col_m0. Please, fix `args` and provide a mapping of the parameter to either a SQL literal or collection constructor functions such as `map()`, `array()`, `struct()`. SQLSTATE: 42P02; line 1 pos 57

Writing row-by-row

The only workaround at the moment seems to be inserting the data row by row by not setting method in the to_sql() call (see the sketch below).
However, this is prohibitively slow even for medium-sized data frames:

[attached image: row-by-row insert timing]
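
For reference, the row-by-row variant is just the example call without method (a minimal sketch reusing engine and test_data from the example above):

# One INSERT statement per row instead of a single multi-row INSERT
test_data.to_sql("test_table", engine, if_exists="replace", index=False)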
