Description
This is a duplicate of databricks/databricks-sql-python#300. I am reposting it in this repo, as it still occurs with the current versions of databricks-sqlalchemy~=1.0 (1.0.5) and databricks-sql-connector (4.0.2), together with the DBR 15.4 LTS ML versions of pandas (1.5.3) and sqlalchemy (1.4.39).
Problem
When using SQLAlchemy's multi-row insert mode, each value currently gets its own named parameter in the generated query:
INSERT INTO default.test_table (numeric_col, string_col)
VALUES
(:numeric_col_m0, :string_col_m0),
(:numeric_col_m1, :string_col_m1),
(:numeric_col_m2, :string_col_m2),
(:numeric_col_m3, :string_col_m3),
[...]
(:numeric_col_m998, :string_col_m998),
(:numeric_col_m999, :string_col_m999)
with parameters:
{
'numeric_col_m0': 0,
'string_col_m0': 'AAA',
'numeric_col_m1': 1,
'string_col_m1': 'AAA',
'numeric_col_m2': 2,
'string_col_m2': 'AAA',
[...],
'numeric_col_m999': 999,
'string_col_m999': 'AAA'
}
Consequently, even inserts of fairly small data frames exceed the server's limit of 256 named parameters:
sqlalchemy.exc.OperationalError: (databricks.sql.exc.RequestError) Error during request to server: BAD_REQUEST: Parameterized query has too many parameters: 2000 parameters were given but the limit is 256.. BAD_REQUEST: Parameterized query has too many parameters: 2000 parameters were given but the limit is 256.
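The arithmetic behind the failure is simple: with one named parameter per cell, a 1,000-row, 2-column frame needs 2,000 parameters, while at most 256 // 2 = 128 rows fit in a single statement. A quick sketch of the budget:

```python
# Parameter-budget arithmetic, assuming the 256-parameter limit
# reported in the server error above.
PARAM_LIMIT = 256
n_cols = 2       # numeric_col, string_col
n_rows = 1_000

total_params = n_rows * n_cols          # one named parameter per cell
max_rows_per_stmt = PARAM_LIMIT // n_cols

print(total_params)       # 2000 -- far above the limit
print(max_rows_per_stmt)  # 128 two-column rows fit in one statement
```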
Example
import pandas as pd
from sqlalchemy import create_engine

# token, host, http_path, and catalog are workspace-specific values
sqlalchemy_connection_string = f"databricks://token:{token}@{host}?http_path={http_path}&catalog={catalog}"
engine = create_engine(sqlalchemy_connection_string)
test_data = pd.DataFrame({"numeric_col": range(1_000), "string_col": ["AAA"] * 1_000})
test_data.to_sql("test_table", engine, if_exists="replace", index=False, method="multi")
Workarounds
Inline Parameters
I tried to avoid this issue by using legacy inline parameters. However, this produces a different error:
sqlalchemy_connection_string = f"databricks://token:{token}@{host}?http_path={http_path}&catalog={catalog}"
engine = create_engine(sqlalchemy_connection_string, connect_args={"use_inline_params": "silent"})
test_data = pd.DataFrame({"numeric_col": range(1_000), "string_col": ["AAA"] * 1_000})
test_data.to_sql("test_table", engine, if_exists="replace", index=False, method="multi")
DatabaseError: (databricks.sql.exc.ServerOperationError) [UNBOUND_SQL_PARAMETER] Found the unbound parameter: numeric_col_m0. Please, fix `args` and provide a mapping of the parameter to either a SQL literal or collection constructor functions such as `map()`, `array()`, `struct()`. SQLSTATE: 42P02; line 1 pos 57
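A partial mitigation, until the dialect batches parameters itself, may be to keep method="multi" but cap the rows per statement via to_sql()'s chunksize argument, so that each generated INSERT stays within the parameter limit. A sketch, demonstrated against an in-memory SQLite engine rather than Databricks (the chunking itself happens on the pandas side):

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine for illustration; with Databricks, build the engine
# from your workspace's connection string instead.
engine = create_engine("sqlite://")

test_data = pd.DataFrame({"numeric_col": range(1_000), "string_col": ["AAA"] * 1_000})

# 2 columns -> at most 256 // 2 = 128 rows per statement, so chunksize=128
# keeps every generated multi-row INSERT at or under 256 named parameters.
test_data.to_sql(
    "test_table",
    engine,
    if_exists="replace",
    index=False,
    method="multi",
    chunksize=128,
)

print(pd.read_sql("SELECT COUNT(*) AS n FROM test_table", engine)["n"][0])  # 1000
```

This still issues ceil(1000 / 128) = 8 round trips instead of one, but avoids both the parameter-limit error and the row-by-row slowness described below.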
Writing row-by-row
The only reliable workaround at the moment seems to be to insert the data row by row by not passing the method argument to to_sql(). However, this is prohibitively slow even for medium-sized data frames: