[FEATURE REQUEST]: Spark connector for SQL Server (bulk insert through JDBC) #514
-
The JDBC driver for SQL Server is very slow. I noticed that the driver uses sp_prepare followed by sp_execute for each inserted row, so the operation is not a bulk insert (performance is poor at batch sizes of 2,000,000 rows and more). Similar case: https://techcommunity.microsoft.com/t5/datacat/turbo-boost-data-loads-from-spark-using-sql-spark-connector/ba-p/305523# There is a Spark connector for Azure SQL Database and SQL Server available for Scala, but not for C#/.NET Core. It seems a similar solution should be provided for .NET Core to allow bulk inserts from Spark into a SQL Server database.
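For context, this is roughly what the slow write path looks like in .NET for Apache Spark (a minimal sketch; the server, database, table, and credentials are placeholders):

```csharp
using Microsoft.Spark.Sql;

// Minimal sketch of the plain JDBC write path described above.
// All connection details below are placeholders.
SparkSession spark = SparkSession.Builder().AppName("JdbcWrite").GetOrCreate();
DataFrame df = spark.Read().Parquet("/data/input.parquet");

df.Write()
    .Format("jdbc")
    .Option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .Option("dbtable", "dbo.MyTable")
    .Option("user", "myuser")
    .Option("password", "mypassword")
    .Mode(SaveMode.Append)
    .Save(); // executed as per-row sp_prepare/sp_execute calls, not a bulk insert
```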
Replies: 15 comments
-
Thanks for the suggestion. Unfortunately, I don't think we can support/maintain this in our repo (unless there is a huge demand for it). If you need to use this from your application, you can use https://github.com/aelij/IgnoresAccessChecksToGenerator to access the internal classes of Spark .NET and call the connector's Scala functions. Please let me know if you need additional help. Once that is working, you can create a PR against https://github.com/Azure/azure-sqldb-spark to see if they are willing to take it.
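The rough shape of that approach, assuming the internal JVM bridge (`Microsoft.Spark.Interop.SparkEnvironment.JvmBridge`) has been made visible with IgnoresAccessChecksToGenerator; this is a sketch only, since the bridge API is internal and may change:

```csharp
using Microsoft.Spark.Interop;     // internal namespaces, made visible with
using Microsoft.Spark.Interop.Ipc; // IgnoresAccessChecksToGenerator

// The JVM bridge that .NET for Spark uses internally; it can construct JVM
// objects and call static or instance methods on any class on the classpath.
IJvmBridge jvm = SparkEnvironment.JvmBridge;

// For illustration only: instantiate an arbitrary JVM class and call a
// method on the resulting object reference.
JvmObjectReference rng = jvm.CallConstructor("java.util.Random");
object n = rng.Invoke("nextInt", 100);
```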
-
@imback82 I'm trying to work on such a connector because the "regular" JDBC insert is a bottleneck for us. However, I got stuck and I'd need a hint. How would you use the Spark .NET internals to create an instance of their Config class? I tried running a few variants, which are shown with their errors in my comment below.
-
@moredatapls Can you share the full log? There should be a message that precedes the one you shared. Btw, since Config is built by a Scala companion object, you may need to call a static method instead of a constructor.
-
Which static method would I call? I get the following errors:

```csharp
jvm.CallConstructor("com.microsoft.azure.sqldb.spark.config.Config", new Hashtable(jvm))
// System.Exception : JVM method execution failed: Constructor failed for class 'com.microsoft.azure.sqldb.spark.config.Config' when called with 1 arguments ([Index=1, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], )
// 20/06/02 22:42:17 ERROR DotnetBackendHandler: Failed to execute '<init>' on 'NullObject' with args=([Type=java.util.Hashtable, Value: {}])

jvm.CallStaticJavaMethod("com.microsoft.azure.sqldb.spark.config.Config", "apply", new Hashtable(jvm))
// System.Exception : JVM method execution failed: Static method 'apply' failed for class 'com.microsoft.azure.sqldb.spark.config.Config' when called with 1 arguments ([Index=1, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], )
// 20/06/02 22:29:49 WARN DotnetBackendHandler: cannot find matching method interface com.microsoft.azure.sqldb.spark.config.Config.apply. Candidates are:
// 20/06/02 22:29:49 WARN DotnetBackendHandler: apply(class java.lang.String,interface scala.reflect.ClassTag)
// 20/06/02 22:29:49 ERROR DotnetBackendHandler: Failed to execute 'apply' on 'NullObject' with args=([Type=java.util.Hashtable, Value: {}])
```

I also tried using the `DataFrameWriterFunctions` class directly:

```csharp
jvm.CallConstructor("com.microsoft.azure.sqldb.spark.connect.DataFrameWriterFunctions", dataFrameWriter).Invoke("sqlDB", null)
// [2020-06-02T20:50:18.2648630Z] [Error] [JvmBridge] JVM method execution failed: Nonstatic method 'sqlDB' failed for class '28' when called with 1 arguments ([Index=1, Type=null, Value=null], )
// [2020-06-02T20:50:18.2649700Z] [Error] [JvmBridge] java.lang.NullPointerException
// at com.microsoft.azure.sqldb.spark.connect.DataFrameWriterFunctions.sqlDB(DataFrameWriterFunctions.scala:44)
```
-
Can you use a C# `Dictionary` instead?
-
Same result unfortunately:

```csharp
jvm.CallStaticJavaMethod("com.microsoft.azure.sqldb.spark.config.Config", "apply", new Dictionary<string, string>());
// System.Exception : JVM method execution failed: Static method 'apply' failed for class 'com.microsoft.azure.sqldb.spark.config.Config' when called with 1 arguments ([Index=1, Type=Dictionary`2, Value=System.Collections.Generic.Dictionary`2[System.String,System.String]], )
// 20/06/03 09:40:22 WARN DotnetBackendHandler: cannot find matching method interface com.microsoft.azure.sqldb.spark.config.Config.apply. Candidates are:
// 20/06/03 09:40:22 WARN DotnetBackendHandler: apply(class java.lang.String,interface scala.reflect.ClassTag)
// 20/06/03 09:40:22 ERROR DotnetBackendHandler: Failed to execute 'apply' on 'NullObject' with args=([Type=java.util.HashMap, Value: {}])
```

The "candidate" suggested by the DotnetBackendHandler matches the method signature of the `apply(String, ClassTag)` overload, not the `apply` that takes a configuration map, so the backend apparently cannot match the `java.util.HashMap` argument against a Scala `Map` parameter.
-
Try setting some large number for the "batchsize" option when you write to JDBC. As far as I know, JDBC uses the common batch API controlled by this parameter, and it should work well for MSSQL. Also, if you are reading from JDBC, set "fetchsize" as well (although there can be different parameters on the connection string to control buffering). E.g. for Oracle the default fetch size is 10 or 100, so reads are very slow.
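A sketch of that suggestion in .NET for Spark; the connection details are placeholders, while `batchsize` and `fetchsize` are the standard Spark JDBC option names:

```csharp
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().GetOrCreate();

// Write with a large JDBC batch size (placeholder connection details).
spark.Read().Parquet("/data/input.parquet")
    .Write()
    .Format("jdbc")
    .Option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .Option("dbtable", "dbo.MyTable")
    .Option("user", "myuser")
    .Option("password", "mypassword")
    .Option("batchsize", "100000")   // rows per JDBC batch on write
    .Mode(SaveMode.Append)
    .Save();

// Read with an explicit fetch size.
DataFrame df = spark.Read()
    .Format("jdbc")
    .Option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .Option("dbtable", "dbo.MyTable")
    .Option("user", "myuser")
    .Option("password", "mypassword")
    .Option("fetchsize", "10000")    // rows fetched per round trip on read
    .Load();
```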
-
If anybody stumbles across this thread and happens to use Databricks and Azure Synapse SQL: there's an amazing and incredibly fast connector for reading and writing data from and to the data warehouse: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html It totally solved our performance issues and works with .NET for Spark as well.
-
@moredatapls Thanks for sharing! This is good to know. Are you still exploring calling the Scala connector through the JVM internals?
-
This would be a very useful feature if implemented! We are trying to implement an analytics engine with SQL Server as both the source and the sink for the analysis results, to which our front-end systems connect. I think this feature could enable various analytics use cases, as most .NET applications are built on SQL Server.
-
I would greatly appreciate it if you could share a simple example showing how to read from and write to SQL Server with .NET for Spark.
-
@imback82 yes, I'm still interested in getting this to work. I'll try to upload a repo with some example code so that you can see what I've tried so far.

@rrekapalli it is actually very straightforward: just use the .NET for Spark equivalent of the Spark functions shown in the documentation. Remember, though, that this only works if you use Databricks and their Spark runtime, and only on Azure, which is of course a very limiting factor. I don't think their drivers are even open source.

EDIT: I haven't tried this with a "regular" SQL Server, only with the Azure Synapse SQL pool, and I don't think it will work.

```csharp
session.Conf().Set($"fs.azure.account.key.{blobStorageAccountName}.blob.core.windows.net", blobStorageAccessKey);

dataFrame.Write()
    .Format("com.databricks.spark.sqldw")
    .Option("url", $"jdbc:sqlserver://{databaseUrl}:{databasePort};databaseName={databaseName};encrypt=true;")
    .Option("user", username)
    .Option("password", password)
    .Option("forwardSparkAzureStorageCredentials", "true")
    .Option("dbTable", table)
    .Option("tempDir", $"wasbs://{blobStorageContainerName}@{blobStorageAccountName}.blob.core.windows.net/{blobStorageDirectoryName}")
    .Mode(SaveMode.Append) // or any other save mode you want; not sure which ones they support, but Append works for sure
    .Save();
```
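Reads should work through the same connector (a sketch based on the Databricks documentation linked above; variable names as in the snippet above):

```csharp
// Reading from Synapse through the same connector and staging directory.
DataFrame result = session.Read()
    .Format("com.databricks.spark.sqldw")
    .Option("url", $"jdbc:sqlserver://{databaseUrl}:{databasePort};databaseName={databaseName};encrypt=true;")
    .Option("user", username)
    .Option("password", password)
    .Option("forwardSparkAzureStorageCredentials", "true")
    .Option("dbTable", table)
    .Option("tempDir", $"wasbs://{blobStorageContainerName}@{blobStorageAccountName}.blob.core.windows.net/{blobStorageDirectoryName}")
    .Load();
```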
-
@moredatapls thank you very much for the details! We are evaluating .NET for Spark for one of our .NET-based applications that has a regular SQL Server (2012) back end, so I was interested to see whether there is any library/driver (JDBC/ODBC) that can help us transfer data to/from SQL Server.
-
Hi @paul-pl, @moredatapls, any luck with improving write performance with the JDBC drivers (or any other solution)?
-
FYI, we have recently added documentation on how to connect to SQL Server through .NET for Apache Spark; please feel free to refer to it here.