[FEATURE REQUEST]: Spark connector for SQL Server (bulk insert through JDBC) #514
-
The JDBC driver for SQL Server is very slow. I noticed that the driver uses sp_prepare followed by sp_execute for each inserted row, so the operation is not a bulk insert (performance is poor at batch sizes of 2,000,000 rows and more). Similar case: https://techcommunity.microsoft.com/t5/datacat/turbo-boost-data-loads-from-spark-using-sql-spark-connector/ba-p/305523# There is a Spark connector for Azure SQL Database and SQL Server available for Scala, but not for C#/.NET Core. It seems a similar solution should be provided for .NET Core to allow bulk inserts from Spark into a SQL Server database.
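For context, this is roughly what the slow write path looks like in .NET for Apache Spark (a minimal sketch; the server, database, table, and credentials are placeholders):

```csharp
using Microsoft.Spark.Sql;

// Minimal sketch of the plain JDBC write path described above.
// All connection details below are placeholders.
SparkSession spark = SparkSession.Builder().AppName("JdbcWrite").GetOrCreate();
DataFrame df = spark.Read().Parquet("/data/input.parquet");

df.Write()
    .Format("jdbc")
    .Option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .Option("dbtable", "dbo.MyTable")
    .Option("user", "myuser")
    .Option("password", "mypassword")
    .Mode(SaveMode.Append)
    .Save(); // executed as per-row sp_prepare/sp_execute calls, not a bulk insert
```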
Replies: 15 comments
-
Thanks for the suggestion. Unfortunately, I don't think we can support/maintain this in our repo (unless there is a huge demand for it). If you need to use this from your application, you can use https://github.com/aelij/IgnoresAccessChecksToGenerator to access the internal classes of Spark .NET and call the connector's Scala functions. Please let me know if you need additional help. Once that is working, you can create a PR against https://github.com/Azure/azure-sqldb-spark to see if they are willing to take it.
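The rough shape of that approach, assuming the internal JVM bridge (`Microsoft.Spark.Interop.SparkEnvironment.JvmBridge`) has been made visible with IgnoresAccessChecksToGenerator; this is a sketch only, since the bridge API is internal and may change:

```csharp
using Microsoft.Spark.Interop;     // internal namespaces, made visible with
using Microsoft.Spark.Interop.Ipc; // IgnoresAccessChecksToGenerator

// The JVM bridge that .NET for Spark uses internally; it can construct JVM
// objects and call static or instance methods on any class on the classpath.
IJvmBridge jvm = SparkEnvironment.JvmBridge;

// For illustration only: instantiate an arbitrary JVM class and call a
// method on the resulting object reference.
JvmObjectReference rng = jvm.CallConstructor("java.util.Random");
object n = rng.Invoke("nextInt", 100);
```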
-
@imback82 I'm trying to work on such a connector because the "regular" JDBC insert is a bottleneck for us. However, I got stuck and I'd need a hint. How would you use the Spark .NET internals to create an instance of their Config class? I tried running a few variants, which are shown with their errors in my comment below.
-
@moredatapls Can you share the full log? There should be a message that precedes the one you shared. Btw, since Config is built by a Scala companion object, you may need to call a static method instead of a constructor.
-
Which static method would I call? I get the following errors:

```csharp
jvm.CallConstructor("com.microsoft.azure.sqldb.spark.config.Config", new Hashtable(jvm))
// System.Exception : JVM method execution failed: Constructor failed for class 'com.microsoft.azure.sqldb.spark.config.Config' when called with 1 arguments ([Index=1, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], )
// 20/06/02 22:42:17 ERROR DotnetBackendHandler: Failed to execute '<init>' on 'NullObject' with args=([Type=java.util.Hashtable, Value: {}])

jvm.CallStaticJavaMethod("com.microsoft.azure.sqldb.spark.config.Config", "apply", new Hashtable(jvm))
// System.Exception : JVM method execution failed: Static method 'apply' failed for class 'com.microsoft.azure.sqldb.spark.config.Config' when called with 1 arguments ([Index=1, Type=Hashtable, Value=Microsoft.Spark.Interop.Internal.Java.Util.Hashtable], )
// 20/06/02 22:29:49 WARN DotnetBackendHandler: cannot find matching method interface com.microsoft.azure.sqldb.spark.config.Config.apply. Candidates are:
// 20/06/02 22:29:49 WARN DotnetBackendHandler: apply(class java.lang.String,interface scala.reflect.ClassTag)
// 20/06/02 22:29:49 ERROR DotnetBackendHandler: Failed to execute 'apply' on 'NullObject' with args=([Type=java.util.Hashtable, Value: {}])
```

I also tried using the `DataFrameWriterFunctions` class directly:

```csharp
jvm.CallConstructor("com.microsoft.azure.sqldb.spark.connect.DataFrameWriterFunctions", dataFrameWriter).Invoke("sqlDB", null)
// [2020-06-02T20:50:18.2648630Z] [Error] [JvmBridge] JVM method execution failed: Nonstatic method 'sqlDB' failed for class '28' when called with 1 arguments ([Index=1, Type=null, Value=null], )
// [2020-06-02T20:50:18.2649700Z] [Error] [JvmBridge] java.lang.NullPointerException
// at com.microsoft.azure.sqldb.spark.connect.DataFrameWriterFunctions.sqlDB(DataFrameWriterFunctions.scala:44)
```
-
Can you use a C# `Dictionary` instead?
-
Same result unfortunately:

```csharp
jvm.CallStaticJavaMethod("com.microsoft.azure.sqldb.spark.config.Config", "apply", new Dictionary<string, string>());
// System.Exception : JVM method execution failed: Static method 'apply' failed for class 'com.microsoft.azure.sqldb.spark.config.Config' when called with 1 arguments ([Index=1, Type=Dictionary`2, Value=System.Collections.Generic.Dictionary`2[System.String,System.String]], )
// 20/06/03 09:40:22 WARN DotnetBackendHandler: cannot find matching method interface com.microsoft.azure.sqldb.spark.config.Config.apply. Candidates are:
// 20/06/03 09:40:22 WARN DotnetBackendHandler: apply(class java.lang.String,interface scala.reflect.ClassTag)
// 20/06/03 09:40:22 ERROR DotnetBackendHandler: Failed to execute 'apply' on 'NullObject' with args=([Type=java.util.HashMap, Value: {}])
```

The "candidate" suggested by the DotnetBackendHandler matches the method signature of the `apply(String, ClassTag)` overload, not the `apply` that takes a configuration map, so the backend apparently cannot match the `java.util.HashMap` argument against a Scala `Map` parameter.
-
Try setting some large number for the "batchsize" option when you write to JDBC. As far as I know, JDBC uses the common batch API controlled by this parameter, and it should work well for MSSQL. Also, if you are reading from JDBC, set "fetchsize" as well (although there can be different parameters on the connection string to control buffering). E.g. for Oracle the default fetch size is 10 or 100, so reads are very slow.
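A sketch of that suggestion in .NET for Spark; the connection details are placeholders, while `batchsize` and `fetchsize` are the standard Spark JDBC option names:

```csharp
using Microsoft.Spark.Sql;

SparkSession spark = SparkSession.Builder().GetOrCreate();

// Write with a large JDBC batch size (placeholder connection details).
spark.Read().Parquet("/data/input.parquet")
    .Write()
    .Format("jdbc")
    .Option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .Option("dbtable", "dbo.MyTable")
    .Option("user", "myuser")
    .Option("password", "mypassword")
    .Option("batchsize", "100000")   // rows per JDBC batch on write
    .Mode(SaveMode.Append)
    .Save();

// Read with an explicit fetch size.
DataFrame df = spark.Read()
    .Format("jdbc")
    .Option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
    .Option("dbtable", "dbo.MyTable")
    .Option("user", "myuser")
    .Option("password", "mypassword")
    .Option("fetchsize", "10000")    // rows fetched per round trip on read
    .Load();
```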
-
If anybody stumbles across this thread and happens to use Databricks and Azure Synapse SQL: there's an amazing and incredibly fast connector for reading and writing data from and to the data warehouse: https://docs.databricks.com/data/data-sources/azure/synapse-analytics.html It totally solved our performance issues and works with .NET for Spark as well.
-
@moredatapls Thanks for sharing! This is good to know. Are you still exploring calling the Scala connector through the JVM internals?
-
This would be a very useful feature if implemented! We are trying to implement an analytics engine with SQL Server as both the source and the sink for the analysis results, to which our front-end systems connect. I think this feature could enable various analytics use cases, as most .NET applications are built on SQL Server.
-
I would greatly appreciate it if you could share a simple example showing how to read from and write to SQL Server with .NET for Spark.
-
@imback82 yes, I'm still interested in getting this to work. I'll try to upload a repo with some example code so that you can see what I've tried so far.

@rrekapalli it is actually very straightforward: just use the .NET for Spark equivalent of the Spark functions shown in the documentation. Remember, though, that this only works if you use Databricks and their Spark runtime, and only on Azure, which is of course a very limiting factor. I don't think their drivers are even open source.

EDIT: I haven't tried this with a "regular" SQL Server, only with the Azure Synapse SQL pool, and I don't think it will work.

```csharp
session.Conf().Set($"fs.azure.account.key.{blobStorageAccountName}.blob.core.windows.net", blobStorageAccessKey);

dataFrame.Write()
    .Format("com.databricks.spark.sqldw")
    .Option("url", $"jdbc:sqlserver://{databaseUrl}:{databasePort};databaseName={databaseName};encrypt=true;")
    .Option("user", username)
    .Option("password", password)
    .Option("forwardSparkAzureStorageCredentials", "true")
    .Option("dbTable", table)
    .Option("tempDir", $"wasbs://{blobStorageContainerName}@{blobStorageAccountName}.blob.core.windows.net/{blobStorageDirectoryName}")
    .Mode(SaveMode.Append) // or any other save mode you want; not sure which ones they support, but Append works for sure
    .Save();
```
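Reads should work through the same connector (a sketch based on the Databricks documentation linked above; variable names as in the snippet above):

```csharp
// Reading from Synapse through the same connector and staging directory.
DataFrame result = session.Read()
    .Format("com.databricks.spark.sqldw")
    .Option("url", $"jdbc:sqlserver://{databaseUrl}:{databasePort};databaseName={databaseName};encrypt=true;")
    .Option("user", username)
    .Option("password", password)
    .Option("forwardSparkAzureStorageCredentials", "true")
    .Option("dbTable", table)
    .Option("tempDir", $"wasbs://{blobStorageContainerName}@{blobStorageAccountName}.blob.core.windows.net/{blobStorageDirectoryName}")
    .Load();
```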
-
@moredatapls thank you very much for the details! We are evaluating .NET for Spark for one of our .NET-based applications that has a regular SQL Server (2012) back end, so I was interested to see whether there is any library/driver (JDBC/ODBC) that can help us transfer data to/from SQL Server.
-
Hi @paul-pl, @moredatapls, any luck with improving write performance with the JDBC drivers (or any other solution)?
-
FYI, we have recently added documentation on how to connect to SQL Server through .NET for Apache Spark; please feel free to refer to it here.