This demo shows how you can use a Microsoft Entra ID service principal to read from and write to an Azure SQL Database or an Azure Database for PostgreSQL from a Databricks notebook.
First, navigate to your tenant's Entra ID directory and add a new app registration. This will create an Entra ID service principal that you can use to authenticate to Azure resources:
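If you prefer scripting over the portal, the same app registration and service principal can be created with the Azure CLI. A minimal sketch; the display name below matches the service principal used throughout this demo:
# Register the application and capture its application (client) ID
APP_ID=$(az ad app create --display-name "sql-dbx-read-write-test-sp" \
  --query appId --output tsv)
# Create the service principal for the app registration
az ad sp create --id "$APP_ID"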
Next, create a secret for the service principal.
Note: You'll need the tenant ID, service principal name, client ID and secret value for the steps below.
The secret value will appear on the screen. Do not leave the page before recording it (it will never be displayed again). You may want to keep the browser tab open and complete the remaining steps in a second tab for convenience.
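If you created the app registration with the CLI, you can add the client secret from there as well. Like the portal, the CLI shows the secret value only once, so capture it immediately. A sketch, assuming $APP_ID still holds the application (client) ID from the earlier command:
# Add a new client secret and print its value (shown only this once).
# --append keeps any existing credentials instead of replacing them.
az ad app credential reset --id "$APP_ID" --append --query password --output tsv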
Create an Azure Key Vault in the same region as your Databricks workspace.
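For example, with the Azure CLI (the resource group, vault name and region below are placeholders):
# Create a key vault that uses Azure RBAC for data-plane authorization,
# which is what the role assignments in the next step rely on
az keyvault create \
  --name kv-dbx-demo \
  --resource-group rg-dbx-demo \
  --location westeurope \
  --enable-rbac-authorization true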
Assign yourself the Key Vault Secrets Officer role and the Azure Databricks application the Key Vault Secrets User role. (Databricks will access the key vault referenced in your secrets scope using the Databricks application's own service principal, which is unique to your tenant. You might expect a Unity Catalog-enabled workspace to use the workspace's managed identity to connect to the key vault, but unfortunately that's not the case.)
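A sketch of the two role assignments with the Azure CLI, reusing the placeholder names from above; the Azure Databricks first-party application is looked up by its display name (AzureDatabricks):
KV_ID=$(az keyvault show --name kv-dbx-demo --resource-group rg-dbx-demo \
  --query id --output tsv)
# You manage the secrets in the vault
az role assignment create \
  --assignee "$(az ad signed-in-user show --query id --output tsv)" \
  --role "Key Vault Secrets Officer" \
  --scope "$KV_ID"
# The Azure Databricks application only needs to read them
DBX_SP_ID=$(az ad sp list --display-name "AzureDatabricks" --query "[0].id" --output tsv)
az role assignment create \
  --assignee-object-id "$DBX_SP_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Key Vault Secrets User" \
  --scope "$KV_ID"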
Add three secrets to the key vault: one for your tenant ID, another for the service principal's client ID (not the secret ID) and a third for the secret value. You will need these to authenticate as the service principal in your Databricks notebook. Refer back to the secret value in the other browser tab as needed.
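For example (the secret names below are the ones the notebook code later in this demo expects; the vault name is still the placeholder from above):
az keyvault secret set --vault-name kv-dbx-demo \
  --name tenant-id --value "<your-tenant-id>"
az keyvault secret set --vault-name kv-dbx-demo \
  --name sql-dbx-read-write-test-sp-client-id --value "$APP_ID"
az keyvault secret set --vault-name kv-dbx-demo \
  --name sql-dbx-read-write-test-sp-secret --value "<secret-value-recorded-earlier>"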
Many of the newer Databricks features like serverless compute, data lineage and managed identity support require Unity Catalog. Unity Catalog is now enabled by default when you create a new workspace and gives you the most authentication options when connecting to Azure resources like databases. The steps described in this demo should work both for Unity Catalog-enabled workspaces and for workspaces using the legacy Hive metastore.
Next, we need to create a secrets scope for our Databricks workspace. Open your workspace and add #secrets/createScope after the Databricks instance URL in your browser. It's case sensitive, so be sure to use a capital S in 'createScope':
A page to configure a new secrets scope should appear. Give your secrets scope a name and paste the key vault URI (DNS Name) and resource ID into the respective fields. You can find both on the Properties blade of your key vault.
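Alternatively, a Key Vault-backed scope can be created with the legacy Databricks CLI; a sketch, assuming the scope is named default (the name used by the notebook code below), with placeholders for the key vault's resource ID and DNS name:
# Legacy Databricks CLI (v0.17.x) syntax for a Key Vault-backed secrets scope
databricks secrets create-scope --scope default \
  --scope-backend-type AZURE_KEYVAULT \
  --resource-id "/subscriptions/<subscription-id>/resourceGroups/rg-dbx-demo/providers/Microsoft.KeyVault/vaults/kv-dbx-demo" \
  --dns-name "https://kv-dbx-demo.vault.azure.net/"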
If you just want to read from a database with a service principal, you can use the serverless cluster. Just add the azure-identity library to the environment configuration.
If you also want to write to the database, however, you'll need to use a provisioned cluster.
After the cluster has been created, click on the cluster name in the list and then switch to the Libraries tab to install azure-identity from PyPI.
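If you'd rather script the library installation, the legacy Databricks CLI can do it too; a sketch, with a placeholder cluster ID:
# Install azure-identity on an existing cluster (legacy Databricks CLI syntax)
databricks libraries install \
  --cluster-id <your-cluster-id> \
  --pypi-package azure-identity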
Create an Azure SQL database with Entra ID authentication. You can use the AdventureWorksLT sample to pre-populate it with data if you like. For the purposes of this demo, we will only be working with the INFORMATION_SCHEMA views.
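If you want to script this part, here is a sketch with the Azure CLI; the server name matches the placeholder used in the JDBC URL later on, the resource group and region are assumptions, and the signed-in user becomes the Entra ID admin:
# Logical server with Entra ID-only authentication and yourself as admin
az sql server create \
  --name sql-xxxx \
  --resource-group rg-dbx-demo \
  --location westeurope \
  --enable-ad-only-auth \
  --external-admin-principal-type User \
  --external-admin-name user@domain.com \
  --external-admin-sid "$(az ad signed-in-user show --query id --output tsv)"
# Database pre-populated with the AdventureWorksLT sample
az sql db create \
  --resource-group rg-dbx-demo \
  --server sql-xxxx \
  --name testdb \
  --sample-name AdventureWorksLT \
  --service-objective Basic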
Then connect to the database as the Entra ID admin user and create a user for the service principal. You can use the Query editor in the portal or SQL Server Management Studio.
CREATE USER [sql-dbx-read-write-test-sp] FROM EXTERNAL PROVIDER;
Add the service principal user to the database roles db_datareader, db_datawriter and db_ddladmin. This will allow the service principal to read and write data in existing tables and to create tables. The db_ddladmin role is needed because overwrite mode drops and recreates a table before writing to it.
ALTER ROLE db_datareader
ADD MEMBER [sql-dbx-read-write-test-sp];
ALTER ROLE db_datawriter
ADD MEMBER [sql-dbx-read-write-test-sp];
ALTER ROLE db_ddladmin
ADD MEMBER [sql-dbx-read-write-test-sp];
Paste the code below into the first cell of the notebook. Using the dbutils library, retrieve the tenant ID, client ID and service principal secret from the key vault referenced in the secrets scope and get an Entra ID token.
from azure.identity import ClientSecretCredential

# Retrieve the service principal credentials from the Key Vault-backed secrets scope
tenant_id = dbutils.secrets.get(scope = "default", key = "tenant-id")
client_id = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-client-id")
client_secret = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-secret")

# Request an access token for Azure SQL Database
credential = ClientSecretCredential(tenant_id, client_id, client_secret)
token = credential.get_token("https://database.windows.net/.default").token
In the next cell, create a JDBC connection to the database and query the list of tables from INFORMATION_SCHEMA.
jdbc_url = "jdbc:sqlserver://sql-xxxx.database.windows.net:1433;database=testdb"
connection_properties = {
    "accessToken": token,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
# Read from a table
df = spark.read.jdbc(url=jdbc_url, table="INFORMATION_SCHEMA.TABLES", properties=connection_properties)
df.show()
The output should look something like this:
If you're using a provisioned cluster, you can run the code below to write to the database.
df.write.jdbc(
    url=jdbc_url,
    table="dbo.Test",
    mode="overwrite",
    properties=connection_properties
)
Running the code above on a serverless cluster will result in an error.
If you ask the Databricks AI assistant to Diagnose error, it will tell you that writing with JDBC isn't supported on serverless compute and that you'll need to switch to a provisioned cluster instead.
To verify that the data was written as expected, you can read the table back into a dataframe:
df_check = spark.read.jdbc(
    url=jdbc_url,
    table="dbo.Test",
    properties=connection_properties
)
df_check.show()
You can find the complete notebook here.
First, create an Azure Database for PostgreSQL flexible server with Entra ID authentication. Then connect to the postgres database as the Entra ID administrator user.
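If you'd rather create the server with the CLI than the portal, here is a sketch; the server name matches the placeholder used later, while the resource group, region and database name are assumptions:
# Flexible server with Entra ID authentication enabled and password authentication disabled
az postgres flexible-server create \
  --name psql-xxxxx \
  --resource-group rg-dbx-demo \
  --location westeurope \
  --active-directory-auth Enabled \
  --password-auth Disabled \
  --database-name testdb
# Make yourself the Entra ID administrator of the server
az postgres flexible-server ad-admin create \
  --resource-group rg-dbx-demo \
  --server-name psql-xxxxx \
  --display-name user@domain.com \
  --object-id "$(az ad signed-in-user show --query id --output tsv)" \
  --type User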
In the example below, we connect using bash commands in Azure Cloud Shell in the portal. (You can use the PostgreSQL add-in in VS Code instead if you prefer.)
Set the environment variables in the Cloud Shell window.
export PGHOST=psql-xxxxx.postgres.database.azure.com
export PGUSER=user@domain.com
export PGPORT=5432
export PGDATABASE=postgres
export PGPASSWORD="$(az account get-access-token --resource https://ossrdbms-aad.database.windows.net --query accessToken --output tsv)"
Now simply type psql. The environment variables set above will be used automatically to connect.
psql
Once connected, run the statement below to create a user for the service principal:
SELECT * FROM pgaadauth_create_principal('sql-dbx-read-write-test-sp',false, false);
Quit your connection to the postgres database and then open a new connection to your application database (testdb in our example). Grant the service principal user read and write privileges on tables in the public schema, along with privileges on the schema itself so it can create tables:
-- Read/write on tables that the current role creates in the public schema from now on
ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO "sql-dbx-read-write-test-sp";

-- Privileges on the schema itself, so the service principal can create tables
GRANT ALL PRIVILEGES ON SCHEMA public TO "sql-dbx-read-write-test-sp";
Paste the code below into the first cell of the notebook. Using the dbutils library, retrieve the tenant ID, client ID and service principal secret from the key vault referenced in the secrets scope and get an Entra ID token. The URL used to request the token is different from the one we used above to authenticate to Azure SQL DB.
from azure.identity import ClientSecretCredential

sp_name = "sql-dbx-read-write-test-sp"

# Retrieve the service principal credentials from the Key Vault-backed secrets scope
tenant_id = dbutils.secrets.get(scope = "default", key = "tenant-id")
client_id = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-client-id")
client_secret = dbutils.secrets.get(scope = "default", key = "sql-dbx-read-write-test-sp-secret")

# Request an access token for Azure Database for PostgreSQL (note the different resource URL)
credential = ClientSecretCredential(tenant_id, client_id, client_secret)
token = credential.get_token("https://ossrdbms-aad.database.windows.net/.default").token
In the next cell, add the following code to connect to and read data from the database into a dataframe. You need to specify the service principal name as the user and pass the token as the password. Once the connection is made, however, we read and write to the database the same way we did for Azure SQL.
jdbc_url = "jdbc:postgresql://psql-xxxxxxx.postgres.database.azure.com:5432/testdb"
connection_properties = {
    "user": sp_name,
    "password": token,
    "driver": "org.postgresql.Driver",
    "ssl": "true",
    # NonValidatingFactory skips server certificate validation; acceptable for a demo
    "sslfactory": "org.postgresql.ssl.NonValidatingFactory"
}
# Read from a table
df = spark.read.jdbc(url=jdbc_url, table="information_schema.tables", properties=connection_properties)
display(df)
Add a new cell and paste in the code below to write the contents of the dataframe to a new table in the database. The only difference from the Azure SQL code above is that we are using the public schema in place of dbo.
df.write.jdbc(
    url=jdbc_url,
    table="public.Test",
    mode="overwrite",
    properties=connection_properties
)
As before, running the code above on a serverless cluster will result in an error.
If you ask the Databricks AI assistant to Diagnose error, it will tell you that writing with JDBC isn't supported on serverless compute and that you'll need to switch to a provisioned cluster instead.
Finally, let's add a cell to read the table contents into a dataframe to confirm that the data was written.
df_check = spark.read.jdbc(
    url=jdbc_url,
    table="public.Test",
    properties=connection_properties
)
df_check.show()
You can download the complete notebook here.