This repository contains scripts and workflows for automating data refresh processes in Databricks. The sections below explain how to navigate the Databricks menus and use their features effectively.
To get started with Databricks, you need to have an account and access to a Databricks workspace. If you don't have an account, please contact your administrator.
The Home menu is your personal space in Databricks. Here you can:
- Access your recent notebooks and jobs.
- View your personal files and folders.
- Manage your user settings.
The Workspace menu is where you can organize and access all your notebooks, libraries, and other files. It is structured in a hierarchical manner:
- Users: Contains personal folders for each user.
- Shared: Contains shared folders and notebooks accessible by multiple users.
- Repos: Contains version-controlled repositories linked to GitHub or other version control systems.
The Repos menu allows you to manage your version-controlled repositories. You can:
- Clone repositories from GitHub.
- Commit and push changes.
- Manage branches and pull requests.
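Repos can also be managed programmatically. Below is a minimal sketch of building the request body for the Databricks Repos REST API (`POST /api/2.0/repos`); the Git URL and workspace path shown are placeholders, not values from this repository.

```python
import json

def build_repo_payload(git_url, workspace_path, provider="gitHub"):
    """Build the request body for the Databricks Repos API (POST /api/2.0/repos)."""
    return {
        "url": git_url,          # HTTPS URL of the Git repository
        "provider": provider,    # e.g. "gitHub", "gitLab", "azureDevOpsServices"
        "path": workspace_path,  # target path under /Repos in the workspace
    }

# Example with placeholder values:
payload = build_repo_payload(
    "https://github.com/example-org/data-refresh.git",
    "/Repos/me@example.com/data-refresh",
)
print(json.dumps(payload, indent=2))
```

Sending this payload with an authenticated POST clones the repository into the workspace, after which commits, pushes, and branch switches can be done from the Repos UI.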
The Data menu is where you can manage your data sources. You can:
- Browse and explore datasets.
- Create and manage tables.
- Import data from various sources like S3, Azure Blob Storage, and more.
The Clusters menu is where you can create and manage clusters. Clusters are the computational resources used to run your notebooks and jobs. You can:
- Create new clusters.
- Start, stop, and restart clusters.
- Configure cluster settings and permissions.
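Clusters can also be created via the Clusters REST API (`POST /api/2.0/clusters/create`). The sketch below builds a minimal request body; the runtime version and node type are placeholders, so check the valid values for your workspace and cloud provider.

```python
import json

def build_cluster_spec(name, num_workers=2):
    """Build a minimal request body for the Clusters API
    (POST /api/2.0/clusters/create). spark_version and node_type_id
    are placeholders and vary by workspace and cloud provider."""
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder (AWS) node type
        "num_workers": num_workers,
        "autotermination_minutes": 30,        # stop idle clusters to save cost
    }

spec = build_cluster_spec("data-refresh-cluster")
print(json.dumps(spec, indent=2))
```

Setting `autotermination_minutes` is a common cost control: the cluster shuts down automatically after the given idle period and can be restarted from the Clusters menu when needed.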
The Jobs menu allows you to create and manage scheduled jobs. Jobs can be used to automate the execution of notebooks and workflows. You can:
- Create new jobs.
- Schedule jobs to run at specific times.
- Monitor job runs and view logs.
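A scheduled job can also be defined through the Jobs REST API (`POST /api/2.1/jobs/create`). This sketch builds a request body for a notebook task with a Quartz cron schedule; the notebook path and cluster ID are placeholders.

```python
import json

def build_job_spec(job_name, notebook_path, cluster_id, cron="0 0 2 * * ?"):
    """Build a request body for the Jobs API 2.1 (POST /api/2.1/jobs/create).
    The schedule uses Quartz cron syntax; the default runs daily at 02:00."""
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "existing_cluster_id": cluster_id,  # or supply a new_cluster spec
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        },
    }

job = build_job_spec(
    "nightly-data-refresh",
    "/Workspace/Shared/Data_Refresh_Workflow",  # placeholder workspace path
    "1234-567890-abcde123",                     # placeholder cluster ID
)
print(json.dumps(job, indent=2))
```

Once the job exists, each run and its logs can be inspected from the Jobs menu or fetched with the jobs runs endpoints.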
The SQL menu is where you can create and manage SQL queries and dashboards. You can:
- Write and execute SQL queries.
- Create visualizations and dashboards.
- Manage SQL endpoints and permissions.
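Queries can also be run outside the UI via the SQL Statement Execution API (`POST /api/2.0/sql/statements`). The sketch below builds a request body; the warehouse ID and table name are placeholders.

```python
import json

def build_sql_statement(query, warehouse_id):
    """Build a request body for the SQL Statement Execution API
    (POST /api/2.0/sql/statements). warehouse_id is a placeholder."""
    return {
        "statement": query,
        "warehouse_id": warehouse_id,
        "wait_timeout": "30s",  # wait up to 30 seconds for the result inline
    }

req_body = build_sql_statement(
    "SELECT COUNT(*) AS row_count FROM main.default.refresh_log",  # placeholder table
    "abc123def456",  # placeholder SQL warehouse ID
)
print(json.dumps(req_body, indent=2))
```

This is useful for lightweight checks, such as verifying row counts after a refresh, without opening the SQL editor.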
The Settings menu allows you to configure various aspects of your Databricks workspace. You can:
- Manage user and group permissions.
- Configure workspace settings.
- Access audit logs and usage reports.
To run the data refresh workflow, follow these steps:
- Open the Notebook: Navigate to the `Data_Refresh_Workflow` notebook in the Workspace menu.
- Attach to a Cluster: Ensure the notebook is attached to a running cluster.
- Run All Cells: Click `Run All` to execute all cells in the notebook.
- Schedule the Job: Go to the Jobs menu, create a new job, and schedule it to run the `Data_Refresh_Workflow` notebook at your desired frequency.
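The scheduling step above can also be scripted. The sketch below assembles (but does not send) an authenticated Jobs API request with Python's standard library; the workspace URL and access token are placeholders and should come from environment variables or a Databricks secret scope in real use.

```python
import json
import urllib.request

def make_create_job_request(host, token, job_spec):
    """Assemble (without sending) an authenticated request to the Jobs API
    (POST /api/2.1/jobs/create). Call urllib.request.urlopen(req) to submit."""
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(job_spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = make_create_job_request(
    "https://example.cloud.databricks.com",  # placeholder workspace URL
    "dapi-XXXX",                             # placeholder personal access token
    {"name": "nightly-data-refresh"},        # minimal spec for illustration
)
print(req.full_url)
```

Keeping the token out of source control (for example, by reading it from an environment variable) is strongly recommended.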
For more detailed instructions, refer to the individual sections above.
If you would like to contribute to this repository, please fork the repository and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.