This repository contains Databricks notebooks for practicing data analysis and processing in both Python and Scala. Each notebook comes in an unsolved and a solved version, so you can practice on your own and check the solution only if you get stuck.
For the final project, a Faker-based generator script is included to create synthetic data if needed.
Not all notebooks are provided in both languages; for instance, the AI-related ones are only available in Python.
/
├── Section_folder/
│ ├── python/
│ │ ├── notebook1.py # Unsolved version
│ │ ├── notebook1_solved.py # Solved version
│ ├── scala/
│ │ ├── notebook2.scala # Unsolved version
│ │ ├── notebook2_solved.scala # Solved version
├── datasets/
│ ├── dataset_name # Dataset used in the notebooks
│ │ ├── dataset_file.csv # File for the dataset (some datasets may have several files)
│ │ ├── dataset_file.json # File for the dataset (some datasets may have several files)
├── README.md
- Open the notebooks folder and choose between Python (`.py`) or Scala (`.scala`).
- Import the notebook in the Databricks UI by clicking the `Import` button.
- If you want to challenge yourself, start with the unsolved versions before checking the solutions.
The dataset for the final project is under the `dataset/project` folder.
If you want to create additional synthetic data, you can modify the script in `7_proyecto/dataset_generator` as you please.
In order to do so:
- Make sure you have Python installed in your local environment.
- Navigate to the `7_proyecto/dataset_generator` folder.
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - On Windows: `venv\Scripts\activate`
  - On macOS/Linux: `source venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Run the script: `python generate_data.py`

With the default configuration this will create a `product_catalog.csv` with 50 rows and 5 `sales_data<INDEX>.csv` files with 25,000 rows each. You can modify the script as you please.
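To give an idea of the kind of output the generator produces, here is a minimal stand-in sketch using only the Python standard library (the actual script in `7_proyecto/dataset_generator` uses Faker; the file names, column names, and `write_*` helpers below are illustrative assumptions, not the real script's API):

```python
import csv
import random

def write_product_catalog(path="product_catalog.csv", rows=50):
    # Write a small product catalog CSV: header + `rows` data rows.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product_id", "name", "price"])
        for i in range(1, rows + 1):
            writer.writerow([i, f"Product {i}", round(random.uniform(1, 100), 2)])

def write_sales_files(prefix="sales_data", n_files=5, rows=25000, n_products=50):
    # Write several sales CSVs (sales_data1.csv, sales_data2.csv, ...),
    # each referencing product_ids from the catalog above.
    for idx in range(1, n_files + 1):
        with open(f"{prefix}{idx}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["sale_id", "product_id", "quantity"])
            for i in range(1, rows + 1):
                writer.writerow([i, random.randint(1, n_products), random.randint(1, 10)])
```

Swapping `random`-generated values for Faker providers (names, addresses, dates, etc.) is what gives the real script more realistic data.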
- Ensure you have Python installed to run the Faker script.
- Databricks users should upload the dataset to DBFS before running Scala notebooks.
- The provided solutions should only be referenced after attempting the exercises.
Enjoy coding!