This repository contains Databricks notebooks for practicing data analysis and processing in both Python and Scala. Each notebook comes in an unsolved and a solved version, so you can practice on your own and check the solution only if you get stuck.
For the final project, a Faker-based generator script is included to create synthetic data if needed.
Not all notebooks are provided in both languages; for instance, the AI-related ones are only available in Python.
/
├── Section_folder/
│ ├── python/
│ │ ├── notebook1.py # Unsolved version
│ │ ├── notebook1_solved.py # Solved version
│ ├── scala/
│ │ ├── notebook2.scala # Unsolved version
│ │ ├── notebook2_solved.scala # Solved version
├── datasets/
│ ├── dataset_name # Dataset used in the notebooks
│ │ ├── dataset_file.csv # File for the dataset (some datasets may have several files)
│ │ ├── dataset_file.json # File for the dataset (some datasets may have several files)
├── README.md
- Open the notebooks folder and choose between Python (`.py`) or Scala (`.scala`).
- Import the notebook in the Databricks UI by clicking the `Import` button.
- If you want to challenge yourself, start with the unsolved versions before checking the solutions.
The dataset for the final project is under the `dataset/project` folder.
If you want to create additional synthetic data, you can modify the script in `7_proyecto/dataset_generator` as you please.
In order to do so:
- Make sure you have Python installed in your local environment.
- Navigate to the `7_proyecto/dataset_generator` folder.
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment:
  - On Windows: `venv\Scripts\activate`
  - On macOS/Linux: `source venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
- Run the script: `python generate_data.py`

With the default configuration this will create a `product_catalog.csv` with 50 rows and 5 `sales_data<INDEX>.csv` files with 25,000 rows each. You can modify the script as you please.
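To give an idea of the kind of output the generator produces, here is a minimal stand-in sketch using only the Python standard library (the actual script in `7_proyecto/dataset_generator` uses Faker; the file names, column names, and `write_*` helpers below are illustrative assumptions, not the real script's API):

```python
import csv
import random

def write_product_catalog(path="product_catalog.csv", rows=50):
    # Write a small product catalog CSV: header + `rows` data rows.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["product_id", "name", "price"])
        for i in range(1, rows + 1):
            writer.writerow([i, f"Product {i}", round(random.uniform(1, 100), 2)])

def write_sales_files(prefix="sales_data", n_files=5, rows=25000, n_products=50):
    # Write several sales CSVs (sales_data1.csv, sales_data2.csv, ...),
    # each referencing product_ids from the catalog above.
    for idx in range(1, n_files + 1):
        with open(f"{prefix}{idx}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["sale_id", "product_id", "quantity"])
            for i in range(1, rows + 1):
                writer.writerow([i, random.randint(1, n_products), random.randint(1, 10)])
```

Swapping `random`-generated values for Faker providers (names, addresses, dates, etc.) is what gives the real script more realistic data.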
- Ensure you have Python installed to run the Faker script.
- Databricks users should upload the dataset to DBFS before running Scala notebooks.
- The provided solutions should only be referenced after attempting the exercises.
Enjoy coding!