
🌟 Integrated Business Project – 3rd Semester at PUC-SP: Bachelor's in Humanistic AI & Data Science

Under the guidance of Professor Dr. Eric Bacconi, Coordinator of the Bachelor's Program in Humanistic AI & Data Science at PUC-SP.




$$\Huge {\textbf{\color{DodgerBlue} GOOD DECISIONS = GOOD RESULTS}}$$




Sponsor Quantum Software Development




Table of Contents

  1. Linear Regression and Data Scaling Analysis - Normalization - Scaling
  2. Practical Example for Calculating this Normalized Value in Python
  3. Practical Example for Calculating this Normalized Value in Excel
  4. Range Standardization
  5. Pearson Correlation
  6. Linear Regression: Price Prediction Case Study
  7. Multiple Linear Regression Analysis Report
  8. Logistic Regression
  9. Discriminant Analysis
  10. Lasso, Ridge and Elastic Net Regression
  11. Brains Made of Code: Regression Training with Gradient Descent and Stochastic Optimization Algorithms



1. Linear Regression and Data Scaling Analysis - Normalization - Scaling


This project demonstrates a complete machine learning workflow for price prediction using:

  • Stepwise Regression for feature selection
  • Advanced statistical analysis (ANOVA, R² metrics)
  • Full model diagnostics
  • Interactive visualization integration

➢➣ Get access to all the projects + datasets developed in this course.


This section describes the process of scaling or normalizing data to a specific range, typically a standardized scale such as 0 to 1. This is a common technique in data analysis and machine learning.



$$X_{normalized} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$


$$X_{\max} - X_{\min} = \text{Amplitude}$$


The denominator, $X_{\max} - X_{\min}$, is the amplitude: the range of the data values before normalization.
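
For example, normalizing the value 50 in a column whose minimum is 20 and maximum is 120 (hypothetical numbers):

$$X_{normalized} = \frac{50 - 20}{120 - 20} = \frac{30}{100} = 0.30$$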


To calculate the standardization of the variables salario, n_filhos, and idade using both the Z-Score and Range methods, and to evaluate the mean, standard deviation, maximum, and minimum before and after standardization, we can follow these steps:

Compute the mean, standard deviation, maximum, and minimum for each of the variables (n_filhos, salario, idade).

We standardize the variables using the Z-Score method, which is computed as:

$$Z = \frac{X - \mu}{\sigma}$$

Where:

  • $\mu$ is the mean,

  • $\sigma$ is the standard deviation.
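
For example, with hypothetical values $X = 70$, $\mu = 50$, and $\sigma = 10$:

$$Z = \frac{70 - 50}{10} = 2$$

meaning the value lies two standard deviations above the mean.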


We scale the data using the Min-Max method, which scales the values to a [0, 1] range using:

$$X' = \frac{X - \min(X)}{\max(X) - \min(X)}$$

Where:

  • $X$ is the original value,
  • $\min(X)$ is the minimum value,
  • $\max(X)$ is the maximum value.

Compute the mean, standard deviation, maximum, and minimum of the standardized data for both Z-Score and Range methods.

The output will provide the descriptive statistics before and after each standardization method, allowing you to compare the effects of Z-Score and Range standardization on the dataset.


2. Practical Example for Calculating this Normalized Value in Python:

Use this dataset

The code demonstrates how to apply Z-Score and Range (Min-Max) standardization to the variables salario, n_filhos, and idade in a dataset. It also evaluates and compares the mean, standard deviation, minimum, and maximum values before and after the standardization methods are applied.


# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Load the data from the Excel file
# df = pd.read_excel('use-your-own-dataset.xlsx') - optional
df = pd.read_excel('cadastro_funcionarios.xlsx')
df.head()  # Displaying the first few rows of the dataset to understand its structure

# Step 1: Evaluate the mean, std, max, and min before standardization
before_std_stats = {
    'mean_n_filhos': df['n_filhos'].mean(),
    'std_n_filhos': df['n_filhos'].std(),
    'min_n_filhos': df['n_filhos'].min(),
    'max_n_filhos': df['n_filhos'].max(),
    
    'mean_salario': df['salario'].mean(),
    'std_salario': df['salario'].std(),
    'min_salario': df['salario'].min(),
    'max_salario': df['salario'].max(),
    
    'mean_idade': df['idade'].mean(),
    'std_idade': df['idade'].std(),
    'min_idade': df['idade'].min(),
    'max_idade': df['idade'].max(),
}

# Display the statistics before standardization
before_std_stats

# Step 2: Z-Score Standardization
df_zscore = df[['n_filhos', 'salario', 'idade']].apply(lambda x: (x - x.mean()) / x.std())

# Display the standardized data
df_zscore.head()

# Step 3: Evaluate the mean, std, max, and min after Z-Score standardization
after_zscore_stats = {
    'mean_n_filhos_zscore': df_zscore['n_filhos'].mean(),
    'std_n_filhos_zscore': df_zscore['n_filhos'].std(),
    'min_n_filhos_zscore': df_zscore['n_filhos'].min(),
    'max_n_filhos_zscore': df_zscore['n_filhos'].max(),
    
    'mean_salario_zscore': df_zscore['salario'].mean(),
    'std_salario_zscore': df_zscore['salario'].std(),
    'min_salario_zscore': df_zscore['salario'].min(),
    'max_salario_zscore': df_zscore['salario'].max(),
    
    'mean_idade_zscore': df_zscore['idade'].mean(),
    'std_idade_zscore': df_zscore['idade'].std(),
    'min_idade_zscore': df_zscore['idade'].min(),
    'max_idade_zscore': df_zscore['idade'].max(),
}

# Display the statistics after Z-Score standardization
after_zscore_stats


# Step 4: Range Standardization (Min-Max Scaling)
scaler = MinMaxScaler()
df_range = pd.DataFrame(scaler.fit_transform(df[['n_filhos', 'salario', 'idade']]), columns=['n_filhos', 'salario', 'idade'])

# Display the scaled data
df_range.head()

# Step 5: Evaluate the mean, std, max, and min after Range standardization
after_range_stats = {
    'mean_n_filhos_range': df_range['n_filhos'].mean(),
    'std_n_filhos_range': df_range['n_filhos'].std(),
    'min_n_filhos_range': df_range['n_filhos'].min(),
    'max_n_filhos_range': df_range['n_filhos'].max(),
    
    'mean_salario_range': df_range['salario'].mean(),
    'std_salario_range': df_range['salario'].std(),
    'min_salario_range': df_range['salario'].min(),
    'max_salario_range': df_range['salario'].max(),
    
    'mean_idade_range': df_range['idade'].mean(),
    'std_idade_range': df_range['idade'].std(),
    'min_idade_range': df_range['idade'].min(),
    'max_idade_range': df_range['idade'].max(),
}

# Display the statistics after Range standardization
after_range_stats

3. Practical Example for Calculating this Normalized Value in Excel

Use this dataset

To standardize the variables (salary, number of children, and age) in Excel using the Z-Score and Range methods, you can follow these steps:


Steps for Z-Score in Excel:

Use the AVERAGE function to calculate the mean of the column. For example, to find the mean of the salary (column E), use:

=AVERAGE(E2:E351)

Use the STDEV.P function to calculate the standard deviation of the column. For example, to find the standard deviation of the salary (column E), use:

=STDEV.P(E2:E351)

For each value in the column, apply the Z-Score formula. In the first row of the new column, use:

=(E2 - AVERAGE(E$2:E$351)) / STDEV.P(E$2:E$351)

Example for Salary:

In cell H2 (new column for standardized salary), write

=(E2 - AVERAGE(E$2:E$351)) / STDEV.P(E$2:E$351)

Then, drag it down to the rest of the rows.

Repeat the same steps for the variables n_filhos (column D) and idade (column F).


4. Range Standardization

Steps for Range Standardization in Excel:

Use the MIN and MAX functions to find the minimum and maximum values of the column. For example, to find the min and max of salary (column E), use:

=MIN(E2:E351)
=MAX(E2:E351)

For each value in the column, apply the range formula. In the first row of the new column, use:

=(E2 - MIN(E$2:E$351)) / (MAX(E$2:E$351) - MIN(E$2:E$351))

Example for Salary:

In cell I2 (new column for range standardized salary), write:

=(E2 - MIN(E$2:E$351)) / (MAX(E$2:E$351) - MIN(E$2:E$351))

Then, drag it down to the rest of the rows. Repeat the same steps for the variables n_filhos (column D) and idade (column F).


Z-Score Standardization centers the data around zero and scales it based on the standard deviation.

Range Standardization (Min-Max Scaling) rescales the data to a [0, 1] range.

Both techniques were applied (given dataset) to the columns n_filhos, salario, and idade, and the statistics (mean, std, min, max) were calculated before and after the standardization methods.


  • Correlation does not imply causation: Correlation between two variables does not necessarily mean that one causes the other. For example, there may be a correlation between the number of salespeople in a store and increased sales, but that does not imply that having more salespeople directly causes higher sales.

  • Using regression we don’t need to worry about standardization: unlike methods such as k-NN or neural networks, where the scale of the data can impact performance, ordinary least squares regression can be applied directly without scaling adjustments; rescaling an input changes its coefficient but not the fitted predictions. (Regularized regressions such as Lasso and Ridge, covered in Section 10, are the exception: their penalties are scale-sensitive, so features should be standardized first.) A quick numerical check is sketched below.
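
A minimal sketch (synthetic data) of that check: rescaling a feature changes the OLS coefficient but leaves the predictions untouched.

# Checking that OLS predictions are unchanged when a feature is rescaled
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

model_raw = LinearRegression().fit(X, y)
model_scaled = LinearRegression().fit(X * 1000, y)  # same feature in different units

print(np.allclose(model_raw.predict(X), model_scaled.predict(X * 1000)))  # True
print(model_raw.coef_[0], model_scaled.coef_[0])  # second coefficient is 1000x smaller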


5. Pearson Correlation

Pearson Correlation is a statistical measure that describes the strength and direction of a linear relationship between two variables. The Pearson correlation value ranges from -1 to 1:

  • 1: Perfect positive correlation (both variables increase together).
  • -1: Perfect negative correlation (one variable increases while the other decreases).
  • 0: No linear correlation.

For example, if we're analyzing the correlation between the area of a house and its price, a Pearson value close to 1 would indicate that as the area increases, the price also tends to increase.
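
A minimal sketch of computing the Pearson correlation in Python (hypothetical area and price values):

# Pearson correlation between house area and price (hypothetical data)
import pandas as pd

df = pd.DataFrame({
    'area':  [50, 70, 80, 100, 120, 150],    # square meters
    'price': [150, 210, 230, 310, 350, 450]  # in thousands
})

r = df['area'].corr(df['price'], method='pearson')
print(f"Pearson r = {r:.3f}")  # close to 1: strong positive linear relationship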


6. Simple Linear Regression

Simple Linear Regression is a statistical model that describes the relationship between a dependent variable (response) and an independent variable (predictor). The model is represented by the formula:

$$ y = \beta_0 + \beta_1 \cdot x $$

Where:

  • $y$ is the dependent variable (the one we want to predict),
  • $x$ is the independent variable (the one used to make predictions),
  • $\beta_0$ is the intercept (the value of $y$ when $x = 0$),
  • $\beta_1$ is the coefficient (the change in $y$ when $x$ increases by one unit).

Simple linear regression is widely used for predicting a value based on a linear relationship between variables.

Steps to Perform Linear Regression:

  1. Data Collection: Gather the data related to the dependent and independent variables.
  2. Exploratory Data Analysis (EDA): Explore the data to identify trends, patterns, and check correlation.
  3. Model Fitting: Fit the linear regression model to the data using a method like Ordinary Least Squares (OLS).
  4. Model Evaluation: Evaluate the model performance using metrics like Mean Squared Error (MSE) and the Coefficient of Determination ($R^2$).
  5. Prediction: Use the fitted model to make predictions with new data.

Simple linear regression is a great starting point for predictive problems where a linear relationship between variables is expected.
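
Before the full dataset example below, a minimal one-variable sketch (synthetic, hypothetical values) shows these five steps end to end:

# Simple linear regression on a tiny synthetic dataset
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # independent variable
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])            # dependent variable

model = LinearRegression().fit(x, y)  # OLS fit
y_pred = model.predict(x)

print(f"Intercept (beta_0): {model.intercept_:.3f}")
print(f"Slope (beta_1): {model.coef_[0]:.3f}")
print(f"MSE: {mean_squared_error(y, y_pred):.4f}, R^2: {r2_score(y, y_pred):.4f}")
print(f"Prediction for x = 6: {model.predict([[6.0]])[0]:.2f}")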

1- Example Code - Correlation Vendas Gjornal

Use This Dataset - BD Gerais.xlsx

If you don't have the required libraries installed, you can install them with pip:


pip install pandas numpy matplotlib scikit-learn openpyxl


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset from the Excel file
file_path = 'BD Gerais.xlsx'
df = pd.read_excel(file_path)

# Display the first few rows of the dataset
print(df.head())

# Let's assume the columns are: 'Vendas', 'Gjornal', 'GTV', 'Gmdireta'
# Compute the correlation matrix
correlation_matrix = df.corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

# Perform linear regression: Let's use 'Vendas' as the target and 'Gjornal', 'GTV', 'Gmdireta' as features
X = df[['Gjornal', 'GTV', 'Gmdireta']]  # Features
y = df['Vendas']  # Target variable

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Print the regression coefficients
print("\nRegression Coefficients:")
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")

# Make predictions
y_pred = model.predict(X)

# Calculate Mean Squared Error and R-squared
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"\nMean Squared Error: {mse}")
print(f"R-squared: {r2}")

# Plot the actual vs predicted values
plt.scatter(y, y_pred)
plt.plot([min(y), max(y)], [min(y), max(y)], color='red', linestyle='--')
plt.title('Actual vs Predicted Vendas')
plt.xlabel('Actual Vendas')
plt.ylabel('Predicted Vendas')
plt.show()

# Plot the regression line for each feature vs 'Vendas'
fig, axs = plt.subplots(1, 3, figsize=(15, 5))

# Plot for 'Gjornal'
axs[0].scatter(df['Gjornal'], y, color='blue')
axs[0].plot(df['Gjornal'], model.intercept_ + model.coef_[0] * df['Gjornal'], color='red')
axs[0].set_title('Gjornal vs Vendas')
axs[0].set_xlabel('Gjornal')
axs[0].set_ylabel('Vendas')

# Plot for 'GTV'
axs[1].scatter(df['GTV'], y, color='blue')
axs[1].plot(df['GTV'], model.intercept_ + model.coef_[1] * df['GTV'], color='red')
axs[1].set_title('GTV vs Vendas')
axs[1].set_xlabel('GTV')
axs[1].set_ylabel('Vendas')

# Plot for 'Gmdireta'
axs[2].scatter(df['Gmdireta'], y, color='blue')
axs[2].plot(df['Gmdireta'], model.intercept_ + model.coef_[2] * df['Gmdireta'], color='red')
axs[2].set_title('Gmdireta vs Vendas')
axs[2].set_xlabel('Gmdireta')
axs[2].set_ylabel('Vendas')

plt.tight_layout()
plt.show()

2- Example Code - Correlation Vendas - GTV

Use This Dataset - BD Gerais.xlsx

To compute the correlation between the Vendas and GTV columns in your dataset using Python, you can follow this code. This will calculate the correlation coefficient and visualize the relationship between these two variables using a scatter plot.

import pandas as pd
import matplotlib.pyplot as plt
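
Building on these imports, a minimal completion (assuming the same BD Gerais.xlsx columns) might look like this:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_excel('BD Gerais.xlsx')

# Pearson correlation coefficient between Vendas and GTV
corr = df['Vendas'].corr(df['GTV'])
print(f"Correlation between Vendas and GTV: {corr:.4f}")

# Scatter plot to visualize the relationship
plt.scatter(df['GTV'], df['Vendas'], color='blue')
plt.title('Vendas vs GTV')
plt.xlabel('GTV')
plt.ylabel('Vendas')
plt.show()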

7. Multiple Linear Regression with Four Variables

  • Vendas as the dependent variable (Y)

  • Gjornal, GTV, and Gmdireta as independent variables (X)

This code will also calculate the correlation matrix, fit the multiple linear regression model, and display the regression results.

Python Code for Multiple Linear Regression and Correlation


1- Install Required Libraries (if you don't have them yet)

pip install pandas numpy matplotlib statsmodels scikit-learn

2- Python Code


import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset from the Excel file
file_path = 'BD Gerais.xlsx'  # Adjust the file path if needed
df = pd.read_excel(file_path)

# Display the first few rows of the dataset to verify the data
print(df.head())

# Calculate the correlation matrix for the variables
correlation_matrix = df[['Vendas', 'Gjornal', 'GTV', 'Gmdireta']].corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

# Define the independent variables (X) and the dependent variable (Y)
X = df[['Gjornal', 'GTV', 'Gmdireta']]  # Independent variables
y = df['Vendas']  # Dependent variable (Vendas)

# Add a constant (intercept) to the independent variables
X = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(y, X).fit()

# Display the regression results
print("\nRegression Results:")
print(model.summary())

# Alternatively, using sklearn's LinearRegression to calculate the coefficients and R-squared
model_sklearn = LinearRegression()
model_sklearn.fit(X[['Gjornal', 'GTV', 'Gmdireta']], y)

# Coefficients and intercept
print("\nLinear Regression Coefficients (sklearn):")
print("Intercept:", model_sklearn.intercept_)
print("Coefficients:", model_sklearn.coef_)

# Predicting with the model
y_pred = model_sklearn.predict(X[['Gjornal', 'GTV', 'Gmdireta']])

# Calculating R-squared and Mean Squared Error (MSE)
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)

print(f"\nR-squared: {r2:.4f}")
print(f"Mean Squared Error: {mse:.4f}")

# Plotting the actual vs predicted Vendas
plt.scatter(y, y_pred)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # line of perfect prediction
plt.xlabel('Actual Vendas')
plt.ylabel('Predicted Vendas')
plt.title('Actual vs Predicted Vendas')
plt.show()

Loading Data:

The dataset is loaded from BD Gerais.xlsx using pandas.read_excel(). Adjust the file path to your actual file location.

Correlation Matrix:

We calculate the correlation matrix for the four variables: Vendas, Gjornal, GTV, and Gmdireta. This gives us an overview of the relationships between the variables.

Model Fitting (statsmodels):

We define the independent variables (Gjornal, GTV, Gmdireta) as X and the dependent variable (Vendas) as y. We add a constant term (intercept) to X using sm.add_constant() for proper regression. We then use the statsmodels OLS method to fit the multiple linear regression model and print the regression summary, which includes coefficients, R-squared, p-values, and more.

Alternative Model (sklearn):

We also use sklearn.linear_model.LinearRegression() for comparison, which calculates the coefficients and R-squared. We then use the trained model to predict the Vendas values and calculate the Mean Squared Error (MSE) and R-squared.

Plotting:

The actual values of Vendas are plotted against the predicted values from the regression model in a scatter plot. A red line of perfect prediction is also added (this line represents the ideal case where actual values equal predicted values).

Interpreting the Output:

The correlation matrix displays the correlation between Vendas, Gjornal, GTV, and Gmdireta, which helps you understand the relationships between these variables. The regression summary (from statsmodels) includes:

  • Coefficients: the relationship between each independent variable and the dependent variable (Vendas).
  • R-squared: measures how well the model fits the data.
  • P-values: for testing the statistical significance of each coefficient.



  • The model's intercept and coefficients are printed for comparison.

  • R-squared and Mean Squared Error (MSE) evaluate the performance of the regression model:

    • R-squared tells you how well the model explains the variance in the dependent variable.

    • MSE gives an idea of the average squared difference between the predicted and actual values.

The plot shows how well the model's predicted Vendas values match the actual values.

Example Output (Model Summary from statsmodels):

                            OLS Regression Results
==============================================================================
Dep. Variable:                 Vendas   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     530.8
Date:                Thu, 10 Mar 2025   Prob (F-statistic):           2.31e-14
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         12.1532      3.001      4.055      0.001       5.892      18.414
Gjornal        2.4503      0.401      6.100      0.000       1.638       3.262
GTV            1.2087      0.244      4.948      0.000       0.734       1.683
Gmdireta       0.5003      0.348      1.437      0.168      -0.190       1.191
==============================================================================

R-squared of 0.982 indicates that the model explains 98.2% of the variance in Vendas.

The coefficients of the independent variables show how each variable affects Vendas.


If the p-value for Gmdireta is greater than 0.05, it means that Gmdireta is not statistically significant in explaining the variability in the dependent variable Vendas. In such a case, it's common practice to remove the variable from the model and perform the regression again with only the statistically significant variables.


P-value: The p-value is used to test the null hypothesis that the coefficient of the variable is equal to zero (i.e., the variable has no effect). If the p-value is greater than 0.05, it indicates that the variable is not statistically significant at the 5% level and doesn't provide much explanatory power in the model.

Adjusted R-squared: By removing variables that are not significant, you can improve the model's adjusted explanatory power, since doing so helps reduce multicollinearity and overfitting.
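
To check significance programmatically instead of reading the summary table, the fitted statsmodels result exposes p-values directly; a small sketch, continuing from the fitted model above:

# Inspect p-values from the fitted statsmodels OLS result
pvalues = model.pvalues.drop('const')      # p-value per predictor, intercept excluded
not_significant = pvalues[pvalues > 0.05]  # candidates for removal at the 5% level
print(not_significant)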


Let’s update the code by removing Gmdireta from the regression model and re-running the analysis with just Gjornal and GTV as the independent variables.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset from the Excel file
file_path = 'BD Gerais.xlsx'  # Adjust the file path if needed
df = pd.read_excel(file_path)

# Display the first few rows of the dataset to verify the data
print(df.head())

# Calculate the correlation matrix for the variables
correlation_matrix = df[['Vendas', 'Gjornal', 'GTV']].corr()  # Excluding 'Gmdireta'
print("\nCorrelation Matrix (without Gmdireta):")
print(correlation_matrix)

# Define the independent variables (X) and the dependent variable (Y)
X = df[['Gjornal', 'GTV']]  # Independent variables (Gjornal and GTV)
y = df['Vendas']  # Dependent variable (Vendas)

# Add a constant (intercept) to the independent variables
X = sm.add_constant(X)

# Fit the multiple linear regression model
model = sm.OLS(y, X).fit()

# Display the regression results
print("\nRegression Results (without Gmdireta):")
print(model.summary())

# Alternatively, using sklearn's LinearRegression to calculate the coefficients and R-squared
model_sklearn = LinearRegression()
model_sklearn.fit(X[['Gjornal', 'GTV']], y)

# Coefficients and intercept
print("\nLinear Regression Coefficients (sklearn):")
print("Intercept:", model_sklearn.intercept_)
print("Coefficients:", model_sklearn.coef_)

# Predicting with the model
y_pred = model_sklearn.predict(X[['Gjornal', 'GTV']])

# Calculating R-squared and Mean Squared Error (MSE)
r2 = r2_score(y, y_pred)
mse = mean_squared_error(y, y_pred)

print(f"\nR-squared: {r2:.4f}")
print(f"Mean Squared Error: {mse:.4f}")

# Plotting the actual vs predicted Vendas
plt.scatter(y, y_pred)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')  # line of perfect prediction
plt.xlabel('Actual Vendas')
plt.ylabel('Predicted Vendas')
plt.title('Actual vs Predicted Vendas')
plt.show()

In this updated model, Gmdireta is excluded as an independent variable.

Correlation Matrix:

The correlation matrix is now calculated using only Vendas, Gjornal, and GTV (without Gmdireta), to examine the relationships among the remaining variables.

Independent Variables (X):

We now use only Gjornal and GTV as the independent variables for the regression analysis; Gmdireta is no longer included in the model.

Multiple Linear Regression (statsmodels):

We perform the multiple linear regression with Gjornal and GTV as independent variables. The regression summary now shows the coefficients, p-values, R-squared, and other statistics for the model with the reduced set of independent variables.

Linear Regression (sklearn):

We also use sklearn.linear_model.LinearRegression() to perform the regression and output the intercept and coefficients for the model without Gmdireta.


After fitting the regression model, we calculate the predicted values y_pred for Vendas using the new model (without Gmdireta).

We calculate the R-squared and Mean Squared Error (MSE) to evaluate the model's performance.

The R-squared tells us how much of the variance in Vendas is explained by Gjornal and GTV.

The MSE tells us the average squared difference between the predicted and actual values.


The plot visualizes how well the predicted Vendas values match the actual values. The red line represents the ideal case where the predicted values equal the actual values.


Example of Expected Output (Updated Model Summary):



                            OLS Regression Results
==============================================================================
Dep. Variable:                 Vendas   R-squared:                       0.976
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     320.3
Date:                Thu, 10 Mar 2025   Prob (F-statistic):           4.23e-10
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.6345      2.591      4.107      0.000       5.146      16.123
Gjornal        2.8951      0.453      6.394      0.000       1.987       3.803
GTV            1.3547      0.290      4.673      0.000       0.778       1.931
==============================================================================

R-squared: A value of 0.976 means that the independent variables Gjornal and GTV explain 97.6% of the variance in Vendas, which is a good fit.



The coefficient for Gjornal (2.8951) tells us that for each unit increase in Gjornal, Vendas increases by approximately 2.90 units.

The coefficient for GTV (1.3547) tells us that for each unit increase in GTV, Vendas increases by approximately 1.35 units.


P-values: Both Gjornal and GTV have very small p-values (much smaller than 0.05), indicating that they are statistically significant in predicting Vendas.

By removing the variable Gmdireta (which had a p-value greater than 0.05), the regression model now focuses on the variables that have a stronger statistical relationship with the dependent variable Vendas.


8. Logistic Regression

Click here to access Theoretical and Practical Material.

Click here to access the Dataset

Click here and access Logistic Regression Basic Code

Click here and access Logistic Regression Exercises + Codes and Datasets


9. Discriminant Analysis - Iris

Click here to access Theoretical and Practical Material.

Click here to access the Dataset

Click here and access Discriminant Basic Code - Iris

Click here and access Discriminant Exercises + Codes and Datasets

Click here and access Discriminant Exercises in Excel

Click here and access the Iris Repository


10. Lasso, Ridge and Elastic Net Regression: Complete Overview with Definitions, Regularization Concepts, and Implementation Steps


Click here to access Theoretical and Practical Material.

Click here to access the Dataset

Click here and access Lasso Regression Basic Code


Lasso Regression (L1 Regularization)

Lasso regression (Least Absolute Shrinkage and Selection Operator) is a linear regression technique that applies L1 regularization by adding a penalty term equal to the sum of the absolute values of the coefficients to the ordinary least squares (OLS) cost function. This method performs variable selection and regularization simultaneously, shrinking less important coefficients to zero to reduce overfitting and enhance model interpretability.

Lasso is used in genomics, finance, and machine learning for tasks requiring model simplicity, multicollinearity handling, and automated feature selection.

$$\arg\min_{\mathbf{w}} \left( \text{MSE}(\mathbf{w}) + \lambda \|\mathbf{w}\|_1 \right)$$

where $\lambda$ controls the regularization strength.
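
A minimal sketch of Lasso's feature-selection behavior (synthetic data; the alpha value is illustrative):

# Lasso drives uninformative coefficients exactly to zero (L1 regularization)
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 informative features

X_scaled = StandardScaler().fit_transform(X)  # the L1 penalty is scale-sensitive
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

print(lasso.coef_)  # coefficients of the 3 noise features shrink to (or near) zero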

Ridge Regression (L2 Regularization):

Ridge regression applies L2 regularization by adding a penalty term equal to the sum of the squared coefficients to the OLS cost function. It shrinks coefficients toward zero but does not eliminate them, making it effective for handling multicollinearity and improving model stability.

Objective Function:

$$\arg\min_{\mathbf{w}} \left( \text{MSE}(\mathbf{w}) + \lambda \|\mathbf{w}\|_2^2 \right)$$

where $\lambda$ controls the strength of regularization.


Regularization techniques add a penalty term to the loss function to discourage complex models that overfit the training data by fitting noise instead of underlying patterns. The penalty term controls the magnitude of the coefficients, reducing variance at the cost of introducing some bias (bias-variance tradeoff). This leads to better generalization on unseen data.

  • L1 penalty (Lasso) encourages sparsity by forcing some coefficients to be exactly zero, effectively performing feature selection.
  • L2 penalty (Ridge) shrinks coefficients uniformly but keeps all features, improving stability especially when predictors are correlated.

The regularization parameter $\lambda$ controls the strength of the penalty:

  • Higher $\lambda$ means stronger shrinkage and simpler models.
  • The optimal $\lambda$ is usually found via cross-validation.

| Regularization Type | Penalization Term | Effect |
|---|---|---|
| L1 (Lasso) | $\lambda \sum_i \|w_i\|$ | Encourages sparsity; some coefficients become exactly zero (feature selection) |
| L2 (Ridge) | $\lambda \sum_i w_i^2$ | Shrinks coefficients towards zero without elimination |
| Elastic Net | $\lambda_1 \sum_i \|w_i\| + \lambda_2 \sum_i w_i^2$ | Combines L1 sparsity with L2 stability |

| Aspect | Lasso (L1) | Ridge (L2) |
|---|---|---|
| Penalty Term | $\lambda \sum_i \|w_i\|$ | $\lambda \sum_i w_i^2$ |
| Feature Selection | Yes (sparse models, some coefficients exactly zero) | No (all coefficients shrunk but retained) |
| Handles Multicollinearity | Partially | Yes (stabilizes correlated predictors) |
| Use Cases | High-dimensional data with sparse signals | Correlated predictors, small sample sizes |
| Coefficients Impact | Some coefficients become zero | Coefficients shrink but remain nonzero |

Step 1: Data Preparation

  • Standardize features to zero mean and unit variance because Ridge regression is sensitive to feature scales.
  • Split dataset into training and testing sets (e.g., 80% train, 20% test).

Step 2: Model Training and Hyperparameter Tuning (Python Example)

from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Split data (assumes X and y are already defined, e.g. loaded from your dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale features
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define candidate lambdas (alphas)
alphas = [0.001, 0.01, 0.1, 1, 10, 100]

# Train Ridge with cross-validation
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train_scaled, y_train)
best_alpha = ridge_cv.alpha_

# Predictions and evaluation
y_pred = ridge_cv.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Best lambda: {best_alpha}, Test RMSE: {rmse}")

Comparing Results - Lasso, Ridge, and Elastic Net Regression

This section demonstrates how to apply and compare three popular linear regression techniques (Lasso, Ridge, and Elastic Net) using Python. The models are evaluated on a real estate dataset to predict property values.


The goal of this project is to:

  • Load and preprocess a real estate dataset.
  • Train and evaluate Lasso, Ridge, and Elastic Net regression models.
  • Compare their performance using Mean Squared Error (MSE) and R² score.
  • Visualize the results for easy interpretation.

Requirements:

  • Python 3.7+
  • pandas
  • numpy
  • matplotlib
  • seaborn
  • scikit-learn

pip install pandas numpy matplotlib seaborn scikit-learn

➢ Get the Dataset

  • The dataset should be in Excel format (.xlsx) and must contain a column named Valor (the target variable).
  • Update the file path in the code as needed:

dados = pd.read_excel('path/to/your/Imoveis.xlsx')

  1. Clone this repository or copy the code.
  2. Ensure your dataset is available and the path is correct.
  3. Install the required libraries.
  4. Run the script:

python your_script_name.py

The script follows these steps:


# Importing essential libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

Setting dark mode for plots


plt.style.use('dark_background')
sns.set_palette("deep")

Loading the dataset

# Replace the file path with the correct one if necessary

dados = pd.read_excel('/Users/fabicampanari/Desktop/class_6-Lasso Regression/project_6-Lasso-Roidge-Elastic-Regression/Imoveis (1).xlsx')


Displaying the first few rows of the dataset

print(dados.head())

Separating predictors (X) and the target variable (y)

X = dados.drop(columns=['Valor'])  # 'Valor' is the target column; adjust the name if your dataset differs
y = dados['Valor']

Splitting the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Standardizing the predictors

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Fitting a Lasso Regression model

lasso = Lasso(alpha=0.1, random_state=42)  # Adjust alpha as needed
lasso.fit(X_train_scaled, y_train)

Making predictions

y_pred_lasso = lasso.predict(X_test_scaled)

Evaluating the model

mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

Printing results

print(f"Lasso Regression - MSE: {mse_lasso}, R2: {r2_lasso}")

Fitting a Ridge Regression model

ridge = Ridge(alpha=0.1, random_state=42)  # Adjust alpha as needed
ridge.fit(X_train_scaled, y_train)

Making predictions

y_pred_ridge = ridge.predict(X_test_scaled)

Evaluating the model

mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

Printing results

print(f"Ridge Regression - MSE: {mse_ridge}, R2: {r2_ridge}")

Fitting an Elastic Net model

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)  # Adjust alpha and l1_ratio as needed
elastic_net.fit(X_train_scaled, y_train)

Making predictions

y_pred_elastic = elastic_net.predict(X_test_scaled)

Evaluating the model

mse_elastic = mean_squared_error(y_test, y_pred_elastic)
r2_elastic = r2_score(y_test, y_pred_elastic)

Printing results

print(f"Elastic Net Regression - MSE: {mse_elastic}, R2: {r2_elastic}")

Comparing the results of the three models

results = pd.DataFrame({
    'Model': ['Lasso', 'Ridge', 'Elastic Net'],
    'MSE': [mse_lasso, mse_ridge, mse_elastic],
    'R2': [r2_lasso, r2_ridge, r2_elastic]
})

print(results)


Plotting the comparison of MSE

plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='MSE', data=results)
plt.title('Comparison of MSE Across Models', fontsize=16)
plt.xlabel('Regression Model', fontsize=14)
plt.ylabel('Mean Squared Error', fontsize=14)
plt.show()

Plotting the comparison of R2

plt.figure(figsize=(10, 6))
sns.barplot(x='Model', y='R2', data=results)
plt.title('Comparison of R2 Across Models', fontsize=16)
plt.xlabel('Regression Model', fontsize=14)
plt.ylabel('R2 Score', fontsize=14)
plt.show()

11. Brains Made of Code: Regression Training with Gradient Descent and Stochastic Optimization Algorithms


Click here to access Theoretical and Practical Material.

Click here to access the Dataset

Click here to access the code implementations for Gradient Descent and Stochastic Gradient Descent


This section provides a concise overview of the fundamental concepts behind Artificial Neural Networks (ANNs) and the Gradient Descent algorithm used for training them. It introduces key ideas such as the perceptron model, multilayer perceptrons (MLPs), error correction learning, and the backpropagation algorithm.
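
As a small taste of what the full repository covers, here is a minimal sketch (synthetic data; learning rate and iteration count are illustrative) of training a simple linear regression with batch gradient descent:

# Batch gradient descent for simple linear regression (synthetic data)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)  # true slope 2.5, intercept 1.0

beta0, beta1 = 0.0, 0.0  # initial parameters
lr = 0.01                # learning rate

for _ in range(2000):
    error = beta0 + beta1 * x - y
    # Gradients of the MSE cost with respect to beta0 and beta1
    beta0 -= lr * 2 * error.mean()
    beta1 -= lr * 2 * (error * x).mean()

print(f"Learned intercept: {beta0:.3f}, slope: {beta1:.3f}")  # close to 1.0 and 2.5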

For a complete and detailed explanation, including all theoretical background, mathematical derivations, and implementation code, please refer to the dedicated repository Brains Made of Code: Regression Training with Gradient Descent and Stochastic Optimization Algorithms, which contains the full content and examples.

This summary is part of a larger collection of topics related to neural networks and machine learning, designed to provide both conceptual understanding and practical tools.



Copyright 2025 Quantum Software Development. Code released under the MIT License.