[🇧🇷 Português] [🇺🇸 English]
7- Data Mining / Regression Techniques with Data Integration
Institution: Pontifical Catholic University of SΓ£o Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva, PhD in Mathematics
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes hands-on experience with real datasets, simulating professional Data Analysis and Data Mining consulting scenarios for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4
📺 For better resolution, watch the video on YouTube.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access the Data Mining Main Repository
This repository covers fundamental concepts and practical techniques in Data Mining focused on clustering (grouping by similarity), various types of regression for modeling data trends, and the crucial steps for data integration and preprocessing. Each section includes theoretical explanations, use case examples, mathematical formulations using LaTeX, and Python code snippets to assist practical understanding.
Clustering is a technique to group data points that are similar and separate those that are different. Similarity is often measured numerically using distances in a coordinate space.
The most commonly used distance metric is the Euclidean distance between two points a and b:
d(a, b) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2}
This distance represents the shortest straight line distance between two points in n-dimensional space, calculated using the Pythagorean theorem.
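As a quick sanity check, the formula can be computed directly with NumPy; the two points below are purely illustrative values, not taken from the course data.

```python
import numpy as np

# Two illustrative points in 3-dimensional space (hypothetical values)
a = np.array([25, 4200.0, 61])   # e.g., age, income, spending score
b = np.array([31, 3900.0, 48])

# Euclidean distance: square root of the sum of squared coordinate differences
d = np.sqrt(np.sum((a - b) ** 2))
print(d)  # same result as np.linalg.norm(a - b)
```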
Suppose you want to segment customers into groups based on age, income, and spending score to personalize marketing strategies. Clustering algorithms group similar customers by minimizing their Euclidean distances within clusters.
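A minimal sketch of this idea with scikit-learn's KMeans; the column names (age, income, spending_score), the values, and the number of clusters are illustrative assumptions, not the course dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customer data; in practice this would come from a CSV file
customers = pd.DataFrame({
    'age':            [22, 25, 47, 52, 46, 56, 30, 33],
    'income':         [1500, 1800, 5200, 6100, 4800, 7000, 2500, 2700],
    'spending_score': [80, 75, 30, 20, 35, 15, 60, 65],
})

# Rescale the features so no single attribute dominates the Euclidean distance
X = MinMaxScaler().fit_transform(customers)

# Group customers into 2 clusters by minimizing within-cluster distances
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
customers['cluster'] = kmeans.fit_predict(X)
print(customers)
```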
Regression models describe the relationship between a dependent variable and one or more independent variables, often used to predict trends or outcomes.
Before selecting a regression model, it is essential to examine a scatter plot of the data to understand how the points are distributed and choose the best-fitting model. The scatter plot reveals whether the data follows a linear, exponential, polynomial, or logistic pattern.
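For instance, a quick visual check with matplotlib; the x and y arrays below are synthetic and only illustrate the inspection step.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data used only to illustrate the visual check
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.normal(scale=2.0, size=x.size)

# The shape of the point cloud suggests which regression model to try
plt.scatter(x, y)
plt.xlabel('independent variable (x)')
plt.ylabel('dependent variable (y)')
plt.title('Scatter plot used to choose a regression model')
plt.show()
```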
| Regression Type | Description | Equation / Model |
|---|---|---|
| Linear Regression | Models the data with a straight line; assumes a linear relationship. | $y = \beta_0 + \beta_1 x$ |
| Polynomial Regression | Fits a polynomial curve; useful for nonlinear trends. | $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$ |
| Logarithmic Regression | Models data with logarithmic growth or decay. | $y = \beta_0 + \beta_1 \ln(x)$ |
| Exponential Regression | Models exponential increase or decrease. | $y = \beta_0 e^{\beta_1 x}$ |
| Logistic Regression | Used for predicting binary outcomes; S-shaped curve. | $p = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$ |
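As an illustration, a linear fit and a degree-2 polynomial fit can be compared with scikit-learn; the data below is synthetic and serves only as an example of the workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Synthetic data with a mild curvature
x = np.linspace(1, 10, 30).reshape(-1, 1)
y = 1.5 * x.ravel() ** 2 - 4 * x.ravel() + np.random.normal(scale=5, size=30)

# Linear fit: y = b0 + b1 * x
linear = LinearRegression().fit(x, y)
print('linear R^2:', r2_score(y, linear.predict(x)))

# Polynomial fit of degree 2: y = b0 + b1 * x + b2 * x^2
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)
print('degree-2 R^2:', r2_score(y, poly.predict(x_poly)))
```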
In real-world applications, data often comes from various sources and needs to be integrated into a consistent dataset ready for mining and modeling.
Redundancy occurs when the same data appears multiple times, which can bias analysis or cause inconsistencies. A common case is duplicate rows in datasets.
# Code to detect duplicates and remove them
import pandas as pd
df = pd.read_csv('/content/cancer.csv')
print(df.duplicated().value_counts())
# Remove duplicates
df_clean = df.drop_duplicates()
Data conflicts arise when different sources provide inconsistent values for the same entity, for example, distances in kilometers vs miles or weights in kilograms vs pounds. Resolving conflicts often requires standardization to consistent units and formats.
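A small sketch of this kind of standardization with pandas; the column names, route labels, and values below are illustrative assumptions.

```python
import pandas as pd

# Two hypothetical sources reporting the same measurement in different units
source_a = pd.DataFrame({'route': ['R1', 'R2'], 'distance_km': [12.0, 40.2]})
source_b = pd.DataFrame({'route': ['R3', 'R4'], 'distance_mi': [7.5, 25.0]})

# Standardize source B to kilometers (1 mile = 1.609344 km) before integrating
source_b['distance_km'] = source_b['distance_mi'] * 1.609344
source_b = source_b.drop(columns=['distance_mi'])

# Both sources now share the same unit and can be concatenated safely
integrated = pd.concat([source_a, source_b], ignore_index=True)
print(integrated)
```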
Compression reduces data dimensionality or size to improve efficiency without losing essential information. Two main approaches:
- Attribute Compression: encoding or transforming attributes (e.g., PCA).
- Data Reduction: removing or summarizing instances or attributes.
PCA transforms correlated variables into a set of linearly uncorrelated components that capture the most variance in the data.
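A minimal PCA sketch with scikit-learn, reusing the cancer dataset path and column names from the snippets in this repository; treat them as assumptions about the file layout.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('/content/cancer.csv')
X = df.drop(columns=['id', 'diagnosis'])  # keep only the numeric attributes

# PCA is sensitive to scale, so standardize the attributes first
X_scaled = StandardScaler().fit_transform(X)

# Keep only as many uncorrelated components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(X_scaled)
print(components.shape)                     # (n_samples, n_components_kept)
print(pca.explained_variance_ratio_.sum())  # total variance retained
```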
Standardization addresses differences in units and scales, converting data to a common formatβfor example, capitalizing strings to avoid case-sensitive mismatches or converting dates to a consistent format.
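For example, a minimal sketch with made-up records showing both fixes:

```python
import pandas as pd

# Hypothetical records with inconsistent capitalization and dates stored as text
df = pd.DataFrame({
    'city': ['sao paulo', 'SAO PAULO', 'Sao Paulo'],
    'signup_date': ['01/03/2025', '15/04/2025', '30/05/2025'],
})

# Standardize strings to a single case so the three spellings match
df['city'] = df['city'].str.upper()

# Convert the text dates to a consistent datetime type (day/month/year here)
df['signup_date'] = pd.to_datetime(df['signup_date'], format='%d/%m/%Y')
print(df)
```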
Normalization rescales numerical attributes so all variables have the same domain, essential for distance-based algorithms like clustering or neural networks.
Given an attribute value a, with observed minimum min_a and maximum max_a, min-max normalization maps it to the new range [new_min, new_max]:
a' = new\_min + \frac{(a - min_a)(new\_max - new\_min)}{max_a - min_a}
# Min-max normalization of the numeric columns to the range [0, 1]
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv('/content/cancer.csv')
scaler = MinMaxScaler(feature_range=(0, 1))
cols = list(df.columns)
cols.remove('id') # remove non-numeric or ID columns
cols.remove('diagnosis')
df[cols] = scaler.fit_transform(df[cols])
1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.
2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd Ed. LTC.
3. Larson & Farber (2015). Applied Statistics. Pearson.
My Contacts Hub
Back to Top
Copyright 2025 Quantum Software Development. Code released under the MIT License.