[🇧🇷 Português] [🇺🇸 English]
7- Data Mining / Regression Techniques with Data Integration
Institution: Pontifical Catholic University of SΓ£o Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Prof. Dr. Daniel Rodrigues da Silva, PhD in Mathematics
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes hands-on experience with real datasets, simulating professional Data Analysis and Data Mining consulting scenarios for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
🎶 Prelude Suite no.1 (J. S. Bach) - Sound Design Remix
Statistical.Measures.and.Banking.Sector.Analysis.at.Bovespa.mp4
📺 For better resolution, watch the video on YouTube.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access the Data Mining Main Repository
This repository covers fundamental concepts and practical techniques in Data Mining focused on clustering (grouping by similarity), various types of regression for modeling data trends, and the crucial steps for data integration and preprocessing. Each section includes theoretical explanations, use case examples, mathematical formulations using LaTeX, and Python code snippets to assist practical understanding.
Clustering is a technique to group data points that are similar and separate those that are different. Similarity is often measured numerically using distances in a coordinate space.
The most commonly used distance metric is the Euclidean distance between two points a and b:
d(a, b) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2}
This distance represents the shortest straight line distance between two points in n-dimensional space, calculated using the Pythagorean theorem.
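As a quick sanity check, the formula can be computed directly with NumPy; the two points below are purely illustrative values, not taken from the course data.

```python
import numpy as np

# Two illustrative points in 3-dimensional space (hypothetical values)
a = np.array([25, 4200.0, 61])   # e.g., age, income, spending score
b = np.array([31, 3900.0, 48])

# Euclidean distance: square root of the sum of squared coordinate differences
d = np.sqrt(np.sum((a - b) ** 2))
print(d)  # same result as np.linalg.norm(a - b)
```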
Suppose you want to segment customers into groups based on age, income, and spending score to personalize marketing strategies. Clustering algorithms group similar customers by minimizing their Euclidean distances within clusters.
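A minimal sketch of this idea with scikit-learn's KMeans; the column names (age, income, spending_score), the values, and the number of clusters are illustrative assumptions, not the course dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical customer data; in practice this would come from a CSV file
customers = pd.DataFrame({
    'age':            [22, 25, 47, 52, 46, 56, 30, 33],
    'income':         [1500, 1800, 5200, 6100, 4800, 7000, 2500, 2700],
    'spending_score': [80, 75, 30, 20, 35, 15, 60, 65],
})

# Rescale the features so no single attribute dominates the Euclidean distance
X = MinMaxScaler().fit_transform(customers)

# Group customers into 2 clusters by minimizing within-cluster distances
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
customers['cluster'] = kmeans.fit_predict(X)
print(customers)
```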
Regression models describe the relationship between a dependent variable and one or more independent variables, often used to predict trends or outcomes.
Before selecting a regression model, it is essential to examine a scatter plot of the data to understand how the points are distributed and choose the best-fitting model. The scatter plot reveals whether the data follows a linear, exponential, polynomial, or logistic pattern.
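For instance, a quick visual check with matplotlib; the x and y arrays below are synthetic and only illustrate the inspection step.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data used only to illustrate the visual check
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.normal(scale=2.0, size=x.size)

# The shape of the point cloud suggests which regression model to try
plt.scatter(x, y)
plt.xlabel('independent variable (x)')
plt.ylabel('dependent variable (y)')
plt.title('Scatter plot used to choose a regression model')
plt.show()
```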
| Regression Type | Description | Equation / Model |
|---|---|---|
| Linear Regression | Models the data with a straight line; assumes a linear relationship. | $y = \beta_0 + \beta_1 x$ |
| Polynomial Regression | Fits a polynomial curve; useful for nonlinear trends. | $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n$ |
| Logarithmic Regression | Models data with logarithmic growth or decay. | $y = \beta_0 + \beta_1 \ln(x)$ |
| Exponential Regression | Models exponential increase or decrease. | $y = \beta_0 e^{\beta_1 x}$ |
| Logistic Regression | Used for predicting binary outcomes; S-shaped curve. | $p = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$ |
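As an illustration, a linear fit and a degree-2 polynomial fit can be compared with scikit-learn; the data below is synthetic and serves only as an example of the workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Synthetic data with a mild curvature
x = np.linspace(1, 10, 30).reshape(-1, 1)
y = 1.5 * x.ravel() ** 2 - 4 * x.ravel() + np.random.normal(scale=5, size=30)

# Linear fit: y = b0 + b1 * x
linear = LinearRegression().fit(x, y)
print('linear R^2:', r2_score(y, linear.predict(x)))

# Polynomial fit of degree 2: y = b0 + b1 * x + b2 * x^2
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
poly = LinearRegression().fit(x_poly, y)
print('degree-2 R^2:', r2_score(y, poly.predict(x_poly)))
```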
In real-world applications, data often comes from various sources and needs to be integrated into a consistent dataset ready for mining and modeling.
Redundancy occurs when the same data appears multiple times, which can bias analysis or cause inconsistencies. A common case is duplicate rows in datasets.
# Code to detect duplicates and remove them
import pandas as pd
df = pd.read_csv('/content/cancer.csv')
print(df.duplicated().value_counts())
# Remove duplicates
df_clean = df.drop_duplicates()
Data conflicts arise when different sources provide inconsistent values for the same entity, for example, distances in kilometers vs miles or weights in kilograms vs pounds. Resolving conflicts often requires standardization to consistent units and formats.
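A small sketch of this kind of standardization with pandas; the column names, route labels, and values below are illustrative assumptions.

```python
import pandas as pd

# Two hypothetical sources reporting the same measurement in different units
source_a = pd.DataFrame({'route': ['R1', 'R2'], 'distance_km': [12.0, 40.2]})
source_b = pd.DataFrame({'route': ['R3', 'R4'], 'distance_mi': [7.5, 25.0]})

# Standardize source B to kilometers (1 mile = 1.609344 km) before integrating
source_b['distance_km'] = source_b['distance_mi'] * 1.609344
source_b = source_b.drop(columns=['distance_mi'])

# Both sources now share the same unit and can be concatenated safely
integrated = pd.concat([source_a, source_b], ignore_index=True)
print(integrated)
```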
Compression reduces data dimensionality or size to improve efficiency without losing essential information. Two main approaches:
- Attribute Compression: encoding or transforming attributes (e.g., PCA).
- Data Reduction: removing or summarizing instances or attributes.
PCA transforms correlated variables into a set of linearly uncorrelated components that capture the most variance in the data.
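A minimal PCA sketch with scikit-learn, reusing the cancer dataset path and column names from the snippets in this repository; treat them as assumptions about the file layout.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('/content/cancer.csv')
X = df.drop(columns=['id', 'diagnosis'])  # keep only the numeric attributes

# PCA is sensitive to scale, so standardize the attributes first
X_scaled = StandardScaler().fit_transform(X)

# Keep only as many uncorrelated components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(X_scaled)
print(components.shape)                     # (n_samples, n_components_kept)
print(pca.explained_variance_ratio_.sum())  # total variance retained
```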
Standardization addresses differences in units and scales, converting data to a common formatβfor example, capitalizing strings to avoid case-sensitive mismatches or converting dates to a consistent format.
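For example, a minimal sketch with made-up records showing both fixes:

```python
import pandas as pd

# Hypothetical records with inconsistent capitalization and dates stored as text
df = pd.DataFrame({
    'city': ['sao paulo', 'SAO PAULO', 'Sao Paulo'],
    'signup_date': ['01/03/2025', '15/04/2025', '30/05/2025'],
})

# Standardize strings to a single case so the three spellings match
df['city'] = df['city'].str.upper()

# Convert the text dates to a consistent datetime type (day/month/year here)
df['signup_date'] = pd.to_datetime(df['signup_date'], format='%d/%m/%Y')
print(df)
```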
Normalization rescales numerical attributes so all variables have the same domain, essential for distance-based algorithms like clustering or neural networks.
Given an attribute value a, with observed minimum min_a and maximum max_a, min-max normalization maps it to the new range [new_min, new_max]:
a' = new\_min + \frac{(a - min_a)(new\_max - new\_min)}{max_a - min_a}
# Min-max normalization of the numeric columns to the range [0, 1]
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv('/content/cancer.csv')
scaler = MinMaxScaler(feature_range=(0, 1))
cols = list(df.columns)
cols.remove('id') # remove non-numeric or ID columns
cols.remove('diagnosis')
df[cols] = scaler.fit_transform(df[cols])
1. Castro, L. N. & Ferrari, D. G. (2016). Introduction to Data Mining: Basic Concepts, Algorithms, and Applications. Saraiva.
2. Ferreira, A. C. P. L. et al. (2024). Artificial Intelligence: A Machine Learning Approach. 2nd Ed. LTC.
3. Larson & Farber (2015). Applied Statistics. Pearson.
My Contacts Hub
Back to Top
Copyright 2025 Quantum Software Development. Code released under the MIT License.