Pregnancies | Number of pregnancies |
Glucose | 2-hour plasma glucose concentration in an oral glucose tolerance test |
BloodPressure | Blood Pressure (Diastolic) (mm Hg) |
SkinThickness | Skin Thickness |
Insulin | 2-hour serum insulin (mu U/ml) |
DiabetesPedigreeFunction | Function that scores the likelihood of diabetes based on family history |
BMI | Body Mass Index Value |
Age | Age |
Outcome | Whether the person has diabetes (1) or not (0) |
Step 2: Identify numerical and categorical variables.
Step 3: Analyze numerical and categorical variables.
Step 4: Analyze the target variable (the mean of the target variable for each categorical variable, and the mean of each numerical variable for each target class).
Step 5: Perform outlier analysis.
Step 6: Carry out missing data analysis.
Step 7: Perform correlation analysis.
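
Steps 2 through 7 above can be sketched in pandas as follows. This is a minimal sketch, not a prescribed solution: the small DataFrame is made-up toy data standing in for the real dataset, the helper names (`split_columns`, `outlier_bounds`, `has_outliers`) and the `cat_th` threshold are hypothetical, and the 1.5 × IQR outlier rule is an assumption, since the task does not name a specific method.

```python
import pandas as pd

# Toy stand-in for the dataset: all values below are made up for illustration.
df = pd.DataFrame({
    "Pregnancies": [2, 0, 5, 1, 3, 7, 0, 4],
    "Glucose": [120, 95, 160, 0, 110, 175, 88, 140],
    "BloodPressure": [70, 64, 82, 60, 72, 90, 58, 76],
    "SkinThickness": [25, 0, 33, 20, 0, 40, 18, 30],
    "Insulin": [80, 0, 150, 0, 95, 600, 0, 130],
    "BMI": [27.5, 22.1, 34.0, 24.3, 29.8, 38.2, 21.0, 31.6],
    "DiabetesPedigreeFunction": [0.35, 0.20, 0.75, 0.15, 0.40, 1.10, 0.25, 0.55],
    "Age": [29, 22, 45, 24, 33, 52, 21, 38],
    "Outcome": [0, 0, 1, 0, 0, 1, 0, 1],
})

# Step 2: split columns into categorical and numerical (hypothetical heuristic:
# non-numeric columns, or numeric columns with fewer than cat_th distinct values,
# are treated as categorical).
def split_columns(frame, cat_th=3):
    cat_cols = [c for c in frame.columns
                if not pd.api.types.is_numeric_dtype(frame[c])
                or frame[c].nunique() < cat_th]
    num_cols = [c for c in frame.columns if c not in cat_cols]
    return cat_cols, num_cols

cat_cols, num_cols = split_columns(df)

# Step 3: summarize numerical columns and count categorical levels.
summary = df[num_cols].describe().T
class_counts = df["Outcome"].value_counts()

# Step 4: mean of each numerical variable per target class.
target_means = df.groupby("Outcome")[num_cols].mean()

# Step 5: flag columns with values outside the 1.5 * IQR fences (assumed rule).
def outlier_bounds(frame, col, q1=0.25, q3=0.75):
    low_q, high_q = frame[col].quantile(q1), frame[col].quantile(q3)
    iqr = high_q - low_q
    return low_q - 1.5 * iqr, high_q + 1.5 * iqr

def has_outliers(frame, col):
    low, up = outlier_bounds(frame, col)
    return bool(((frame[col] < low) | (frame[col] > up)).any())

outlier_flags = {col: has_outliers(df, col) for col in num_cols}

# Step 6: explicit missing values, plus zeros that may hide missing measurements.
missing_counts = df.isnull().sum()
zero_counts = (df[num_cols] == 0).sum()

# Step 7: pairwise Pearson correlations between all columns.
corr = df.corr()
```

With real data, the `cat_th` threshold would normally be larger (the toy frame has only eight rows), and `zero_counts` is worth inspecting before deciding which zeros to treat as missing.
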
Step 1: Handle missing values and outliers. The dataset contains no explicitly missing observations, but a value of 0 in variables such as Glucose or Insulin likely indicates a missing measurement, since a person's glucose or insulin level cannot be 0. Consider replacing these zeros with NaN and then applying the usual missing-value handling.
Step 2: Create new features.
Step 3: Perform encoding operations.
Step 4: Standardize numerical variables.
Step 5: Build a model.
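
The feature engineering steps above can be sketched as follows, using only pandas and NumPy. This is a sketch under stated assumptions: the DataFrame is made-up toy data, median imputation is one possible choice among many, the `BMI_CAT` feature and its cutoffs are a hypothetical example of a new feature, and standardization is done by hand with the population standard deviation. The task does not name a model, so the sketch stops at preparing `X` and `y` for whichever classifier is chosen.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset: all values below are made up for illustration.
df = pd.DataFrame({
    "Pregnancies": [2, 0, 5, 1, 3, 7, 0, 4],
    "Glucose": [120, 95, 160, 0, 110, 175, 88, 140],
    "BloodPressure": [70, 64, 82, 60, 72, 90, 58, 76],
    "SkinThickness": [25, 0, 33, 20, 0, 40, 18, 30],
    "Insulin": [80, 0, 150, 0, 95, 600, 0, 130],
    "BMI": [27.5, 22.1, 34.0, 24.3, 29.8, 38.2, 21.0, 31.6],
    "DiabetesPedigreeFunction": [0.35, 0.20, 0.75, 0.15, 0.40, 1.10, 0.25, 0.55],
    "Age": [29, 22, 45, 24, 33, 52, 21, 38],
    "Outcome": [0, 0, 1, 0, 0, 1, 0, 1],
})

# Step 1: treat impossible zeros as missing, then impute.
# Only the two columns named in the task are listed; extend the list as needed.
zero_as_missing = ["Glucose", "Insulin"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
n_missing = df[zero_as_missing].isna().sum()
# Median imputation is an assumption; other strategies are equally valid.
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())

# Step 2: create a new feature (hypothetical example: BMI category,
# with left-closed bins so that 25.0 counts as "overweight").
df["BMI_CAT"] = pd.cut(
    df["BMI"],
    bins=[0, 18.5, 25, 30, 100],
    right=False,
    labels=["underweight", "normal", "overweight", "obese"],
)

# Step 3: one-hot encode the new categorical feature, dropping the first level.
df = pd.get_dummies(df, columns=["BMI_CAT"], drop_first=True, dtype=int)

# Step 4: standardize numerical columns to zero mean and unit variance.
num_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std(ddof=0)

# Step 5: separate features and target, ready to pass to a classifier.
X = df.drop(columns="Outcome")
y = df["Outcome"]
```

In a real workflow the imputation values and the scaling statistics would be computed on the training split only and then applied to the test split, to avoid leaking information into the evaluation.
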