BREAST CANCER PREDICTION (KNN ALGORITHM)
Machine learning methods were used to construct and analyse the performance of selected algorithms for breast cancer diagnosis. The K-Nearest Neighbour (KNN) algorithm was applied to the Wisconsin Breast Cancer dataset, and its precision, recall, accuracy, sensitivity, and specificity were computed. Precision and recall were considerably improved by applying the following steps: 1. Feature scaling 2. Dimensionality reduction 3. Cross validation 4. Hyperparameter optimization.
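The four steps above can be sketched as a single scikit-learn pipeline on the Wisconsin dataset. This is a minimal illustration, not the paper's exact setup: the parameter grid, number of components, and fold count are assumptions.

```python
# Hedged sketch: scaling -> PCA -> KNN, tuned with cross-validated grid
# search on the Wisconsin breast cancer dataset. Grid values are assumed.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),    # 1. feature scaling
    ("pca", PCA(n_components=2)),   # 2. dimensionality reduction
    ("knn", KNeighborsClassifier()),
])
# 3./4. cross validation + hyperparameter optimization in one search
grid = GridSearchCV(pipe, {"knn__n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.score(X_test, y_test), 3))
```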
METHODOLOGY USED
A. KNN (K-Nearest Neighbour) The KNN algorithm is a supervised machine learning algorithm: given labelled data, it can solve both classification and regression problems. A supervised algorithm relies on labelled input data to learn a function that produces a suitable output when new, unlabelled data is given. In classification, an object is assigned a class based on the k nearest training examples in the feature space. The principle behind KNN is the assumption that similar data points lie close to each other. It avoids the need to build an explicit model, fit parameters, or make further assumptions.
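The nearest-neighbour principle can be shown in a few lines of pure Python: a query point is classified by a majority vote of its k nearest labelled neighbours. The toy points below are made up for illustration only.

```python
# Minimal KNN classifier: majority vote among the k nearest neighbours
# by Euclidean distance. Toy 2-D points; labels are illustrative.
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label); returns majority label of k nearest."""
    by_dist = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

points = [((1, 1), "benign"), ((1, 2), "benign"), ((2, 1), "benign"),
          ((8, 8), "malignant"), ((9, 8), "malignant"), ((8, 9), "malignant")]
print(knn_predict(points, (2, 2)))   # query near the benign cluster
print(knn_predict(points, (8, 7)))   # query near the malignant cluster
```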
B. Cross Validation Cross-validation is a model validation technique used to assess how reliably a predictive model will perform in practice. The objective of cross-validation is to set aside part of the data set to test the model during the training phase, in order to detect overfitting and underfitting and to estimate how the model will generalize to independent data. The validation and training sets must come from the same distribution; otherwise the estimate will not be accurate.
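A short sketch of k-fold cross-validation with scikit-learn: the data is split into 5 folds, and the model is trained and validated 5 times, each time holding out a different fold. The choice of 5 folds and k=5 neighbours here is an assumption for illustration.

```python
# 5-fold cross-validation of a scaled KNN classifier on the
# Wisconsin breast cancer dataset; fold/neighbour counts are assumed.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=5)   # one accuracy per fold
print(scores.mean())                          # average held-out accuracy
```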
C. Dimensionality Reduction Dimensionality reduction is a technique that reduces the number of independent variables to a smaller set of derived variables by discarding those that contribute little to predicting the outcome. It is often used to obtain two-dimensional data so that machine learning models can be visualized by plotting their prediction regions: however many independent variables there are initially, a suitable dimensionality reduction technique leaves only two. There are two approaches, feature selection and feature extraction (projection).
D. Feature Selection Feature selection finds a subset of the original features, chosen by various methods according to the information the features provide, their accuracy, and their error rates.
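One common concrete method, shown here only as an example (the paper does not name a specific selector), is univariate selection: keep the k original features most associated with the label.

```python
# Illustrative feature selection: keep the 5 original features with the
# highest ANOVA F-score against the label. SelectKBest and k=5 are
# assumptions for the sketch, not the paper's stated method.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

data = load_breast_cancer()
selector = SelectKBest(f_classif, k=5).fit(data.data, data.target)
kept = data.feature_names[selector.get_support()]
print(list(kept))   # names of the 5 retained original features
```

Unlike projection, the retained columns are original features, so they keep their physical meaning.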
E. Feature Projection Feature projection transforms data from a high-dimensional space into a lower-dimensional space (with fewer attributes). Depending on the form of the relationships between the features in the dataset, either linear or nonlinear reduction techniques can be used.
F. Principal Component Analysis (PCA) Principal Component Analysis (PCA) is an unsupervised linear dimensionality reduction algorithm: it does not use the output labels, and the criterion it maximises is variance. It finds the strongest directions from the covariance matrix of the dataset, reducing a large number of dimensions to a few.
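The PCA step can be sketched as follows: standardize the 30 Wisconsin features, then project onto the 2 directions of maximum variance. The choice of 2 components follows the visualization motivation above.

```python
# PCA sketch: standardize, then project the 30-dimensional Wisconsin
# data onto its top 2 principal components (directions of max variance).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)   # labels unused: unsupervised
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(X_2d.shape)                      # (569, 2)
print(pca.explained_variance_ratio_)   # variance captured per component
```

Components are ordered by explained variance, so the first always captures at least as much as the second.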