Heart disease prediction model with k-nearest neighbor algorithm

Received Jan 12, 2021 Revised Aug 20, 2021 Accepted Sep 2, 2021

In this study, the author proposes a k-nearest neighbor (KNN) based heart disease prediction model and conducts an experiment to evaluate its predictive performance. The heart disease data are obtained from the Kaggle machine learning data repository. The dataset consists of 1025 observations, of which 499 (48.68%) are heart disease negative and 526 (51.32%) are heart disease positive. The performance of the KNN algorithm is analyzed on the test set, and the results on the Kaggle heart disease data show that the accuracy of the KNN model is 91.99%.


INTRODUCTION
Heart disease is a condition in which a waxy substance is formed in the coronary arteries. This accumulation of waxy plaque in the arteries slows the pumping of blood and eventually causes death if not treated [1]. Heart disease is one of the leading causes of illness and mortality worldwide, and the prediction of cardiovascular disease is regarded as one of the most important research areas in clinical data analysis.
Nowadays, the amount of data held in healthcare centers is large. Machine learning algorithms are widely used in object recognition and disease diagnosis [2]. In disease diagnosis, a machine learning algorithm turns a large collection of healthcare data into information that can support better decisions and predictions. Predicting disease and developing machine-based diagnostic systems is one of the goals of machine learning research, and it has gained importance in medical research as a way of supporting health experts, thereby improving precision and accuracy in decision making during the identification and diagnosis of a disease [3]-[15].
One of the major problems in heart disease diagnosis is error during the diagnosis process. These errors occur due to a lack of experienced specialists who can accurately and precisely identify heart disease. The literature survey [1]-[25] shows that heart disease is still a serious issue that needs further research in order to address the mortality rate caused by the disease. In this research, we propose a heart disease prediction model based on the k-nearest neighbor (KNN) algorithm. The research aims to answer the following questions: i) What is the right distance measure that produces the optimal accuracy for KNN on heart disease prediction? ii) What is the performance of the KNN algorithm on the prediction of heart disease? iii) What is the effect of the number of neighbors on the predictive accuracy of KNN on heart disease prediction?

RELATED WORK
Numerous research works have focused on heart disease identification using machine learning algorithms. These works applied different machine learning algorithms to develop prediction models for the classification of heart disease. Some of the previous works on heart disease prediction are discussed in this section.
In Gavhane et al. [4], Naïve Bayes, decision tree and random forest algorithms are applied to the Cleveland heart disease dataset. The predictive performance of the algorithms is evaluated on the test dataset, and the random forest algorithm outperformed the decision tree and Naïve Bayes algorithms.
In Hasan et al. [5], the Gaussian Naïve Bayes algorithm is applied to the online University of California, Irvine (UCI) heart disease data repository. The algorithm is evaluated on its predictive accuracy, and the experimental analysis shows that the highest accuracy achieved by Gaussian Naïve Bayes on the prediction of heart disease is 84.05%.
In Ambekar and Phalnikar [6], a comparative analysis of the predictive performance of machine learning algorithms such as Gaussian Naïve Bayes, logistic regression, random forest and KNN is conducted on a heart disease dataset. The comparison shows that logistic regression outperformed the other algorithms in prediction accuracy.
In Pawlovsky [7], a heart disease prediction model is proposed using a convolutional neural network (CNN). The accuracy of the proposed model is evaluated on a test dataset, and the analysis shows that the CNN achieved a prediction accuracy of 65%.
In Zunaidi et al. [8], KNN is applied to heart disease observations collected from Wisconsin. The authors compared the performance of linear and non-linear support vector machines. The performance analysis shows that KNN has a predictive accuracy of 84.8% on the heart disease classification problem.
In Jothi et al. [9], a comparative study of machine learning algorithms, namely decision tree, random forest and multi-layer perceptron, is conducted on the Wisconsin heart disease data repository. The algorithms are evaluated on their accuracy in heart disease prediction, and the result shows that the multi-layer perceptron neural network performs better.
In Jabbar et al. [10], a support vector machine is applied to a heart disease data repository to develop a heart disease prediction model. The authors applied feature selection to improve the prediction performance of the proposed model, and the result shows that the model has an accuracy of 56.16%.
In Assegie et al. [11], Naïve Bayes is applied to the Wisconsin heart disease data repository to predict heart disease. The maximum prediction accuracy achieved by this model is 87%. A prediction accuracy of 87% is regarded as acceptable for a machine learning prediction system; hence, the Naïve Bayes model is considered to perform well on the prediction of heart disease.

RESEARCH METHOD
In this research, the researcher collected heart disease data from the Kaggle data repository for training and testing the proposed KNN model. The Python 3.7 programming language is used for implementation and experimental testing. Pearson's correlation analysis, data visualization and feature relationship measures are used to explore and interpret the heart disease data and to find the relationship between the class and the features in the observations. To develop the heart disease prediction model, the researcher employed the KNN algorithm. Figure 1 shows the heart disease distribution in the dataset.
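The KNN classification step can be illustrated with a minimal standard-library sketch. This is not the implementation used in the study: the function names (`euclidean`, `manhattan`, `knn_predict`), the two distance measures and the toy observations are all assumptions made for illustration. A query point is assigned the majority label of its k closest training points.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Return the majority label among the k training points closest to query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up observations (two features each); not taken from the dataset.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_near_zero = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_near_one = knn_predict(train_X, train_y, (3.0, 3.0), k=3, distance=manhattan)
print(pred_near_zero, pred_near_one)
```

Passing the distance function as a parameter makes it straightforward to compare distance measures, which is the subject of the first research question.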

Dataset description
The Kaggle heart disease data repository used in this study consists of 1025 observations and 13 features. Among the 1025 observations, 499 (48.68%) are heart disease negative and 526 (51.32%) are heart disease positive. The dataset has no missing feature values. Table 1 summarizes the features of the heart disease dataset used for training and testing the KNN model. Of the observations, 80% are used for training and 20% for testing. Figure 2 shows the distribution of heart disease patients against the non-patient class for the different slope conditions: up sloping, down sloping and flat. As shown in Figure 2, the number of patients is higher when the patient's ST-T wave is up sloping.
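The class proportions and the sizes of the 80%/20% partitions follow directly from the reported counts. The sketch below reproduces that arithmetic and shows one possible way to draw a random split. The text does not state how the split was drawn, so the shuffling procedure, the seed and the helper name `train_test_split_indices` are assumptions.

```python
import random

# Counts reported in the text.
n_negative, n_positive = 499, 526
n_total = n_negative + n_positive

pct_negative = round(100 * n_negative / n_total, 2)
pct_positive = round(100 * n_positive / n_total, 2)


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(n_total)
print(n_total, pct_negative, pct_positive, len(train_idx), len(test_idx))
```

With 1025 observations, an 80%/20% split yields 820 training and 205 test observations.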

Feature correlation model
The author employed Pearson's correlation analysis to visualize the relationship between each pair of features. This helps to identify the features that are most strongly related to the class feature in the data repository. The Pearson's correlation matrix for the features of the heart disease dataset is shown in Figure 3. As illustrated in Figure 3, some pairs of features are positively correlated, although the coefficients are modest in absolute terms. For instance, age and resting blood pressure (trestbps) have a correlation coefficient of 0.27. Similarly, cholesterol is correlated with age with a coefficient of 0.22, and the number of major vessels is correlated with age with a coefficient of 0.25. Slope and maximum heart rate achieved have a correlation coefficient of 0.4. In contrast, features such as resting electrocardiogram and exercise-induced angina are negatively correlated with age.
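As a reminder of what each entry of such a matrix represents, the sketch below computes Pearson's coefficient from its definition (covariance divided by the product of the standard deviations) and assembles a small correlation matrix. The helper names and the numeric values are made up for illustration; they are not taken from the dataset and will not reproduce the coefficients in Figure 3.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for three of the feature names used in the text.
columns = {
    "age": [29, 41, 45, 52, 58, 63, 67],
    "trestbps": [130, 120, 138, 125, 150, 135, 145],
    "chol": [204, 250, 199, 260, 230, 275, 240],
}
matrix = correlation_matrix(columns)
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])
r_inverse = pearson_r([1, 2, 3], [6, 4, 2])
print(round(matrix["age"]["trestbps"], 2))
```

A coefficient close to +1 or -1 indicates a strong linear relationship, while values near 0 indicate a weak one.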

RESULT AND DISCUSSION
In this section, the experimental test results on the proposed model are explained. The predictive performance of the proposed KNN model is analyzed using accuracy and the confusion matrix, along with the learning curve of the algorithm. Table 2 shows the accuracy of the proposed KNN model on five random tests.
As shown in Table 2, the highest accuracy score across the five random tests is 92.68%, with an average accuracy of x%. The predictive performance of the proposed model is also examined on the training set. The predictive accuracy of the proposed model is shown in Figure 4.
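One plausible reading of "five random tests" is five independent random 80%/20% splits, each scored by test accuracy. The sketch below follows that reading on synthetic data. The data generator, the seeds, the choice of k = 1 and all function names are assumptions, so the printed scores are unrelated to Table 2.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Majority label among the k training rows closest to query."""
    nearest = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)


# Synthetic two-class data: Gaussian clusters around (0, 0) and (3, 3).
rng = random.Random(42)
data = []
for i in range(200):
    label = i % 2
    center = 0.0 if label == 0 else 3.0
    data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))

scores = []
for run in range(5):
    shuffled = data[:]
    random.Random(run).shuffle(shuffled)   # a different random split per run
    n_test = len(shuffled) // 5            # 20% held out for testing
    test, train = shuffled[:n_test], shuffled[n_test:]
    scores.append(accuracy(train, test, k=1))

best_score = max(scores)
mean_score = sum(scores) / len(scores)
print([round(s, 4) for s in scores], round(best_score, 4), round(mean_score, 4))
```

Reporting both the best and the mean score across runs shows how much the result depends on a particular split.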

Confusion matrix
A confusion matrix measures the predictive performance of the proposed model in terms of the number of correct and incorrect predictions it makes on the test set. The confusion matrix of the proposed KNN model is shown in Figure 5.
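For a binary problem such as this one, the confusion matrix has four cells: true negatives, false positives, false negatives and true positives. The following sketch builds the matrix from label lists and derives accuracy from it. The labels are invented for illustration and do not correspond to Figure 5. The row and column layout (rows are true classes, columns are predicted classes) is an assumed convention.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 (negative) and 1 (positive)."""
    matrix = [[0, 0], [0, 0]]
    for truth, predicted in zip(y_true, y_pred):
        matrix[truth][predicted] += 1
    return matrix


# Invented labels for eight test observations.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(matrix, accuracy)  # [[3, 1], [1, 3]] 0.75
```

Accuracy is the sum of the diagonal cells divided by the total number of test observations.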

Training and test accuracy vs k-values
The learning curves of the proposed model show its performance on the training and test sets for different k-values, as shown in Figure 6. Figure 6 plots the training and test set accuracy on the y-axis against the number of neighbors (k) on the x-axis. The worst performance of the model is approximately 72.25%, which is still acceptable. The model performs best with one neighbor (k = 1), and its accuracy drops as the number of neighbors increases.
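A curve of this kind can be produced by refitting the classifier for each candidate k and recording the accuracy on both partitions. The sketch below does so on synthetic data. The data, the seed, the range of odd k-values and the function names are assumptions, and the resulting numbers are not those of Figure 6.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Majority label among the k training rows closest to query."""
    nearest = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, evaluation, k):
    """Fraction of evaluation rows classified correctly using the training rows."""
    hits = sum(knn_predict(train, features, k) == label for features, label in evaluation)
    return hits / len(evaluation)


# Synthetic, partly overlapping two-class data.
rng = random.Random(0)
data = []
for i in range(200):
    label = i % 2
    center = 0.0 if label == 0 else 2.0
    data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
rng.shuffle(data)
test, train = data[:40], data[40:]  # 20% test, 80% training

k_values = list(range(1, 16, 2))  # odd k avoids tied votes with two classes
curve = {k: (accuracy(train, train, k), accuracy(train, test, k)) for k in k_values}
best_k = max(k_values, key=lambda k: curve[k][1])

for k in k_values:
    train_acc, test_acc = curve[k]
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

At k = 1 every training point is its own nearest neighbor, so training accuracy is 1.0 whenever the training set contains no identical feature vectors with different labels. For that reason, test accuracy is the more informative curve when choosing k.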

CONCLUSION
In this research, the author proposed a KNN-based model for heart disease prediction using a dataset obtained from the Kaggle machine learning data repository. The predictive performance of the proposed model is evaluated on the test set using accuracy and the confusion matrix. The result of the performance analysis shows that the KNN model achieves an accuracy of 91.99%, with a highest score of 92.68% across five random tests. The model performs best with one neighbor (k = 1), and its accuracy drops as the number of neighbors increases, with a worst performance of approximately 72.25%.