Electronic health records to predict heart attacks using data mining with the Naïve Bayes method

ABSTRACT


INTRODUCTION
Cardiovascular disease (CVD) is the deadliest disease in the world and is on the rise in Asia. A number of factors contribute to the increase in CVD, such as a sedentary lifestyle, an unhealthy diet, and smoking; however, lifestyle changes can reduce the risk of CVD and improve quality of life [1]. Atherosclerosis is a chronic inflammatory disease, described as patchy intramural thickening of the subintima [2]. CVD affects the circulatory system, which includes the heart and blood vessels. The circulatory system keeps the body's organs functioning by transporting oxygen, nutrients, electrolytes, and hormones throughout the body; when there is a disturbance or blockage in the heart or blood vessels, blood circulation is impaired, causing complications such as heart disease or stroke [3]. Acute myocardial infarction (AMI), often referred to as a heart attack, is a decrease in blood flow in the coronary arteries due to occlusion, mostly caused by the process of atherosclerosis. Its risk factors can be divided into modifiable and non-modifiable risk factors [4].
The use of big data from datasets can improve services to patients, detect the spread of disease early, generate new insights into disease mechanisms, monitor the quality of medical and health institutions, and provide better treatment methods [5]. Cardiovascular disease has risk factors, measures used to determine the likelihood of disease; they are listed in Table 1 and summarized below:
− Obesity: an excessive accumulation of fat due to an imbalance between energy intake and energy expended.
− Sedentary lifestyle: little or no physical activity.
− Diabetes: a chronic disease characterized by high blood sugar (glucose) levels.
− High cholesterol: a condition in which cholesterol levels in the blood exceed normal limits.
− Hypertension: a condition in which blood pressure is 130/80 mmHg or higher.
− Gender: men are at higher risk of coronary heart disease than women; the risk in women increases after menopause.
− Age: a person's risk increases with age; usually from the age of 40 years, a person is advised to start checking their heart health.
− Genetic factor: heredity from a family member who has had a heart attack.
Big data refers to collections of data so large and complex that conventional data processing methods are no longer adequate. Big data is therefore analyzed so that patterns, or other regularities, related to the organization or its customers can be obtained [6]. Big data analysis refers to proper and rigorous analysis, ensuring that decision-making can be more accurate and performance better [7]. The characteristics of big data are shown in Table 2, including:
4. Value (importance of data): indicates the business value derived from big data [11].
5. Variability (data differentiation): refers to changes in data during processing and over its lifecycle [12].
6. Veracity (quality of data): covers two aspects, the consistency and the trustworthiness of data [13].
After these patterns are found, they can be used to make certain decisions for further business development [14]. The steps involved are:
− Data exploration: the data is cleaned (so that nothing is lost) and transformed into a different form, and the important variables are determined based on the problem.
− Pattern identification: identify and choose the patterns that make the best predictions.
− Deployment: the patterns are deployed to produce the desired outcome.
Data mining is the process of analyzing data from different angles and summarizing the results into useful information [15]. It is an automated data analysis technique that uncovers previously undetected relationships among data items, and it often involves the analysis of data stored in a data warehouse [16]. Data mining techniques can be applied in many settings, because data obtained from different sources can differ and be out of sync; a specific technique is applied to a specific type of problem so that it can be resolved efficiently [17]. The main data mining techniques are [18]:
− Classification: usually based on machine learning, this technique assigns items or variables in a data set to predetermined groups or classes. It uses linear programming, statistics, decision trees, and artificial neural networks, among other methods.
− Clustering: unlike classification, the labeling of the data groups is not determined in advance. Examples of clustering methods are K-means and C-means.
− Regression: a technique for determining the relationship between the variable to be predicted (the dependent variable) and other variables (the independent variables).
In medical research, big data is used for the electronic health record (EHR) and covers everything considered relevant to the understanding of health and disease, including clinical, imaging, and omics data, data from internet use and wearable devices, and others [19]. In health care institutions, data mining tools rapidly answer questions that are traditionally time-consuming and too complex to resolve [20]. The electronic health record (EHR) facilitates services related to patient medical records. The EMR (electronic medical record) system is a systematic collection of electronic-based health information that is connected and integrated with the information system in the hospital network [21].
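The three techniques listed above can be illustrated with a small sketch. This is not the paper's workflow (the authors use RapidMiner); it is a hedged scikit-learn analogue on synthetic data, where all values and the decision boundary are made up for illustration.

```python
# Sketch of the three data mining techniques named above, on tiny
# synthetic data (all values here are illustrative, not patient data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Classification: class labels are known in advance (supervised).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Clustering: no labels are given up front (unsupervised), e.g. K-means.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Regression: model the relationship between a dependent variable and
# an independent variable (true slope here is 3.0 plus small noise).
x = rng.uniform(0, 10, size=(50, 1))
z = 3.0 * x[:, 0] + rng.normal(scale=0.1, size=50)
reg = LinearRegression().fit(x, z)
print(reg.coef_[0])  # slope recovered from the synthetic data
```

The key contrast from the text shows up directly in the API: the classifier and regressor take labels/targets (`y`, `z`) while `KMeans.fit` receives only the features.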
Medical records are written or recorded information regarding identity, history taking, physical examination, laboratory results, and the diagnoses of all medical services and actions provided to patients, as well as treatment, covering inpatients, outpatients, and those receiving emergency services.

RESEARCH METHOD
The method works in the following steps [22]: (a) Data selection: obtain the data from various sources. (b) Data preprocessing: the manipulation or dropping of data before it is used, in order to ensure or enhance performance; this is an important step in the data mining process.

Data selection
Big data in healthcare refers to the vast quantities of data, created by the mass adoption of the internet and the digitization of all sorts of information including health records, that are too large or complex for traditional technology to make sense of. Clinical activity produces a large number of records, including patient information, diagnoses, treatment schemes, notes from doctors, and sensor data [23].
The dataset used in this research is the "Heart disease UCI" dataset, obtained from a hospital in Indonesia. It contains 14 attributes; each attribute is explained in Table 3. However, the data must first be pre-processed. Data pre-processing is one of the tasks in data mining and covers the preparation and conversion of data into a form suitable for mining procedures [24].
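As a hedged sketch of what the 14-attribute layout looks like when loaded programmatically: the column names below follow the common UCI heart-disease layout and the two sample rows are illustrative, not records from the paper's dataset; in practice one would read the real file, e.g. with `pd.read_csv("heart.csv")` (file name assumed).

```python
# Minimal sketch of loading and inspecting a heart-disease dataset with
# pandas. The column names follow the common UCI layout; the two rows
# are made-up illustrations, not real patient records.
import io
import pandas as pd

csv = io.StringIO(
    "age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target\n"
    "63,1,3,145,233,1,0,150,0,2.3,0,0,1,1\n"
    "67,1,0,160,286,0,1,108,1,1.5,1,3,2,0\n"
)
df = pd.read_csv(csv)
print(df.shape)  # (2, 14): each record carries the 14 attributes
```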

Data preprocessing
In data preprocessing, the software used in this methodology is RapidMiner. RapidMiner supports the data processing workflow: determining the variables that will be used in the process of grouping the data, and cleaning up unwanted data.
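The two preprocessing operations mentioned above, selecting the variables of interest and cleaning unwanted data, can be sketched in pandas as a rough analogue of the RapidMiner operators; the column names and values below are invented for illustration.

```python
# Hedged pandas analogue of the preprocessing described above:
# select the variables to keep, then drop incomplete records.
import pandas as pd

df = pd.DataFrame({
    "age":  [63, 67, None, 41],
    "sex":  [1, 1, 0, 0],
    "chol": [233, 286, 199, None],
    "note": ["a", "b", "c", "d"],  # unwanted free-text column
})
selected = df[["age", "sex", "chol"]]               # variable selection
cleaned = selected.dropna().reset_index(drop=True)  # remove rows with gaps
print(len(cleaned))  # only the 2 complete records remain
```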

Data analysis
Data analysis examines the existing data and applies statistical and visualization methods to test hypotheses about the data and find exceptions. Data mining looks for and finds trends in the data, which can be used for further analysis in the future. Classification algorithms learn the labels of the samples, together with their nominal and/or numeric attribute values, and create a model; they then make predictions using the generated model [25]. Naïve Bayes classification is a probabilistic model based on Bayes' theorem; it is a statistical classification method used for supervised learning [26]. The data mining extension (DMX) query language is used for model creation, model training, model prediction, and model content access. All parameters are set to their defaults except for "Minimum dependency probability = 0.05" for Naïve Bayes [27]. In this paper, we use the Naïve Bayes classification algorithm. Naïve Bayes is a simple probabilistic classifier that is easy to apply and performs well on data sets with a high number of instances [28]. The rules of Naïve Bayes are given in [29].
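The classification step described above, learning a model from labeled samples and then predicting on held-out data, can be sketched with a Naïve Bayes classifier in scikit-learn. This is a stand-in for the paper's RapidMiner/DMX workflow, trained on synthetic "risk factor" features rather than the real dataset.

```python
# Minimal Naïve Bayes classification sketch with scikit-learn; the two
# features and the binary disease label are synthetic, for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))           # two synthetic risk factors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = GaussianNB().fit(X_train, y_train)   # learn from labeled samples
print(model.score(X_test, y_test))           # accuracy on unseen data
```

`GaussianNB` assumes each feature is conditionally Gaussian given the class; for purely categorical attributes a `CategoricalNB` variant would be the closer fit.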
Sex: refers to a set of biological attributes in humans, primarily associated with physical and physiological features including chromosomes, gene expression, hormone levels and function, and reproductive/sexual anatomy ("0" = female, "1" = male). CP (chest pain type): can be divided into heart-related chest pain (cardiac chest pain) and chest pain that is not from a heart condition (non-cardiac chest pain). Trestbps (resting blood pressure): the pressure of circulating blood against the walls of the blood vessels; most of this pressure results from the heart pumping blood through the circulatory system. FBS (fasting blood sugar): measured from a blood sample taken after an overnight fast, with a level below 100 mg/dl considered normal; if the fasting blood sugar exceeds 120 mg/dl the value is one, otherwise zero.

Resting Electrocardiographic
The heart is a muscular organ that pumps blood through rhythmic contractions induced by electrical impulses generated by the sinus node, the heart's natural pacemaker. 0 = normal, 1 = has ST-T wave abnormality (T wave inversion), 2 = shows left ventricular hypertrophy. Thalach is the maximum heart rate achieved by the patient. Exang is exercise-induced angina; related attributes are the ST depression induced by exercise relative to rest (oldpeak), the slope of the peak exercise ST segment (slope), and the number of major vessels (ca). For exang, "Y" is coded as "1" and "N" as "0".

Old Peak
Oldpeak at entry, ranging between 0 and 6.2, indicates severe coronary lesions and large benefits of an early invasive treatment strategy in unstable coronary artery disease. Slope: in a cardiac stress test, an ST depression of at least 1 mm after adenosine administration indicates reversible ischemia, while an exercise stress test requires an ST depression of at least 2 mm to significantly indicate reversible ischemia. CA: fluoroscopy is a type of medical imaging that shows a continuous X-ray image on a monitor, much like an X-ray movie. Thal: a thallium stress test is a nuclear medicine study that shows how well blood flows through the heart muscle during exercise or at rest.

Implementation
The classification technique is used to create a model that predicts whether a patient with certain attributes has had a stroke or not. To do this, we reduced the attributes of the dataset according to the stroke risk factors mentioned above. The attributes are 'age', 'gender', 'hypertension', 'avg_glucose_level' (to indicate whether someone has diabetes), 'heart_disease', 'bmi' (body mass index, to indicate whether someone is obese), and 'smoking_status'. The 'stroke' attribute is also included as the label/class. Figure 2 shows how the operators in RapidMiner are configured to build the decision tree model. Before the optimize parameters operator comes the configuration for cleaning and reducing the dataset; the optimize parameters operator itself is a wrapper used to tune the parameters of the operators inside it. Inside the wrapper, the dataset is split into training and testing data with a 7:3 ratio. The training data is then fed into the decision tree operator to construct the decision tree model, with the criterion parameter set to 'information_gain'. Other parameters, such as 'maximal_depth', 'minimal_leaf_size', 'confidence', and 'minimal_size_for_split', are tuned by the wrapper. The model is then passed to the apply model operator together with the testing data, and on to the performance operator to evaluate the accuracy of the decision tree model. The association technique is used to create association rules that reveal associations between the attributes in the dataset that are related to stroke. The FP-Growth operator we used accepts attributes with nominal or categorical values; therefore, we chose the attributes 'gender', 'heart_disease', 'hypertension', 'smoking_status', and 'stroke'. Figure 3 shows the configuration of the operators used to create the association rules.
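The decision-tree workflow just described (7:3 train/test split, information-gain criterion, accuracy evaluation) can be translated into a scikit-learn sketch; this is an analogue of the RapidMiner operators, not their actual implementation, and the data is synthetic. In scikit-learn, information gain corresponds to the "entropy" criterion.

```python
# Sketch of the decision-tree workflow described above: a 7:3 split and
# an information-gain ("entropy") split criterion. Data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))              # four synthetic attributes
y = (X[:, 0] - X[:, 2] > 0).astype(int)    # synthetic class label

# 7:3 ratio, mirroring the split described in the text.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=7)

# max_depth stands in for RapidMiner's 'maximal_depth' parameter, which
# the paper tunes via the optimize parameters wrapper.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              random_state=7).fit(X_tr, y_tr)
acc = tree.score(X_te, y_te)   # analogue of the performance operator
print(acc)
```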
The first five operators are the same as those used for the classification technique, except that the select attributes operator now selects only the attributes mentioned above. The reduced dataset is connected to the FP-Growth operator with the parameter 'min_support' set to 0.3 and the other parameters left at their defaults. The frequent itemsets from this operator are then passed to the create association rules operator, whose 'min_confidence' parameter is set to 0.5, with the other parameters also left at their defaults. Figure 4 shows the clusters made by the clustering operator, with the average values of the attributes selected above for each of the 2 clusters. The first cluster (cluster_0) has 708 items and the second (cluster_1) has 4234 items. The first cluster has the highest relative proportion of patients who have had a stroke, at 12%; it consists of patients with an average BMI of 31, an average age of 58, and an average glucose level of 201. From this result, we can hypothesize that older patients who are obese and have diabetes are more likely to have a stroke. In the second cluster, only 3.7% of the 4234 patients have had a stroke; it consists of patients with an average age of around 40, a BMI of 27, and an average glucose level of 90.
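To make the support and confidence thresholds used above concrete, here is a hand-computed toy example rather than the FP-Growth operator itself; the four transactions are invented, not rows from the dataset.

```python
# Toy illustration of the thresholds used above (min_support = 0.3,
# min_confidence = 0.5), computed by hand on made-up transactions.
transactions = [
    {"hypertension", "stroke"},
    {"hypertension", "stroke", "heart_disease"},
    {"heart_disease"},
    {"hypertension"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

# Candidate rule: hypertension -> stroke
sup = support({"hypertension", "stroke"})       # 2 of 4 transactions
conf = sup / support({"hypertension"})          # support ratio
print(sup, conf)  # 0.5 and about 0.67
```

With these numbers the rule clears both thresholds (support 0.5 ≥ 0.3, confidence ≈ 0.67 ≥ 0.5), which is exactly the filtering the create association rules operator performs on the frequent itemsets.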

CONCLUSION
In this study, we can predict whether a person potentially has heart disease or not, so that the person can be treated before the disease gets worse, or the disease can even be prevented altogether. Classification is an appropriate data mining technique for processing heart disease datasets because the dataset used has a target variable to be classified. Classification assigns data to classes that already exist: no new groups are formed, and the process is supervised. This differs from clustering, which groups data into several clusters so that the data within one cluster are similar to each other.
Almost every day the rate of heart attacks increases. To reduce heart disease, a system is needed to detect potential heart attacks. Big data must first be analyzed to extract patterns that are useful for decision making. In this study, RapidMiner is used to predict patients diagnosed with a heart attack using the Naïve Bayes data mining technique. We use Naïve Bayes classification because it lets us define a target that can answer questions such as whether a patient has the potential for heart disease. After data analysis, the results can be fed back into the electronic health record (EHR).