CLASSIFICATION BINARY MODELS FOR BIOMEDICAL DATA: SIMPLE PROBABILISTIC NETWORKS AND LOGISTIC REGRESSION

In the biomedical area, a critical question is whether a classification model is accurate enough to correctly determine whether or not a patient has a certain disease. Several techniques may be used to address this problem. In this context, Bayesian networks have emerged as a practical classification technology with successful applications in many fields. At the same time, logistic regression is a statistical classification method that is widely used, as evidenced in the literature. In this paper we investigate the predictive performance of probabilistic networks in their simplest particular case, the so-called naïve Bayes network, compared with logistic regression. A systematic simulation study is performed and the procedures are illustrated on some benchmark biomedical data sets.


Introduction
Currently, different types of biomedical data, such as medical diagnoses, sequences, protein structures and families, proteomics data, ontologies, gene expression and other experimental data, are often collected in research centers. In this context, classification is an essential task used to predict group membership for data instances. Binary classification is one particular case of classification and has been successfully applied to a wide range of medical problems.
Many techniques can be used for binary classification, but methods with high performance are required to minimize risk, since diagnostic mistakes can cost patients their lives. In this context, Bayesian networks have emerged as a practical classification technology with successful applications in many fields and provide some advantages, such as the ability to combine expert opinion and experimental data (NIELSEN et al., 2009; HECKERMAN et al., 1995).
On the other hand, logistic regression is a widely used statistical method, as evidenced in the literature (KING and ZENG, 2001). Alternative techniques include probit analysis, mathematical programming, expert systems, neural networks, genetic algorithms and others (HAND and HENTLEY, 1997).
In general, no single technique is best for all data sets, but a set of methods can be compared using statistical criteria. The main thrust of this paper is therefore to investigate the ability of probabilistic networks in the simple particular case of the naïve Bayes network, the so-called simple probabilistic network, compared with logistic regression. We carry out a systematic comparison of both methods through simulation and real data analysis. The basic idea is to apply the models to several replicated artificial datasets and to some real datasets, and then to study the behavior of specific statistical performance measures.
We consider only the naïve Bayes network and logistic regression because both are well-established classification methods.
This paper is organized as follows. In Section 2 the naïve Bayes network and logistic regression procedures, as well as the ROC curve and some performance measures, are presented. In Section 3 we present the simulation results with artificial data and some analyses applied to benchmark real biomedical databases. We finish the paper with some final comments in Section 4.

Methodology
In this section we briefly describe the naïve Bayes network and logistic regression procedures and how the ROC curve is applied in both methods. We also present some statistical performance measures.

Naïve Bayes network
The naïve Bayes procedure, described by Good (1965), Duda and Hart (1973) and Flach and Lachiche (2004), is based on computing the posterior probability distribution $P(Y \mid X)$, where $Y = \{y_1, y_2, \ldots, y_p\}$ is the class variable and $X = \{X_1, X_2, \ldots, X_k\}$ is a set of attribute variables that explain the domain. This classifier makes a strong independence assumption, namely that the attributes are conditionally independent given the class, which makes the computation quite feasible. In other words,
$$P(Y \mid X) \propto P(Y) \prod_{i=1}^{k} P(X_i \mid Y).$$
Thus we predict the most plausible category through $\arg\max_Y P(Y \mid X)$. Besides, the naïve Bayes procedure can be interpreted as a simple probabilistic network. Figure 1 shows the naïve Bayes network and a particular case of a probabilistic network.
Rev. Bras. Biom., Lavras, v.36, n.1, p.48-55, 2018 - doi: 10.28951/rbb.v36i1.114
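As an illustration of the naïve Bayes computation, the following minimal Python sketch estimates the posterior under the conditional independence assumption. It assumes Gaussian class-conditional densities for continuous attributes, which is a choice made here for illustration and is not stated in the paper; the function names `fit_naive_bayes`, `posterior` and `predict` are hypothetical.

```python
import math

def fit_naive_bayes(X, y):
    """Estimate the class prior and per-attribute mean/variance for each class."""
    params = {}
    n = len(y)
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        k = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(k)]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-9
            for j in range(k)
        ]
        params[c] = (len(rows) / n, means, variances)
    return params

def log_gaussian(x, mean, var):
    """Log-density of a univariate normal distribution (Gaussian assumption)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def posterior(params, x):
    """P(Y = c | X = x) from prior times product of per-attribute densities."""
    log_joint = {
        c: math.log(prior)
        + sum(log_gaussian(xj, m, v) for xj, m, v in zip(x, means, variances))
        for c, (prior, means, variances) in params.items()
    }
    # Normalize in log space to avoid numerical underflow.
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {c: math.exp(v - top) / total for c, v in log_joint.items()}

def predict(params, x):
    """Most plausible category: the class with the largest posterior."""
    post = posterior(params, x)
    return max(post, key=post.get)
```

Working with sums of log-densities rather than products of densities is a design choice of this sketch; it gives the same arg max while remaining stable when there are many attributes.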

Logistic regression
In a similar way, we consider a set of attribute variables $X = \{X_1, X_2, \ldots, X_n\}$ and a class variable with binary categories $Y = \{y_1, y_2\}$. The logistic regression method consists of assuming a linear relation between $X$ and a logit transformation of the probability of $Y$. If we take $y_1$ as the category of interest, the model can be represented as
$$\log \frac{\pi}{1-\pi} = X\beta,$$
where $\pi = P(Y = y_1)$ and $\beta$ is the vector of coefficients. Equivalently,
$$\pi_i = \frac{\exp(X_i\beta)}{1+\exp(X_i\beta)},$$
where $\pi_i$ is the probability of the i-th patient belonging to the category of interest. A cut-off point $C$ can then be chosen for classification: patient $i$ is classified as diseased, that is, in category $y_1$, if $\pi_i > C$.
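The classification rule of logistic regression can be sketched in a few lines of Python. This is a minimal illustration that assumes the coefficient vector has already been estimated; the names `logistic_probability` and `classify`, and the convention that the first entry of `x` equals 1 for the intercept, are assumptions of the sketch.

```python
import math

def logistic_probability(x, beta):
    """Inverse logit of the linear predictor x'beta (x[0] = 1 for the intercept)."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    # Two algebraically equal forms, chosen so that exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def classify(x, beta, cutoff=0.5):
    """Return 1 (category of interest) when the probability exceeds the cut-off, else 0."""
    return 1 if logistic_probability(x, beta) > cutoff else 0
```

For example, with hypothetical coefficients `beta = [-1.0, 1.0]` and a patient `x = [1.0, 2.0]`, the linear predictor equals 1, so the patient is assigned to the category of interest at a cut-off of 0.5 but not at a cut-off of 0.8.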

Some performance measures
A misclassification takes place when the modeling procedure fails to allocate a patient to its true category, so the misclassification rates of a modeling procedure can be easily calculated. To assess misclassification we consider the overall correct prediction rate, also known as the accuracy rate (ACC), as well as the sensitivity (SEN) and the specificity (SPE). We also consider the Matthews Correlation Coefficient (MCC), a balanced measure that can be used even if the classes are of very different sizes. It returns a value between -1 and +1: a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction (NIELSEN; RUMI and SALMERÓN, 2009). The performance measures are defined as
$$ACC = \frac{TP+TN}{TP+TN+FP+FN}, \quad SEN = \frac{TP}{TP+FN}, \quad SPE = \frac{TN}{TN+FP},$$
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$
where TP is the number of true positives (test positive in actually positive cases), FP is the number of false positives (test positive in actually negative cases), FN is the number of false negatives (test negative in actually positive cases) and TN is the number of true negatives (test negative in actually negative cases).
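The four measures can be computed directly from the counts TP, FP, FN and TN. The Python sketch below uses the standard definitions of ACC, SEN, SPE and MCC; the function names are hypothetical, and returning an MCC of 0 when its denominator vanishes is a common convention adopted here as an assumption.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN and TN for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def performance_measures(tp, fp, fn, tn):
    """Return accuracy, sensitivity, specificity and Matthews correlation."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    sen = tp / (tp + fn) if (tp + fn) else float("nan")
    spe = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is set to 0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SEN": sen, "SPE": spe, "MCC": mcc}
```

Because MCC uses all four counts, it remains informative in very unbalanced settings where ACC alone can look high for a classifier that never predicts the rare class.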

ROC Curve
The receiver operating characteristic (ROC) curve is an effective method of evaluating the quality or performance of diagnostic tests, and is widely used in several biomedical applications (PARK et al., 2004).
The ROC curve generalizes the notions of sensitivity (SEN) and specificity (SPE) by plotting sensitivity against 1 - specificity over all possible cut-off points, with the aim of finding high values of both measures. A larger area under the curve indicates a lower misclassification error, and the point closest to the upper-left corner can be taken as the cut-off point.
In this context, as is often done with logistic regression, we can use the naïve Bayes network for binary classification and apply the ROC curve to define a cut-off point on the posterior probability, so that each patient is assigned to the appropriate group.
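The choice of a cut-off point from the ROC curve can be sketched as follows. This Python example is a minimal illustration, not the procedure used in the paper: it takes each distinct score as a candidate cut-off, calls a case positive when its score is at least the cut-off (which differs from a strict inequality only at ties), computes the area under the curve by the trapezoidal rule, and picks the cut-off whose ROC point has the smallest Euclidean distance to the upper-left corner. All function names are hypothetical.

```python
import math

def roc_points(y_true, scores):
    """Return (cutoff, FPR, TPR) triples, one per distinct score.

    Assumes 0/1 labels with at least one case in each class.
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= cutoff and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= cutoff and t == 0)
        points.append((cutoff, fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    pairs = sorted({(0.0, 0.0), (1.0, 1.0)} | {(fpr, tpr) for _, fpr, tpr in points})
    return sum(
        (x2 - x1) * (y1 + y2) / 2.0 for (x1, y1), (x2, y2) in zip(pairs, pairs[1:])
    )

def best_cutoff(points):
    """Cut-off whose (FPR, TPR) point lies closest to the corner (0, 1)."""
    return min(points, key=lambda p: math.hypot(p[1], 1.0 - p[2]))[0]
```

The same functions apply to scores from either method, since both the naïve Bayes posterior and the fitted logistic probability lie between 0 and 1.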

Experimental results
In this section we present the results of applying logistic regression and the naïve Bayes network to some real and artificial datasets.
As a first set of experiments we consider the Blood Transfusion, Breast Cancer, Diabetes and Statlog (Heart) benchmark databases. All of these are available in the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). Table 1 gives a numerical summary of the data sets and their performance measures. The two methods show very similar performance.
As a final set of experiments we generated datasets according to a binary random variable indicating the presence or absence of a particular disease. We carried out a comparative evaluation of both methods through a simulation study with 399 replications; this number was used by Hall (1986) to construct confidence intervals with the bootstrap technique. We consider a population with one class variable and ten attribute variables, and fixed the sample sizes at 100, 300, 1000 and 10000 elements. The attribute values in $X$ were generated according to Breiman (1998): patients without the disease follow a 10-dimensional normal distribution with mean vector $(0, \ldots, 0)$ and covariance $4I_{10}$, and patients with the disease follow a 10-dimensional normal distribution with mean vector $(1/\sqrt{10}, \ldots, 1/\sqrt{10})$ and covariance $I_{10}$, where $I_{10}$ is the identity matrix of order 10. We also consider four setups, with 50%, 25%, 10% and 1% rates of patients with the disease.
Overall, four datasets were generated according to these rates, hereafter called Setup 1, Setup 2, Setup 3 and Setup 4, respectively.
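The data-generating scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each patient's disease status is drawn independently as a Bernoulli variable with the setup's disease rate, and the attributes are drawn from independent normal components, which is equivalent to a multivariate normal with a diagonal covariance matrix. The function name `simulate_dataset` and its parameters are hypothetical.

```python
import math
import random

def simulate_dataset(n, disease_rate, dim=10, seed=None):
    """Draw n patients; return the attribute rows X and the 0/1 disease labels y."""
    rng = random.Random(seed)
    shift = 1.0 / math.sqrt(dim)  # common mean of each attribute for diseased patients
    X, y = [], []
    for _ in range(n):
        diseased = 1 if rng.random() < disease_rate else 0
        if diseased:
            # Mean (shift, ..., shift), identity covariance: unit standard deviation.
            row = [rng.gauss(shift, 1.0) for _ in range(dim)]
        else:
            # Mean (0, ..., 0), covariance 4I: standard deviation 2 per attribute.
            row = [rng.gauss(0.0, 2.0) for _ in range(dim)]
        X.append(row)
        y.append(diseased)
    return X, y
```

For example, `simulate_dataset(1000, 0.25, seed=1)` would correspond to one replication with 1000 patients and a 25% disease rate; the seed value is arbitrary.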
We then took four samples of different sizes. For all resamples we fitted the usual logistic regression model and the naïve Bayes network. Table 2 shows the performance measures and the 95% confidence intervals based on their resample distributions. The intervals show that the performance measures of both methods are statistically equal, except for the MCC measure in the largest sample from Setup 1, where naïve Bayes (NB) appears slightly better than logistic regression (LR), showing a significant improvement on this criterion.
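One way to obtain a 95% interval from the resample distribution of a performance measure is the percentile method, sketched below in Python. The paper does not state which interval construction was used, so the percentile method is an assumption of this sketch, and the function name `percentile_interval` is hypothetical. With 399 replications the ranks used for a 95% interval are whole numbers (10 and 390).

```python
def percentile_interval(values, level=0.95):
    """Percentile interval taken from the order statistics of replicated values."""
    ordered = sorted(values)
    b = len(ordered)
    alpha = 1.0 - level
    # Ranks (b + 1) * alpha / 2 and (b + 1) * (1 - alpha / 2), rounded to integers.
    lower_rank = round((b + 1) * alpha / 2.0)
    upper_rank = round((b + 1) * (1.0 - alpha / 2.0))
    # Keep both ranks inside the valid range 1..b.
    lower_rank = min(max(lower_rank, 1), b)
    upper_rank = min(max(upper_rank, 1), b)
    return ordered[lower_rank - 1], ordered[upper_rank - 1]
```

Applied to the 399 replicated values of a measure such as MCC for each method, non-overlapping intervals would suggest a difference between the two methods on that measure.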

Final comments
In this paper we observed close agreement between the naïve Bayes network and logistic regression in the results on biomedical data. Statistically, we observed equal classification performance, with a slight superiority of naïve Bayes in the MCC measure in the 50%-50% setup.

Figure 1 - The left sub-image shows the traditional structure of the naïve Bayes network and the right sub-image shows a probabilistic network with six random variables.

Table 1 - Performance measures obtained from the analysis of real biomedical data