Consider the problem of predicting a label $y$ on the basis of a vector of features $x = (x_1, \ldots, x_d)$. Let $f_k(x) \equiv P[X = x \mid Y = k]$ denote the density function of $X$ for an observation that comes from the $k$th class. In other words, $f_k(x)$ is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$, and $f_k(x)$ is small if it is very unlikely that an observation in the $k$th class has $X \approx x$. Let $\pi_k$ represent the overall, or prior, probability that a randomly chosen observation comes from the $k$th class; this is the probability that a given observation is associated with the $k$th category of the response variable $Y$. Also, let $p_k(x) = P(Y = k \mid X = x)$. We refer to $p_k(x)$ as the posterior probability that an observation $X = x$ belongs to the $k$th class. That is, it is the probability that the observation belongs to the $k$th class, given the predictor values for that observation.
Then Bayes’ theorem states that:
$$p_k(x) = P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}, \tag{1}$$

where $K$ is the number of classes. The Bayes optimal classifier is given by
$$h_{\text{Bayes}}(x) = \underset{k}{\text{argmax}} \ p_k(x) = \underset{k}{\text{argmax}} \ \pi_k f_k(x). \tag{2}$$

The Bayes classifier, which assigns an observation to the class for which $p_k(x)$ is largest, has the lowest possible error rate of all classifiers. This is, of course, only true if the terms in Eq. 1 are all correctly specified.
To specify the probability function $p_k(x)$ without any structural assumptions, the number of parameters required grows exponentially with the number of features: with $d$ binary features, for example, we would need on the order of $K \cdot 2^d$ probabilities, so the number of examples needed grows accordingly. In general, estimating $\pi_k$ is easy if we have a random sample of $Y$s from the population: we simply compute the fraction of training observations that belong to the $k$th class, i.e. $\hat{\pi}_k = n_k / n$, where $n_k$ is the number of observations in class $k$. However, estimating $f_k(x)$ tends to be more challenging unless we assume some simple forms for these densities. Therefore, if we can find a way to estimate $f_k(x)$, then we can develop a classifier that approximates the Bayes classifier.
Suppose that $\hat{f}_k(x)$ is a reasonably good estimator of $f_k(x)$. Then we can approximate the Bayes rule in Eq. 2 by
$$\hat{h}(x) = \underset{k}{\text{argmax}} \ \hat{\pi}_k \hat{f}_k(x).$$

The kernel density classifier uses a nonparametric kernel density estimator to estimate the density of $X$ given $Y = k$, and then plugs this estimate into the Bayes formula to approximate the Bayes rule. Let $X_1, \ldots, X_n$ be a random sample from a $d$-dimensional distribution with density $f(x)$. Then a kernel density estimator of $f(x)$ is given by
$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} h^{-d} K\!\left(\frac{X_i - x}{h}\right),$$

where $K$ is a non-negative density function, called a kernel function, which ensures that $\hat{f}_h(x)$ is itself a legitimate density function. For example, a commonly used kernel is the standard Gaussian density. The parameter $h$ is called the bandwidth. The KDE simply spreads the point mass at each observation $X_i$ into the smooth bump
$$h^{-d} K\!\left(\frac{X_i - x}{h}\right)$$

for a small $h$. Typically, the kernel function is fixed and the bandwidth $h$ is chosen to trade off bias and variance; the optimal $h$ depends on both the kernel and the underlying density. For each sub-sample of class $k$, we apply the kernel density estimator to construct $\hat{f}_k(x)$ and then use the argmax rule to predict the class label at $x$.
The kernel density classifier is quite straightforward both conceptually and computationally, but it is not recommended when the dimension is 3 or higher, due to the “curse-of-dimensionality”.
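As a concrete illustration, here is a minimal sketch of such a classifier in Python, written against scikit-learn's `KernelDensity`. The function names, the bandwidth value, and the toy data are illustrative choices, not part of the description above.

```python
# A minimal sketch of a kernel density classifier, assuming scikit-learn and NumPy;
# the function names, bandwidth, and toy data below are illustrative only.
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_kde_classifier(X, y, bandwidth=0.5):
    """Fit one Gaussian KDE per class and record the class priors pi_k = n_k / n."""
    classes = np.unique(y)
    kdes, log_priors = {}, {}
    for k in classes:
        X_k = X[y == k]
        kdes[k] = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X_k)
        log_priors[k] = np.log(len(X_k) / len(X))
    return classes, kdes, log_priors

def predict_kde_classifier(X_new, classes, kdes, log_priors):
    """Plug-in Bayes rule: argmax_k { log pi_k + log f_k(x) }."""
    # score_samples returns log f_k(x) for each row of X_new
    scores = np.column_stack(
        [log_priors[k] + kdes[k].score_samples(X_new) for k in classes]
    )
    return classes[np.argmax(scores, axis=1)]

# Toy usage on two 2-dimensional Gaussian blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
classes, kdes, log_priors = fit_kde_classifier(X, y, bandwidth=0.7)
print(predict_kde_classifier(np.array([[0.0, 0.0], [3.0, 3.0]]), classes, kdes, log_priors))
```

Working with log densities and adding $\log \hat{\pi}_k$ keeps the computation numerically stable; in practice the bandwidth would be tuned, for example by cross-validation.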
Linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) are model-based methods for classification. Both start from the assumption that the conditional distribution of $X$ given $Y$ is multivariate normal (Gaussian), i.e. $(X \mid Y = k) \sim N(\mu_k, \Sigma_k)$. Hence,
$$f_k(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\right).$$

Under this assumption, the Bayes rule in Eq. 2 can be written as
$$\underset{k}{\text{argmax}} \ \delta_k(x),$$

where
$$\delta_k(x) = \log \pi_k - \frac{1}{2} \log |\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k). \tag{4}$$

We can see that the Bayes rule is a quadratic function of $x$, hence the name quadratic discriminant. This classification rule is very intuitive: apart from constants that reflect prior knowledge, QDA classifies a point $x$ according to its (squared) Mahalanobis distance to the centroids $\mu_k$, defined by $(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)$. When $\pi_k$ and $\Sigma_k$ do not depend on $k$, this is exactly nearest-centroid classification.
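To see the nearest-centroid connection explicitly, suppose $\pi_k = 1/K$ and $\Sigma_k = \Sigma$ for every $k$. The terms $\log \pi_k$ and $-\frac{1}{2}\log|\Sigma_k|$ in Eq. 4 are then identical across classes and can be dropped, leaving

$$\underset{k}{\text{argmax}} \ \delta_k(x) = \underset{k}{\text{argmin}} \ (x - \mu_k)^T \Sigma^{-1}(x - \mu_k),$$

so $x$ is assigned to the class whose centroid is nearest in Mahalanobis distance.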
QDA replaces the parameters in Eq. 4 with the following estimates:
\begin{eqnarray}
\hat{\pi}_k & = & \frac{n_k}{n}, \\
\hat{\mu}_k & = & \frac{1}{n_k} \sum_{i:\, Y_i = k} X_i, \\
\hat{\Sigma}_k & = & \frac{1}{n_k - 1} \sum_{i:\, Y_i = k} (X_i - \hat{\mu}_k)(X_i - \hat{\mu}_k)^T,
\end{eqnarray}

where $n_k$ is the number of observations in class $k$. Thus, the QDA rule is:
$$\underset{k}{\text{argmax}} \left\{\log \hat{\pi}_k - \frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2}(x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1}(x - \hat{\mu}_k)\right\}.$$

LDA uses the additional assumption of a homogeneous covariance:
$$\Sigma_k = \Sigma \quad \text{for all } k.$$

As a result, the quadratic term $x^T \Sigma^{-1} x$ and the term $\log|\Sigma|$ do not depend on $k$. Hence, the Bayes rule has a simpler form:
$$\underset{k}{\text{argmax}} \ \delta_{\text{lda}}^k(x),$$

where
$$\delta_{\text{lda}}^k(x) = \mu_k^T \Sigma^{-1} x - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k. \tag{5}$$

Now the Bayes rule is a linear function of $x$, hence a linear discriminant. It is also referred to as Fisher's discriminant analysis, although Fisher's original derivation was geometric and not based on the probabilistic LDA model. In applications, LDA is often observed to perform quite well even when the model assumptions are clearly violated. With $K$ classes, LDA estimates $\Sigma$ with the pooled sample covariance:
$$\hat{\Sigma} = \frac{1}{\sum_{k=1}^{K} (n_k - 1)} \sum_{k=1}^{K} (n_k - 1)\, \hat{\Sigma}_k.$$

The LDA rule is:
$$\underset{k}{\text{argmax}} \left\{\hat{\mu}_k^T \hat{\Sigma}^{-1} x - \frac{1}{2}\hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k\right\}. \tag{6}$$

In particular, if there are only two classes $\{0, 1\}$, the rule classifies an observation to class 1 if and only if
$$\log\left(\frac{P[Y = 1]\,P[X = x \mid Y = 1]}{P[Y = 0]\,P[X = x \mid Y = 0]}\right) > 0.$$

This ratio is often called the log-likelihood ratio. In our case, this gives
$$(x - \hat{\mu}_a)^T \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0) + \log \frac{\hat{\pi}_1}{\hat{\pi}_0} > 0, \tag{7}$$

where $\hat{\mu}_a = \frac{\hat{\mu}_0 + \hat{\mu}_1}{2}$.
If we further assume that $\hat{\pi}_0 = \hat{\pi}_1$, then Eq. 7 can be written as
$$\frac{1}{2}(x - \hat{\mu}_0)^T \hat{\Sigma}^{-1}(x - \hat{\mu}_0) - \frac{1}{2}(x - \hat{\mu}_1)^T \hat{\Sigma}^{-1}(x - \hat{\mu}_1) > 0.$$

We can rewrite this as $w^T x + b > 0$, where
$$w = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0) \quad \text{and} \quad b = \frac{1}{2}\left(\hat{\mu}_0^T \hat{\Sigma}^{-1}\hat{\mu}_0 - \hat{\mu}_1^T \hat{\Sigma}^{-1}\hat{\mu}_1 \right).$$

When the dimension $d$ is moderately large, $\Sigma$ cannot be estimated accurately from a moderate sample size, and the computation of $\hat{\Sigma}^{-1}$ can be very unstable. If $d \geq n$, $\hat{\Sigma}$ is not even full rank. Friedman (1989) suggested regularized discriminant analysis (RDA), which uses the following shrinkage estimators:
$$\hat{\Sigma}^{\text{rda}}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma)\frac{\text{tr}(\hat{\Sigma})}{d}\mathbf{I}, \quad 0 \leq \gamma \leq 1, \tag{8}$$

which is shrunk toward the identity matrix (up to a scalar) as $\gamma \rightarrow 0$, and
$$\hat{\Sigma}^{\text{rda}}_k(\alpha, \gamma) = \alpha \hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}^{\text{rda}}(\gamma), \tag{9}$$

which is shrunk toward a common covariance matrix as $\alpha \rightarrow 0$. In practice, $(\alpha, \gamma)$ are chosen from the data by cross-validation.
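The pieces above fit together in a few lines of NumPy. Below is a minimal sketch of the LDA rule of Eq. 6 with the pooled covariance of the preceding display, optionally shrunk toward a scaled identity in the spirit of Eq. 8; the function names are illustrative, and `gamma` is fixed here rather than chosen by cross-validation.

```python
# A minimal NumPy sketch of the LDA rule (Eq. 6) with optional Eq. 8-style shrinkage;
# names and the default gamma are illustrative, not a reference implementation.
import numpy as np

def fit_lda(X, y, gamma=1.0):
    """Estimate pi_k, mu_k, and the pooled covariance, then shrink it as in Eq. 8."""
    n, d = X.shape
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Pooled covariance: weighted average of the class sample covariances
    pooled = sum((np.sum(y == k) - 1) * np.cov(X[y == k], rowvar=False) for k in classes)
    pooled /= (n - len(classes))
    # Shrink toward (tr(Sigma)/d) * I; gamma = 1 recovers plain LDA
    sigma = gamma * pooled + (1 - gamma) * (np.trace(pooled) / d) * np.eye(d)
    return classes, priors, means, sigma

def predict_lda(X_new, classes, priors, means, sigma):
    """LDA rule: argmax_k { mu_k^T Sigma^{-1} x - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k }."""
    sigma_inv = np.linalg.inv(sigma)
    scores = (
        X_new @ sigma_inv @ means.T
        - 0.5 * np.sum(means @ sigma_inv * means, axis=1)
        + np.log(priors)
    )
    return classes[np.argmax(scores, axis=1)]
```

Setting `gamma < 1` stabilizes the inverse when $d$ is close to $n$; a fuller treatment would also implement the class-specific blending of Eq. 9 and tune $(\alpha, \gamma)$ by cross-validation.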
The naive Bayes classifier can be viewed as a much simplified kernel density classifier when the dimension is relatively large. The basic idea is very simple. Assume that given the class label, the features are conditionally independent. That is,
$$f_k(x) = \prod_{j=1}^{d} f_{jk}(x_j) \quad \text{for all } k.$$

This independence assumption drastically simplifies the problem of density estimation: instead of estimating a $d$-dimensional density function for each class, we now estimate $d$ univariate density functions. With this assumption and the Bayes rule, the Bayes optimal classifier can be further simplified:
\begin{eqnarray}
h_{\text{Bayes}}(x) & = & \underset{k}{\text{argmax}} \ P[Y = k \mid X = x] \nonumber \\
& = & \underset{k}{\text{argmax}} \ P[Y = k]\,P[X = x \mid Y = k] \nonumber \\
& = & \underset{k}{\text{argmax}} \ P[Y = k] \prod_{j=1}^{d} P[X_j = x_j \mid Y = k] \\
& \approx & \underset{k}{\text{argmax}} \ \hat{\pi}_k \prod_{j=1}^{d} \hat{f}_{jk}(x_j),
\end{eqnarray}

where $\hat{f}_{jk}(x_j)$ is a univariate density estimate for variable $X_j$ based on the $k$th class of data. If $X_j$ is continuous, we can use a univariate kernel density estimator for $\hat{f}_{jk}(x_j)$. That is, the number of quantities to estimate now grows only linearly in $d$ (roughly $Kd$ univariate densities plus the class priors), rather than exponentially. When we instead posit a parametric form for each $f_{jk}$ and estimate its parameters by maximum likelihood, the resulting classifier is called the naive Bayes classifier.
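As a concrete illustration, here is a minimal sketch of a naive Bayes classifier in which each univariate factor $\hat{f}_{jk}$ is a Gaussian fitted by maximum likelihood; a univariate KDE could be substituted per feature instead. The function names and the variance floor `eps` are illustrative choices.

```python
# A minimal sketch of Gaussian naive Bayes under the conditional-independence assumption;
# each univariate f_jk is a normal density with ML-estimated mean and variance.
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate log pi_k plus, per class and feature, a univariate mean and variance."""
    classes = np.unique(y)
    log_priors = np.array([np.log(np.mean(y == k)) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])            # shape (K, d)
    variances = np.array([X[y == k].var(axis=0) + eps for k in classes])   # shape (K, d)
    return classes, log_priors, means, variances

def predict_gaussian_nb(X_new, classes, log_priors, means, variances):
    """argmax_k { log pi_k + sum_j log f_jk(x_j) } with Gaussian f_jk."""
    # log N(x_j; mu_jk, var_jk), summed over features j, for every class k
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None, :, :]
        + (X_new[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :],
        axis=2,
    )
    return classes[np.argmax(log_priors + log_lik, axis=1)]
```

Summing log densities rather than multiplying the densities themselves avoids numerical underflow when $d$ is large, which is exactly the regime where the naive Bayes simplification pays off.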
Although the conditional independence assumption is very convenient, it is rather naive and rarely even approximately true in reality. Hence one might wonder whether the naive Bayes classifier is practically useful at all. Surprisingly, naive Bayes classifiers have worked very well in many complex real-world applications such as text classification. A possible explanation is that although the individual class density estimates $\prod_{j=1}^{d} \hat{f}_{jk}(x_j)$ are poor estimators of the joint conditional density of the predictor vector, they may still be good enough to separate the most probable class from the rest. Another important use of the naive Bayes rule is to create augmented features.