library(dmtr)
library(purrr)
library(dplyr)
library(stringr)

1 데이터

data(binaryclass2, package = "dmtr")

Table 1.1: 판별분석 학습표본 데이터
id	x1	x2	class
1	5	7	1
2	4	3	2
3	7	8	2
4	8	6	2
5	3	6	1
6	2	5	1
7	6	6	1
8	9	6	2
9	5	4	2

2 피셔 판별 함수

각 객체는 변수벡터 \(\mathbf{x} \in \mathbb{R}^p\)와 범주 \(y \in \{1, 2\}\)로 이루어진다고 하자. 아래는 변수 \(\mathbf{x}\)의 기대치와 분산-공분산행렬(varinace-covariance matrix)을 나타낸다.

\[\begin{eqnarray*} \boldsymbol\mu_1 = E(\mathbf{x} | y = 1)\\ \boldsymbol\mu_2 = E(\mathbf{x} | y = 2)\\ \boldsymbol\Sigma = Var(\mathbf{x} | y = 1) = Var(\mathbf{x} | y = 2) \end{eqnarray*}\]

다음과 같이 변수들의 선형조합으로 새로운 변수 \(z\)를 형성하는 함수를 피셔 판별함수(Fisher’s discriminant function)라 한다.

\[\begin{equation} z = \mathbf{w}^\top \mathbf{x} \tag{2.1} \end{equation}\]

여기서 계수벡터 \(\mathbf{w} \in \mathbb{R}^p\)는 통상 아래와 같이 변수 \(z\)의 범주간 평균 차이 대 변수 \(z\)의 분산의 비율을 최대화하는 것으로 결정한다.

\[\begin{equation} {\arg\!\min}_{\mathbf{w}} \frac{\mathbf{w}^\top ( \boldsymbol\mu_1 - \boldsymbol\mu_2 )}{\mathbf{w}^\top \boldsymbol\Sigma \mathbf{w}} \tag{2.2} \end{equation}\]

위 식 (2.2)의 해는

\[\begin{equation*} \mathbf{w} \propto \boldsymbol\Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_2) \end{equation*}\]

의 조건을 만족하며, 편의상 비례상수를 1로 두면 아래와 같은 해가 얻어진다.

\[\begin{equation} \mathbf{w} = \boldsymbol\Sigma^{-1}(\boldsymbol\mu_1 - \boldsymbol\mu_2) \tag{2.3} \end{equation}\]

실제 모집단의 평균 및 분산을 알지 못하는 경우, 학습표본으로부터 \(\boldsymbol\mu_1, \boldsymbol\mu_2, \boldsymbol\Sigma\)의 추정치를 얻어 식 (2.3)에 대입하는 방식으로 판별계수를 추정한다. 자세한 내용은 교재 (전치혁 2012) 참조.

Table 1.1에 주어진 학습표본을 이용하여 피셔 판별함수를 구해보도록 하자. 우선 각 범주별 관측수 \(n_1, n_2\), 평균벡터 \(\hat{\boldsymbol\mu}_1, \hat{\boldsymbol\mu}_2\), 그리고 범주별 표본 분산-공분산행렬 \(\mathbf{S}_1, \mathbf{S}_2\)를 다음과 같이 함수 group_summary()를 사용하여 구한다.

summary_within_group <- group_summary(binaryclass2, class, c(x1, x2))
print(summary_within_group)
#> [[1]]
#> [[1]]$n
#> [1] 4
#> 
#> [[1]]$mean
#> x1 x2 
#>  4  6 
#> 
#> [[1]]$sigma
#>          x1        x2
#> x1 3.333333 1.0000000
#> x2 1.000000 0.6666667
#> 
#> 
#> [[2]]
#> [[2]]$n
#> [1] 5
#> 
#> [[2]]$mean
#>  x1  x2 
#> 6.6 5.4 
#> 
#> [[2]]$sigma
#>      x1   x2
#> x1 4.30 2.95
#> x2 2.95 3.80
#> 
#> 
#> attr(,"group")
#> [1] 1 2
#> Levels: 1 2

위에서 얻은 결과를 이용하여 합동 분산-공분산행렬을 아래와 같이 추정한다.

\[\begin{equation*} \hat{\boldsymbol\Sigma} = \mathbf{S}_p = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{n_1 + n_2 - 2} \end{equation*}\]

sigma_hat <- pooled_variance(binaryclass2, class, c(x1, x2))
print(sigma_hat)
#>          x1       x2
#> x1 3.885714 2.114286
#> x2 2.114286 2.457143

위에서 구한 추정치들을 이용하여 아래와 같이 판별함수 계수 추정치 \(\hat{\mathbf{w}}\)를 구한다. \[\begin{equation*} \hat{\mathbf{w}} = \hat{\boldsymbol\Sigma}^{-1}(\hat{\boldsymbol\mu}_1 - \hat{\boldsymbol\mu}_2) \end{equation*}\]

mu_hat <- map(summary_within_group, ~ .[["mean"]])
w_hat <- solve(sigma_hat) %*% 
  as.matrix(mu_hat[[1]] - mu_hat[[2]], ncol = 1L) %>%
  as.vector()
print(w_hat)
#> [1] -1.508039  1.541801

위 피셔 판별함수를 구하는 과정을 하나의 함수 fisher_ld()로 구현하였다.

w_hat <- fisher_ld(binaryclass2, class, c(x1, x2))
print(w_hat)
#>        x1        x2 
#> -1.508039  1.541801 
#> attr(,"group")
#> [1] 1 2
#> Levels: 1 2

2.1 분류 규칙

피셔 판별함수에 따른 분류 경계값은 학습표본에 대한 판별함수값의 평균으로 아래와 같이 구할 수 있다.

\[\begin{equation*} \bar{z} = \frac{1}{N} \sum_i^N \hat{\mathbf{w}}^\top \mathbf{x}_i \end{equation*}\]

z_mean <- binaryclass2 %>% 
  select(x1, x2) %>% 
  summarize_all(mean) %>%
  `*`(w_hat) %>%
  sum()
print(z_mean)
#> [1] 0.526438

위 분류 경계값을 구하는 과정을 함수 fisher_ld_threshold()로 구현하였다.

z_mean <- fisher_ld_threshold(binaryclass2, class, c(x1, x2))
print(z_mean)
#> [1] 0.526438

위 결과를 통해, 분류규칙은 다음과 같이 주어진다.

\(\hat{\mathbf{w}}^\top \mathbf{x} \ge \bar{z}\) 이면, \(\mathbf{x}\)를 범주 1로 분류
\(\hat{\mathbf{w}}^\top \mathbf{x} < \bar{z}\) 이면, \(\mathbf{x}\)를 범주 2로 분류

class_level <- attr(summary_within_group, "group")

prediction_df <- binaryclass2 %>%
  mutate(
    z = w_hat[1] * x1 + w_hat[2] * x2,
    .pred_class = factor(
      if_else(z >= z_mean, class_level[1], class_level[2]), 
      levels = class_level
    )
  )

위 분류 과정을 함수 fisher_ld_prediction()을 이용하여 아래와 같이 수행할 수 있다.

prediction_df <- fisher_ld_prediction(
  .w = w_hat, 
  .z = z_mean, 
  .newdata = binaryclass2, 
  .xvar = c(x1, x2),
  .levels = attr(w_hat, "group")
)

분류 결과는 아래와 같다.

Table 2.1: 학습표본에 대한 피셔 분류 결과
id	x1	x2	class	z	.pred_class
1	5	7	1	3.2524116	1
2	4	3	2	-1.4067524	2
3	7	8	2	1.7781350	1
4	8	6	2	-2.8135048	2
5	3	6	1	4.7266881	1
6	2	5	1	4.6929260	1
7	6	6	1	0.2025723	2
8	9	6	2	-4.3215434	2
9	5	4	2	-1.3729904	2

위 결과 객체 3, 7가 오분류된다.

3 의사결정론에 의한 분류규칙

다음과 같이 객체가 각 범주에 속할 사전확률과 각 범주 내에서의 분류변수의 확률밀도함수에 대한 기호를 정의한다.

\(\pi_k\): 임의의 객체가 범주 \(k\)에 속할 사전확률
\(f_k(\mathbf{x})\): 범주 \(k\)에 대한 변수의 확률밀도함수

이 때 통상적으로 \(\mathbf{x}\)는 다변량 정규분포를 따르는 것으로 가정하여 아래와 같이 평균벡터 \(\boldsymbol\mu_k\)와 분산-공분산행렬 \(\boldsymbol\Sigma\)로 확률밀도함수를 정의할 수 있다. 이 때 분산-공분산행렬 \(\boldsymbol\Sigma\)는 모든 범주에 대해 동일하다고 가정한다.

\[\begin{equation} f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma|^{1/2}} \exp \{ -\frac{1}{2} \left(\mathbf{x} - \boldsymbol\mu_k\right)^\top \boldsymbol\Sigma^{-1} \left(\mathbf{x} - \boldsymbol\mu_k\right) \} \tag{3.1} \end{equation}\]

각 범주에 대한 판별함수를

\[\begin{equation} u_k(\mathbf{x}) = \boldsymbol\mu_k^\top \boldsymbol\Sigma^{-1}\mathbf{x} - \frac{1}{2} \boldsymbol\mu_k^\top \boldsymbol\Sigma^{-1} \boldsymbol\mu_k + \ln \pi_k \tag{3.2} \end{equation}\]

로 정의한다. 위 판별함수를 함수 ld_fun()로 구현하였다.

f <- ld_fun(binaryclass2, class, c(x1, x2))

위 함수를 이용하여 학습표본에 대해 판별함수값을 다음과 같이 얻을 수 있다.

u_df <- map_dfc(f, 
  ~ binaryclass2 %>% 
    select(x1, x2) %>%
    transmute(u = apply(., 1, .x))
)
#> New names:
#> * u -> u...1
#> * u -> u...2
names(u_df) <- str_c(".score", attr(f, "group"), sep = "_")
score_df <- bind_cols(binaryclass2, u_df)

Table 3.1: 학습표본에 대한 LDA 적용 결과: 판별함수값
id	x1	x2	class	.score_1	.score_2
1	5	7	1	9.21	6.97
2	4	3	2	-1.94	0.49
3	7	8	2	11.01	10.25
4	8	6	2	4.59	8.42
5	3	6	1	7.40	3.70
6	2	5	1	5.04	1.37
7	6	6	1	5.72	6.53
8	9	6	2	4.03	9.37
9	5	4	2	0.43	2.82

베이즈 정리(Bayes’s theorem)에 따라 변수 \(\mathbf{x}\)값이 주어졌을 때 범주 \(k\)에 속할 사후확률(posterior)은 아래와 같이 구할 수 있다.

\[\begin{equation} P(y = k \, | \, \mathbf{x}) = \frac{\pi_k f_k(\mathbf{x})}{f(\mathbf{x})} \tag{3.3} \end{equation}\]

p_df <- u_df %>%
  mutate_all(exp) %>%
  mutate_all(function(x) x / rowSums(.))

names(p_df) <- str_c(".pred", attr(f, "group"), sep = "_")
posterior_df <- bind_cols(binaryclass2, p_df)

Table 3.2: 학습표본에 대한 LDA 적용 결과: 사후확률값
id	x1	x2	class	.pred_1	.pred_2
1	5	7	1	0.90	0.10
2	4	3	2	0.08	0.92
3	7	8	2	0.68	0.32
4	8	6	2	0.02	0.98
5	3	6	1	0.98	0.02
6	2	5	1	0.98	0.02
7	6	6	1	0.31	0.69
8	9	6	2	0.00	1.00
9	5	4	2	0.08	0.92

최종적으로, 각 범주에 대한 사후확률이 높은 쪽으로 범주를 추정한다.

\[\begin{equation} \hat{y} = \begin{cases} 1, & \text{if } P(y = 1 \, | \, \mathbf{x}) \ge P(y = 2 \, | \, \mathbf{x})\\ 2, & \text{otherwise} \end{cases} \tag{3.4} \end{equation}\]

yhat_df <- tibble(
  .pred_class = attr(f, "group")[apply(p_df, 1, which.max)]
)

prediction_df <- bind_cols(posterior_df, yhat_df)

Table 3.3: 학습표본에 대한 LDA 적용 결과: 사후확률값 및 추정범주
id	x1	x2	class	.pred_1	.pred_2	.pred_class
1	5	7	1	0.90	0.10	1
2	4	3	2	0.08	0.92	2
3	7	8	2	0.68	0.32	1
4	8	6	2	0.02	0.98	2
5	3	6	1	0.98	0.02	1
6	2	5	1	0.98	0.02	1
7	6	6	1	0.31	0.69	2
8	9	6	2	0.00	1.00	2
9	5	4	2	0.08	0.92	2

위와 같이 새 데이터에 대해 판별함수값, 사후확률 및 범주추정을 수행하는 함수를 predict_da()로 구현하였다.

predict_da(f, .new_data = binaryclass2, c(x1, x2), 
  .include_score = TRUE,
  .include_posterior = TRUE,
  .include_class = TRUE)
#> # A tibble: 9 × 5
#>   .score_1 .score_2 .pred_1 .pred_2 .pred_class
#>      <dbl>    <dbl>   <dbl>   <dbl> <fct>      
#> 1    9.21     6.97  0.903    0.0968 1          
#> 2   -1.94     0.489 0.0812   0.919  2          
#> 3   11.0     10.2   0.681    0.319  1          
#> 4    4.59     8.42  0.0212   0.979  2          
#> 5    7.40     3.70  0.976    0.0239 1          
#> 6    5.04     1.37  0.975    0.0247 1          
#> 7    5.72     6.53  0.307    0.693  2          
#> 8    4.03     9.37  0.00477  0.995  2          
#> 9    0.427    2.82  0.0838   0.916  2

4 이차 판별함수

각 범주의 확률밀도함수는 아래와 같이 다변량 정규분포로 정의된다.

\[\begin{equation} f_k(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol\Sigma_k|^{1/2}} \exp \{ -\frac{1}{2} \left(\mathbf{x} - \boldsymbol\mu_k\right)^\top \boldsymbol\Sigma_k^{-1} \left(\mathbf{x} - \boldsymbol\mu_k\right) \} \tag{4.1} \end{equation}\]

위 식 (4.1)이 선형판별함수에서 사용한 식 (3.1)과 다른 부분은 분산-공분산분포 \(\boldsymbol\Sigma_k\)가 범주 \(k\)에 대해 각각 정의된다는 점이다.

이 경우 각 범주에 대한 판별함수는 아래와 같이 정의된다.

\[\begin{equation} u_k(\mathbf{x}) = - \frac{1}{2} (\mathbf{x} - \boldsymbol\mu_k)^\top \boldsymbol\Sigma_k^{-1} (\mathbf{x} - \boldsymbol\mu_k) - \frac{1}{2} \ln \left| \boldsymbol\Sigma_k \right| + \ln \pi_k \tag{4.2} \end{equation}\]

위 판별함수를 데이터로부터 얻는 함수를 qd_fun()로 구현하였다.

f <- qd_fun(binaryclass2, class, c(x1, x2))

이후, 선형판별함수에서와 같이 함수 predict_df()를 이용하여 판별함수값, 사후확률 및 범주추정값을 얻을 수 있다.

prediction_df <- predict_da(f, .new_data = binaryclass2, c(x1, x2), 
  .include_score = TRUE,
  .include_posterior = TRUE,
  .include_class = TRUE)

prediction_df <- bind_cols(binaryclass2, prediction_df)

Table 4.1: 학습표본에 대한 QDA 적용 결과: 판별함수값, 사후확률 및 추정범주
id	x1	x2	class	.score_1	.score_2	.pred_1	.pred_2	.pred_class
1	5	7	1	-1.73	-3.95	0.90	0.10	1
2	4	3	2	-13.18	-2.50	0.00	1.00	2
3	7	8	2	-3.91	-3.15	0.32	0.68	2
4	8	6	2	-5.27	-1.87	0.03	0.97	2
5	3	6	1	-1.18	-5.76	0.99	0.01	1
6	2	5	1	-1.73	-6.20	0.99	0.01	1
7	6	6	1	-2.00	-1.93	0.48	0.52	2
8	9	6	2	-7.73	-2.58	0.01	0.99	2
9	5	4	2	-8.27	-1.93	0.00	1.00	2

5 세 범주 이상의 분류

5.1 데이터

3개의 범주를 지닌 붓꽃(iris) 데이터에 대해 판별분석을 수행하고자 한다.

data(iris90, package = "dmtr")

5.2 선형 판별 분석

함수 ld_fun()을 이용하여 선형 판별함수를 구하고, predict_da()를 이용하여 판별함수 값을 구하고 범주를 추정한다.

f <- ld_fun(iris90, class, x1:x4)

new_data <- tibble::tribble(
  ~x1, ~x2, ~x3, ~x4, ~class,
  5.1, 3.5, 1.4, 0.2, "setosa",
  7.0, 3.2, 4.7, 1.4, "versicolor",
  6.3, 3.3, 6.0, 2.5, "virginica"
) %>% 
  mutate(
    class = factor(class, levels = c("setosa", "versicolor", "virginica"))
  )

predict_df <- predict_da(f, .new_data = new_data, .xvar = x1:x4, 
  .include_score = TRUE)

predict_df <- bind_cols(new_data, predict_df)

Table 5.1: 붗꽃 표본에 대한 LDA 적용 결과: 판별함수값 및 추정범주
x1	x2	x3	x4	class	.score_setosa	.score_versicolor	.score_virginica	.pred_class
5.1	3.5	1.4	0.2	setosa	69.14	21.16	-23.36	setosa
7.0	3.2	4.7	1.4	versicolor	28.09	68.82	60.17	versicolor
6.3	3.3	6.0	2.5	virginica	-2.22	90.70	107.00	virginica

전치혁. 2012. 데이터마이닝 기법과 응용. 한나래출판사.