Association Rules for Diagnosis of Hiv-Aids
summarized in a contingency table, as shown in Table 1. From the table, the probability of HIV infection is P({HIV}) = 0.60, the probability of tuberculosis infection is P({Tuberculosis}) = 0.75, and the probability of both infections is P({HIV, Tuberculosis}) = 0.40. By Equations (6) and (7), the lift of {HIV} and {Tuberculosis} is P({HIV, Tuberculosis}) / (P({HIV}) × P({Tuberculosis})) = 0.40 / (0.60 × 0.75) = 0.89. Because this value is less than 1, there is a negative correlation between the occurrence of {HIV} and {Tuberculosis}. The numerator is the likelihood of a patient having both infections, while the denominator is what the likelihood would have been if the two infections were completely independent. Such a negative correlation cannot be identified by a support-confidence framework.
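The lift arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the counts are the ones reported in Table 1, while the function name `lift` and the variable names are illustrative choices, not part of the original paper.

```python
# Counts as reported in Table 1 (10,000 patient records in total).
n_total = 10_000
n_hiv = 6_000    # patients infected with HIV
n_tb = 7_500     # patients infected with tuberculosis
n_both = 4_000   # patients infected with both

p_hiv = n_hiv / n_total
p_tb = n_tb / n_total
p_both = n_both / n_total


def lift(p_a: float, p_b: float, p_ab: float) -> float:
    """Observed joint probability divided by the joint probability
    expected if A and B were independent (hypothetical helper name)."""
    return p_ab / (p_a * p_b)


value = lift(p_hiv, p_tb, p_both)
print(round(value, 2))  # a value below 1 indicates negative correlation
```

A result below 1 means the two infections co-occur less often than independence would predict, and a result above 1 means they co-occur more often.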
Table 1 A 2×2 contingency table summarizing the transactions with respect to HIV and Tuberculosis infection

                   HIV     ¬HIV     Σrow
Tuberculosis       4000    3500     7500
¬Tuberculosis      2000     500     2500
Σcol               6000    4000    10000

Example II Correlation analysis using chi square measure: To compute the correlation using $\chi^2$ analysis, we need the observed value and the expected value (displayed in parentheses) for each slot of the contingency table, as shown in Table 2. From the table, we compute the $\chi^2$ value as follows:

$$\chi^2 = \frac{(4000-4500)^2}{4500} + \frac{(3500-3000)^2}{3000} + \frac{(2000-1500)^2}{1500} + \frac{(500-1000)^2}{1000} = 555.6$$

Table 2 The above contingency table, now shown with the expected values

                   HIV            ¬HIV           Σrow
Tuberculosis       4000 (4500)    3500 (3000)     7500
¬Tuberculosis      2000 (1500)     500 (1000)     2500
Σcol               6000           4000           10000

Because the $\chi^2$ value is greater than one, and the observed value of the slot (HIV, Tuberculosis) is 4000, which is less than the expected value of 4500, HIV and Tuberculosis infection are negatively correlated. This is consistent with the conclusion derived from the lift analysis in the preceding example.
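The expected values and the chi-square statistic can be checked with a short Python sketch. It assumes the usual Pearson formulation, in which each expected count is (row total × column total) / grand total. The function name `chi_square` and the list-of-rows layout are illustrative assumptions.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a
    list of rows. Returns the statistic and the table of expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    expected = []
    for i, row in enumerate(observed):
        exp_row = []
        for j, obs in enumerate(row):
            # Count expected in this slot if rows and columns were independent.
            exp = row_totals[i] * col_totals[j] / total
            exp_row.append(exp)
            stat += (obs - exp) ** 2 / exp
        expected.append(exp_row)
    return stat, expected


# Rows: Tuberculosis, not Tuberculosis. Columns: HIV, not HIV.
observed = [[4000, 3500],
            [2000, 500]]
chi2, expected = chi_square(observed)
print(expected)
print(round(chi2, 1))
```

Comparing `observed[0][0]` with `expected[0][0]` gives the direction of the correlation: an observed count below the expected one points to a negative correlation.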
Let’s examine two other correlation measures, all_confidence and cosine, as defined below.
Given an itemset $X = \{i_1, i_2, \ldots, i_k\}$, the all_confidence of X is defined as:

$$all\_conf(X) = \frac{sup(X)}{max\_item\_sup(X)} = \frac{sup(X)}{\max\{sup(i_j) \mid \forall i_j \in X\}} \qquad (8)$$

where $\max\{sup(i_j) \mid \forall i_j \in X\}$ is the maximum (single) item support of all the items in X, and hence is called the max_item_sup of the itemset X. The all_confidence of X is the minimal confidence among the set of rules $i_j \rightarrow X - i_j$, where $i_j \in X$.
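As an illustration of Equation (8), the sketch below computes all_confidence for the itemset {HIV, Tuberculosis} using the counts from Table 1, and checks that it equals the smallest rule confidence. The function name `all_confidence` and the dictionary layout are hypothetical choices made for this example.

```python
def all_confidence(sup_itemset, item_supports):
    """Support of the itemset divided by the largest single-item support."""
    return sup_itemset / max(item_supports.values())


# Support counts taken from Table 1.
item_supports = {"HIV": 6000, "Tuberculosis": 7500}
sup_both = 4000

all_conf = all_confidence(sup_both, item_supports)

# Confidence of each rule i_j -> X - i_j, i.e. sup(X) / sup(i_j).
rule_confidences = {item: sup_both / sup for item, sup in item_supports.items()}
min_conf = min(rule_confidences.values())

print(round(all_conf, 3), round(min_conf, 3))
```

Dividing by the largest item support is the same as taking the smallest of the individual rule confidences, which is why the two printed values agree.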
Given two itemsets A and B, the cosine measure of A and B is defined as:

$$cosine(A, B) = \frac{P(A \cup B)}{\sqrt{P(A) \times P(B)}} = \frac{sup(A \cup B)}{\sqrt{sup(A) \times sup(B)}} \qquad (9)$$

The cosine measure can be viewed as a harmonized lift measure: the two formulae are similar except that for cosine, the square root is taken on the product of the probabilities of A and B. This is an important difference, however, because by taking the square root, the cosine value is only influenced by the supports of A, B, and A ∪ B, and not by the total number of patients.
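The effect of the square root can be seen numerically. The sketch below evaluates lift and cosine on the Table 1 counts and then on a hypothetical data set in which 90,000 extra patients with neither infection are added. The extra patients are an assumption made purely for illustration, as are the function names.

```python
import math


def lift(n_a, n_b, n_ab, n_total):
    """Lift computed from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))


def cosine(n_a, n_b, n_ab, n_total):
    """Cosine measure computed from raw counts, as in Equation (9)."""
    return (n_ab / n_total) / math.sqrt((n_a / n_total) * (n_b / n_total))


# Support counts from Table 1.
n_hiv, n_tb, n_both = 6000, 7500, 4000

# 10,000 is the total in Table 1; 100,000 is a hypothetical total that
# adds 90,000 patients who have neither infection.
for n_total in (10_000, 100_000):
    print(n_total,
          round(lift(n_hiv, n_tb, n_both, n_total), 2),
          round(cosine(n_hiv, n_tb, n_both, n_total), 3))
```

Under these assumptions lift moves from about 0.89 to about 8.89, so its verdict changes from negative to positive correlation, while cosine stays at about 0.596 because the total cancels out under the square root.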
Lift and chi-square are poor indicators of the other relationships, whereas all_confidence and cosine are good indicators. Between all_confidence and cosine, cosine is the better indicator. This is because cosine considers the supports of both A and B, whereas all_confidence considers only the maximal support. Lift and chi-square are poor correlation measures because their values are influenced by null patient data. A null patient record is one that does not contain any of the diseases being examined. Typically, the number of null records can outweigh the number of individuals infected with the diseases, because many patients may be infected with neither HIV nor Tuberculosis. On the other hand, all_confidence and cosine values are good indicators of correlation because they are not influenced by the number of null patient records. A measure is null-invariant if its value is free from the influence of null data. Null-invariance is an important property for
Computational
Molecular Biology