
Association Rules for Diagnosis of Hiv-Aids


summarized in a contingency table, as shown in Table 1. From the table, the probability of HIV infection is $P(\{HIV\}) = 0.60$, the probability of tuberculosis infection is $P(\{Tuberculosis\}) = 0.75$, and the probability of both infections is $P(\{HIV, Tuberculosis\}) = 0.40$. By Equations (6) and (7), the lift is $P(\{HIV, Tuberculosis\}) / (P(\{HIV\}) \times P(\{Tuberculosis\})) = 0.40 / (0.60 \times 0.75) = 0.89$. Because this value is less than 1, there is a negative correlation between the occurrence of {HIV} and {Tuberculosis}. The numerator is the likelihood of a patient having both infections, while the denominator is what the likelihood would have been if the two infections were completely independent. Such a negative correlation cannot be identified by a support-confidence framework.
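The lift calculation above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `lift` and the variable names are illustrative choices, and the counts are the ones reported in Table 1.

```python
# Counts from Table 1 (10000 patient records in total).
n_both = 4000     # records with both HIV and Tuberculosis
n_hiv = 6000      # records with HIV (column total)
n_tb = 7500       # records with Tuberculosis (row total)
n_total = 10000   # all records


def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    p_ab = n_ab / n
    p_a = n_a / n
    p_b = n_b / n
    return p_ab / (p_a * p_b)


hiv_tb_lift = lift(n_both, n_hiv, n_tb, n_total)
# A lift below 1 indicates negative correlation, above 1 positive correlation.
print(f"lift = {hiv_tb_lift:.2f}")
```

A value below 1 means the two infections co-occur less often than they would under independence.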

Table 1 A 2×2 contingency table summarizing the transactions with respect to HIV and Tuberculosis infection

                  HIV     ¬HIV     Σrow
Tuberculosis      4000    3500     7500
¬Tuberculosis     2000     500     2500
Σcol              6000    4000    10000

Example II. Correlation analysis using the chi-square measure: To compute the correlation using $\chi^2$ analysis, we need the observed value and the expected value (displayed in parentheses) for each slot of the contingency table, as shown in Table 2. From the table, we compute the $\chi^2$ value as follows:

$$\chi^2 = \frac{(4000-4500)^2}{4500} + \frac{(3500-3000)^2}{3000} + \frac{(2000-1500)^2}{1500} + \frac{(500-1000)^2}{1000} = 555.6$$

Table 2 The above contingency table, now shown with the expected values

                  HIV            ¬HIV           Σrow
Tuberculosis      4000 (4500)    3500 (3000)     7500
¬Tuberculosis     2000 (1500)     500 (1000)     2500
Σcol              6000           4000           10000

Because the $\chi^2$ value is greater than one, and the observed value of the slot (HIV, Tuberculosis) is 4000, which is less than the expected value of 4500, HIV and Tuberculosis are negatively correlated. This is consistent with the conclusion derived from the lift analysis in the preceding example.
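The expected values and the chi-square statistic can be checked programmatically. The sketch below assumes only the observed counts from the contingency table; the function name `chi_square` is illustrative. Each expected count is computed as (row total × column total) / grand total.

```python
# Observed counts: rows are Tuberculosis / no Tuberculosis,
# columns are HIV / no HIV.
observed = [
    [4000, 3500],
    [2000, 500],
]


def chi_square(obs):
    """Return (chi-square statistic, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    n = sum(row_totals)
    stat = 0.0
    expected = []
    for i, row in enumerate(obs):
        expected_row = []
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_totals[i] * col_totals[j] / n
            expected_row.append(e)
            stat += (o - e) ** 2 / e
        expected.append(expected_row)
    return stat, expected


chi2, expected = chi_square(observed)
print(expected)
print(f"chi-square = {chi2:.1f}")
```

Comparing `observed[0][0]` with `expected[0][0]` gives the direction of the correlation: an observed count below the expected count indicates negative correlation.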

Let us examine two other correlation measures, all_confidence and cosine, as defined below.

Given an itemset $X = \{i_1, i_2, \ldots, i_k\}$, the all_confidence of X is defined as:

$$all\_conf(X) = \frac{sup(X)}{max\_item\_sup(X)} = \frac{sup(X)}{\max\{sup(i_j) \mid \forall i_j \in X\}} \tag{8}$$

where $\max\{sup(i_j) \mid \forall i_j \in X\}$ is the maximum (single) item support of all the items in X, and hence is called the max_item_sup of the itemset X. The all_confidence of X is the minimal confidence among the set of rules $i_j \rightarrow X - i_j$, where $i_j \in X$.
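As an illustration, Equation (8) can be applied to the itemset {HIV, Tuberculosis} using the counts from Table 1. This is a sketch under that assumption; the names `all_confidence` and `item_supports` are illustrative. It also checks the stated equivalence with the minimal rule confidence, using the fact that the confidence of the rule $i_j \rightarrow X - i_j$ equals $sup(X)/sup(i_j)$.

```python
def all_confidence(sup_itemset, item_supports):
    """all_conf(X) = sup(X) / max single-item support over the items in X."""
    return sup_itemset / max(item_supports.values())


# Support counts taken from Table 1.
item_supports = {"HIV": 6000, "Tuberculosis": 7500}
sup_both = 4000  # support count of the itemset {HIV, Tuberculosis}

all_conf = all_confidence(sup_both, item_supports)

# Confidence of each rule i_j -> X - i_j is sup(X) / sup(i_j).
rule_confidences = {item: sup_both / s for item, s in item_supports.items()}

print(f"all_conf = {all_conf:.3f}")
print(rule_confidences)
```

The smallest rule confidence belongs to the item with the largest support, which is why dividing by max_item_sup gives the minimal confidence.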

Given two itemsets A and B, the cosine measure of A and B is defined as:

$$cosine(A, B) = \frac{P(A \cup B)}{\sqrt{P(A) \times P(B)}} = \frac{sup(A \cup B)}{\sqrt{sup(A) \times sup(B)}} \tag{9}$$

The cosine measure can be viewed as a harmonized lift measure: the two formulae are similar except that, for cosine, the square root is taken of the product of the probabilities of A and B. This is an important difference, however, because by taking the square root the cosine value is influenced only by the supports of A, B, and A∪B, and not by the total number of patients.
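The difference between cosine and lift can be seen numerically. The sketch below uses the counts from Table 1 and then adds 90000 hypothetical patient records that contain neither disease; this extra figure is an assumption made purely for illustration and does not come from the data. The function names are likewise illustrative.

```python
import math


def cosine(n_ab, n_a, n_b):
    """cosine(A, B) = sup(A u B) / sqrt(sup(A) * sup(B)); no total count needed."""
    return n_ab / math.sqrt(n_a * n_b)


def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A u B) / (P(A) * P(B)); depends on the total count n."""
    return (n_ab * n) / (n_a * n_b)


# Counts from Table 1.
n_both, n_hiv, n_tb, n_total = 4000, 6000, 7500, 10000

# Hypothetical: 90000 additional records with neither HIV nor Tuberculosis.
n_total_with_nulls = n_total + 90000

lift_before = lift(n_both, n_hiv, n_tb, n_total)
lift_after = lift(n_both, n_hiv, n_tb, n_total_with_nulls)
cos_before = cosine(n_both, n_hiv, n_tb)
cos_after = cosine(n_both, n_hiv, n_tb)  # the total count never enters

print(f"lift:   {lift_before:.2f} -> {lift_after:.2f}")
print(f"cosine: {cos_before:.3f} -> {cos_after:.3f}")
```

Under this assumption the lift moves from below 1 to above 1, reversing the apparent direction of the correlation, while the cosine value is unchanged because the total number of patients cancels out of Equation (9).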

Lift and chi-square are poor indicators of the other relationships, whereas all_confidence and cosine are good indicators. Between all_confidence and cosine, cosine may be the better indicator, because cosine considers the supports of both A and B, whereas all_confidence considers only the maximal support. Lift and chi-square perform poorly because they are sensitive to null patient records. A null patient record is one that contains none of the diseases being examined. Typically, the number of null records can outweigh the number of individuals infected with the diseases, because many patients may be infected with neither HIV nor Tuberculosis. On the other hand, all_confidence and cosine values are good indicators of correlation because they are not influenced by the number of null patient records. A measure is null-invariant if its value is free from the influence of null data. Null-invariance is an important property for

Computational

Molecular Biology