
Computational Molecular Biology


(ii). Generate strong association rules from the

frequent itemsets: By definition, these rules must

satisfy minimum support and minimum confidence.
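The rule-generation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used by any particular mining tool: the function name `generate_rules`, the `support_counts` dictionary, and the toy counts are all introduced here for demonstration, and the counts of every subset of a frequent itemset are assumed to be available already.

```python
from itertools import combinations


def generate_rules(support_counts, n_transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples meeting both thresholds.

    support_counts maps each frequent itemset (a frozenset) to the number of
    transactions containing it. Every non-empty subset of a frequent itemset
    is assumed to be present as a key.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        support = count / n_transactions
        if support < min_support:
            continue
        # Try every non-empty proper subset of the itemset as the antecedent.
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                confidence = count / support_counts[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, support, confidence))
    return rules


# Hypothetical toy counts, chosen only to exercise the function.
toy_counts = {
    frozenset({"a"}): 6,
    frozenset({"b"}): 8,
    frozenset({"a", "b"}): 5,
}
rules = generate_rules(toy_counts, n_transactions=10,
                       min_support=0.3, min_confidence=0.7)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "=>", sorted(consequent),
          round(support, 2), round(confidence, 2))
```

With these toy counts only the rule from "a" to "b" survives, because the reverse rule has a confidence of 5/8, which falls below the 0.7 threshold.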

2.3 From association analysis to correlation analysis

Correlation relationships between associated items: Most association rule mining algorithms employ a support-confidence framework. Often, many interesting rules can be found using low support thresholds. Although minimum support and confidence thresholds help exclude many uninteresting rules from exploration, these two measures alone are insufficient for filtering out all uninteresting association rules. To tackle this weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads to correlation rules of the form

A ⇒ B [support, confidence, correlation]    (5)

That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures from which to choose. Lift is a simple example. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B), where P(A ∪ B) denotes the probability that a record contains all the items of both A and B; otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B can be measured by computing

lift(A, B) = P(A ∪ B) / (P(A) P(B))    (6)

If the resulting value of Equation (6) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B. If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one makes the occurrence of the other more likely. If the resulting value is equal to 1, then A and B are independent and there is no correlation between them. Equation (6) is equivalent to P(B|A)/P(B), or conf(A ⇒ B)/sup(B), which is also referred to as the lift of the association (or correlation) rule A ⇒ B. In other words, it assesses the degree to which the occurrence of one “lifts” the occurrence of the other. For example, if A corresponds to HIV infection and B corresponds to tuberculosis, then HIV infection is said to increase, or “lift”, the likelihood of tuberculosis by a factor of the value returned by Equation (6). This pairing is of interest because HIV-infected patients have been reported to also suffer from tuberculosis.
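The three cases around the value 1, and the equivalence between the ratio in Equation (6) and conf(A ⇒ B)/sup(B), can be checked numerically. The sketch below is illustrative only: the helper names `lift`, `lift_from_confidence`, and `interpret`, and the counts in the usage loop, are assumptions made here and do not come from the study data.

```python
import math


def lift(count_a, count_b, count_ab, n):
    """Lift from raw counts: P(A and B together) / (P(A) * P(B))."""
    p_a = count_a / n
    p_b = count_b / n
    p_ab = count_ab / n
    return p_ab / (p_a * p_b)


def lift_from_confidence(count_a, count_b, count_ab, n):
    """Same quantity written as conf(A => B) / sup(B)."""
    confidence = count_ab / count_a
    return confidence / (count_b / n)


def interpret(value, rel_tol=1e-9):
    """Map a lift value to one of the three cases around 1."""
    if math.isclose(value, 1.0, rel_tol=rel_tol):
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"


# Hypothetical counts: 100 records, A occurs in 40, B occurs in 50.
for both in (10, 20, 30):
    value = lift(40, 50, both, 100)
    # The two formulations agree up to floating-point rounding.
    assert math.isclose(value, lift_from_confidence(40, 50, both, 100))
    print(both, round(value, 2), interpret(value))
```

The comparison with 1 uses a tolerance rather than exact equality because lift values computed from real counts are floating-point numbers; in practice a statistical test would be a more robust way to decide whether a deviation from 1 is meaningful.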

Example 1 Correlation analysis: To help filter out misleading “strong” associations of the form A ⇒ B from the data, we need to study how the two itemsets, A and B, are correlated. Suppose we have patient data from a hospital. Of the 10 000 patients analyzed, 6 000 were HIV positive, 7 500 were infected with tuberculosis, and 4 000 were infected with both HIV and tuberculosis. Suppose that a data mining program for discovering association rules is run on the data, using a minimum support of, say, 30% and a minimum confidence of 60%. The following association rule is discovered:

disease(X, “HIV”) ⇒ disease(X, “tuberculosis”) [support = 40%, confidence = 66%]    (7)

Rule (7) is a strong association rule and would therefore be reported, since its support value of 4 000/10 000 = 40% and confidence value of 4 000/6 000 ≈ 66% satisfy the minimum support and minimum confidence thresholds, respectively. However, Rule (7) is misleading because the probability of tuberculosis infection among all patients is 75%, which is even larger than 66%. In fact, HIV and tuberculosis are negatively associated in this data set, because infection with one of these diseases actually decreases the likelihood of infection with the other.
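The figures in Example 1 can be verified by hand. The worked calculation below uses only the counts given in the example (10 000 patients, 6 000 with HIV, 7 500 with tuberculosis, 4 000 with both) together with the lift ratio of Equation (6); A denotes HIV and B denotes tuberculosis.

```latex
\begin{align*}
\mathrm{sup}(A \Rightarrow B)  &= \frac{4000}{10000} = 0.40, \\
\mathrm{conf}(A \Rightarrow B) &= \frac{4000}{6000} \approx 0.667, \\
P(B)                           &= \frac{7500}{10000} = 0.75, \\
\mathrm{lift}(A, B)            &= \frac{P(A \cup B)}{P(A)\,P(B)}
                                = \frac{0.40}{0.60 \times 0.75}
                                = \frac{0.40}{0.45} \approx 0.89 .
\end{align*}
```

Because the lift is below 1, the rule passes both thresholds while A and B are negatively correlated in this data set, which is exactly the situation the support-confidence framework fails to flag.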

The above example also illustrates that the confidence

of rule A

⇒

B can be deceiving in that it is only an

estimate of the conditional probability of itemset B

given itemset A. It does not measure the real strength

of the correlation and implication between A and B.

Hence alternatives to the support-confidence

framework can be useful in mining data relationships.

From the above example, we need to study how the two itemsets A and B are correlated. Following this example, let ¬HIV refer to the patients who do not have HIV, and ¬tuberculosis refer to those who do not have tuberculosis. The patient data can be
