Computational Molecular Biology
(ii). Generate strong association rules from the
frequent itemsets: By definition, these rules must
satisfy minimum support and minimum confidence.
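As a rough illustration of step (ii), the sketch below enumerates
candidate rules from a table of frequent itemsets and keeps those that
meet both thresholds. The function name generate_rules, the dictionary
layout (frozenset mapped to a support fraction), and the toy support
values are assumptions made for this illustration; they are not taken
from the method described here.

```python
from itertools import combinations


def generate_rules(supports, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for strong rules.

    `supports` maps each frequent itemset (a frozenset) to its support,
    expressed as a fraction of all records. This layout is an assumption
    of the sketch.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2 or sup < min_support:
            continue
        # Every non-empty proper subset is a candidate antecedent.
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                antecedent = frozenset(subset)
                if antecedent not in supports:
                    continue
                # confidence(A => B) = support(A and B) / support(A)
                confidence = sup / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append(
                        (antecedent, itemset - antecedent, sup, confidence)
                    )
    return rules


# Toy supports (hypothetical values chosen only to exercise the function).
toy_supports = {
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.75,
    frozenset({"a", "b"}): 0.4,
}
strong_rules = generate_rules(toy_supports, min_support=0.3, min_confidence=0.6)
```

With these toy values only the rule a ⇒ b survives, because the
reverse rule b ⇒ a has a confidence of 0.4/0.75, which falls below
the 0.6 threshold.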
2.3 From association analysis to correlation analysis
Correlation relationships between associated items:
Most association rule mining algorithms employ a
support-confidence framework. Often, many
interesting rules can be found using low support
thresholds. Although minimum support and
confidence thresholds help weed out or exclude the
exploration of many uninteresting rules, many of the
rules that remain are still not interesting to users.
The support and confidence measures alone are
therefore insufficient for filtering out uninteresting
association rules. To tackle this weakness, a
correlation measure can be used to augment the
support-confidence framework for association rules.
This leads to correlation rules of the form

A ⇒ B [support, confidence, correlation]    (5)
That is, a correlation rule is measured not only by its
support and confidence but also by the correlation
between itemsets A and B. There are many different
correlation measures from which to choose. Let us
consider a simple one. Lift is a simple correlation
measure that is defined as follows. The occurrence of
itemset A is independent of the occurrence of itemset
B if P(A ∪ B) = P(A)P(B), where P(A ∪ B) denotes the
probability that a record contains all the items of
both A and B; otherwise, itemsets A and B are
dependent and correlated as events. This definition
can easily be extended to more than two itemsets. The
lift between the occurrence of A and B can be
measured by computing (lift equation):

\[ \mathrm{lift}(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)} \tag{6} \]
If the resulting value of Equation (6) is less than 1,
then the occurrence of A is negatively correlated with
the occurrence of B. If the resulting value is greater
than 1, then A and B are positively correlated,
meaning that the occurrence of one makes the
occurrence of the other more likely. If the resulting
value is equal to 1, then A and B are independent and
there is no correlation between them. Equation (6) is
equivalent to P(B|A)/P(B), or conf(A ⇒ B)/sup(B),
which is also referred to as the lift of the association
(or correlation) rule A ⇒ B. In other words, it assesses
the degree to which the occurrence of one “lifts” the
occurrence of the other. For example, if A corresponds
to HIV infection and B corresponds to tuberculosis,
then HIV infection is said to increase, or “lift”, the
likelihood of tuberculosis by a factor of the value
returned by Equation (6). This reflects the observation
that HIV-infected patients often also suffer from
tuberculosis.
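The three cases for Equation (6) can be sketched in a few lines of
Python. The function names lift and interpret, the tolerance
parameter tol, and the probabilities in the usage lines are
illustrative assumptions, not part of the original text.

```python
def lift(p_a, p_b, p_ab):
    """Lift of A => B: P(A and B occur together) / (P(A) * P(B))."""
    if p_a <= 0 or p_b <= 0:
        raise ValueError("P(A) and P(B) must be positive")
    return p_ab / (p_a * p_b)


def interpret(value, tol=1e-9):
    """Classify a lift value into the three cases of Equation (6)."""
    if abs(value - 1.0) <= tol:
        return "independent"
    if value > 1.0:
        return "positively correlated"
    return "negatively correlated"


# Hypothetical probabilities, chosen only to show each case.
case_independent = interpret(lift(0.5, 0.5, 0.25))
case_positive = interpret(lift(0.5, 0.5, 0.40))
case_negative = interpret(lift(0.5, 0.5, 0.10))

# The equivalent form conf(A => B) / sup(B) gives the same number.
p_a, p_b, p_ab = 0.5, 0.4, 0.3
lift_via_confidence = (p_ab / p_a) / p_b
```

A tolerance is used for the "equal to 1" case because probabilities
estimated from counts are floating-point values and rarely hit 1
exactly.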
Example 1 Correlation analysis: To help filter out
misleading “strong” associations of the form A ⇒ B
from the data, we need to study how the two itemsets,
A and B, are correlated. Suppose we have the patient
records of a hospital. Of the 10,000 patients analyzed,
6,000 were HIV positive, 7,500 were infected with
tuberculosis, and 4,000 were infected with both HIV
and tuberculosis. Suppose that a data mining program
for discovering association rules is run on the data,
using a minimum support of, say, 30% and a minimum
confidence of 60%. The following association rule is
discovered:
Disease(X, “HIV”) ⇒ Disease(X, “tuberculosis”)
[support = 40%, confidence = 66%]    (7)
Rule (7) is a strong association rule and would
therefore be reported, since its support value of
4,000/10,000 = 40% and confidence value of
4,000/6,000 ≈ 66% satisfy the minimum support and
minimum confidence thresholds, respectively.
However, Rule (7) is misleading because the
probability of tuberculosis infection is 75%, which is
even larger than 66%. In fact, HIV and tuberculosis
are negatively associated in this data, because
infection with one of these diseases actually decreases
the likelihood of infection with the other.
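The arithmetic of Example 1 can be checked with a short Python
sketch. Only the four patient counts and the two thresholds come from
the example; the variable names are assumptions made for
illustration. The lift is computed as in Equation (6).

```python
# Counts taken from Example 1.
n_patients = 10_000
n_hiv = 6_000    # patients who are HIV positive
n_tb = 7_500     # patients infected with tuberculosis
n_both = 4_000   # patients with both HIV and tuberculosis

p_hiv = n_hiv / n_patients
p_tb = n_tb / n_patients

support = n_both / n_patients    # 0.40
confidence = n_both / n_hiv      # about 0.667 (quoted as 66% in the text)
lift_value = support / (p_hiv * p_tb)  # 0.40 / 0.45, about 0.889

# Thresholds from the example: 30% support, 60% confidence.
passes_thresholds = support >= 0.30 and confidence >= 0.60
negatively_correlated = lift_value < 1.0

print(f"support={support:.2f} confidence={confidence:.3f} "
      f"P(tuberculosis)={p_tb:.2f} lift={lift_value:.3f}")
```

The rule passes both thresholds, yet its lift of about 0.889 is below
1, so in this data an HIV-positive patient is less likely than an
average patient to have tuberculosis. This is the sense in which the
rule is misleading.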
The above example also illustrates that the confidence
of rule A
⇒
B can be deceiving in that it is only an
estimate of the conditional probability of itemset B
given itemset A. It does not measure the real strength
of the correlation and implication between A and B.
Hence alternatives to the support-confidence
framework can be useful in mining data relationships.
From the above example, we need to study how the two
itemsets A and B are correlated. Following this
example, let ¬HIV refer to the patients who do not
have HIV, and let ¬tuberculosis refer to those who do
not have tuberculosis. The patient data can be