Computational Molecular Biology
measuring correlations in large patient databases. To
check whether all confidence and cosine are the best at
assessing correlation in all cases, let us examine the
HIV and Tuberculosis examples again.
Example 2 Comparison of four correlation measures
on HIV and Tuberculosis data. We revisit Example 1.
Let $D_1$ be the original HIV (H) and Tuberculosis (T)
data set from Table 5.7. We add two more data sets,
$D_0$ and $D_2$, where $D_0$ has zero null-patient
records and $D_2$ has 10 000 null-patient records
(instead of only 500 as in $D_1$). The values of all
four correlation measures are shown in Table 3.
Table 3 Comparison of the four correlation measures for HIV-and-Tuberculosis data sets

Dataset  $HT$  $\overline{H}T$  $H\overline{T}$  $\overline{H}\,\overline{T}$  All_confi  cosine  Lift  $\chi^2$
$D_0$    4000  3500             2000             0                             0.53       0.60    0.84  1477.8
$D_1$    4000  3500             2000             500                           0.53       0.60    0.89  555.6
$D_2$    4000  3500             2000             10000                         0.53       0.60    1.73  2913.0

In Table 3, the counts $HT$, $\overline{H}T$ and
$H\overline{T}$ remain the same in $D_0$, $D_1$ and
$D_2$. However, lift and $\chi^2$ change from rather
negative to rather positive correlations, whereas all
confidence and cosine have the nice null-invariant
property, and their values remain the same in all cases.
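
The figures in Table 3 can be reproduced from the four cell counts alone. The following is a minimal sketch, not the authors' code: the function name `correlation_measures` and its parameter names are assumptions made for illustration, and $\chi^2$ is computed with the standard Pearson formula for a 2x2 contingency table.

```python
from math import sqrt

def correlation_measures(ht, nh_t, h_nt, nh_nt):
    """Return (all_confidence, cosine, lift, chi_square) for a 2x2 table.

    ht    : patients with both H and T
    nh_t  : patients with T but not H
    h_nt  : patients with H but not T
    nh_nt : null patients (neither H nor T)
    """
    n = ht + nh_t + h_nt + nh_nt          # total number of patients
    sup_h = ht + h_nt                     # support count of H
    sup_t = ht + nh_t                     # support count of T

    all_conf = ht / max(sup_h, sup_t)     # does not involve n
    cosine = ht / sqrt(sup_h * sup_t)     # does not involve n
    lift = ht * n / (sup_h * sup_t)       # depends on n
    # Pearson chi-square for a 2x2 table: N(ad - bc)^2 / (row and column sums)
    chi2 = (n * (ht * nh_nt - nh_t * h_nt) ** 2
            / (sup_h * (n - sup_h) * sup_t * (n - sup_t)))
    return all_conf, cosine, lift, chi2

# Number of null patients in each data set, as given in Table 3.
null_counts = {"D0": 0, "D1": 500, "D2": 10_000}
results = {name: correlation_measures(4000, 3500, 2000, nulls)
           for name, nulls in null_counts.items()}

for name, (ac, cos, lift, chi2) in results.items():
    print(f"{name}: all_conf={ac:.2f} cosine={cos:.2f} "
          f"lift={lift:.2f} chi2={chi2:.1f}")
```

Only `lift` and `chi2` use the total count `n`, which is why they shift as null patients are added while `all_conf` and `cosine` stay fixed.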
Unfortunately, we cannot precisely assert that a set of
items is positively or negatively correlated when the
value of all confidence or cosine is around 0.5. Strictly
based on whether the value is greater than 0.5, we would
say that H and T are positively correlated in $D_1$;
however, the lift and $\chi^2$ analysis shows that they
are negatively correlated. Therefore, a good strategy is
to perform the all confidence or cosine analysis first
and, when the result shows only a weak positive or
negative correlation, to perform other analyses to
obtain a more complete picture.
Besides null-invariance, another nice feature of the all
confidence measure is that it has the Apriori-like
downward closure property. That is, if a pattern is
all-confident (i.e., it passes a minimal all confidence
threshold), so is every one of its subpatterns. In other
words, if a pattern is not all-confident, further growth
(or specialization) of this pattern will never satisfy the
minimal all confidence threshold. This follows from
Equation (8): adding any item to an itemset X will never
increase sup(X) and never decrease max_item_sup(X),
and thus will never increase all_conf(X). This property
makes Apriori-like pruning possible: we can prune any
pattern that cannot satisfy the minimal all confidence
threshold during the growth of all-confident patterns
in mining.
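
The pruning described above can be illustrated with a small level-wise search. This is only a sketch under stated assumptions: the patient records are hypothetical toy data (P stands for an invented third diagnosis), and the helper names `sup`, `all_conf` and `mine_all_confident` are introduced here for illustration.

```python
from itertools import combinations

# Hypothetical toy records: the set of diagnoses recorded for each patient.
records = [
    {"H", "T", "P"},
    {"H", "T"},
    {"H", "T"},
    {"H", "P"},
    {"T"},
    {"T", "P"},
    {"H"},
]

def sup(itemset):
    """Support count: number of records containing every item of itemset."""
    return sum(1 for r in records if itemset <= r)

def all_conf(itemset):
    """all_conf(X) = sup(X) / max_item_sup(X)."""
    return sup(itemset) / max(sup(frozenset([i])) for i in itemset)

def mine_all_confident(min_all_conf):
    """Level-wise search that grows only all-confident patterns."""
    items = sorted(set().union(*records))
    # Every single item that occurs has all_conf = 1, so all items start out.
    level = [frozenset([i]) for i in items]
    found = []
    while level:
        found.extend(level)
        survivors = set(level)
        # Join step: candidates one item larger, built only from survivors.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune step: by downward closure, a candidate with any failed
        # subpattern cannot be all-confident, so it is dropped unevaluated.
        level = [c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, len(c) - 1))
                 and all_conf(c) >= min_all_conf]
    return found

patterns = mine_all_confident(0.5)
print(sorted("".join(sorted(p)) for p in patterns))
```

On this toy data with a threshold of 0.5, the pairs {H, P} and {T, P} fail, so the triple {H, T, P} is never generated.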
Jaccard Similarity Coefficient: It is a statistical index
for measuring the similarity and diversity of sample sets
(Roussinov and Zhao, 2003).
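
As a brief illustration, the Jaccard coefficient of two sets is the size of their intersection divided by the size of their union. The sketch below applies the standard definition to the H and T patient groups, using the counts from Table 3; the patient ID ranges are an assumption chosen only to reproduce those counts, and returning 1.0 for two empty sets is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(union)

# Hypothetical patient IDs arranged to match the counts in Table 3:
# 2000 patients with H only, 4000 with both, 3500 with T only.
h_patients = set(range(0, 6000))
t_patients = set(range(2000, 9500))

j = jaccard(h_patients, t_patients)
print(f"Jaccard(H, T) = {j:.2f}")
```

Patients with neither H nor T belong to neither set, so adding null patients leaves this value unchanged.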
3 Results and Discussion
Tuberculosis (TB) and HIV have been closely linked
since the emergence of AIDS. Worldwide, TB is the
most common opportunistic infection affecting
HIV-seropositive individuals, and it remains the most
common cause of death in patients with AIDS
(Raviglione et al., 1995). HIV infection has
contributed to a significant increase in the worldwide
incidence of TB (AIDSCAP, 1996; Raviglione et al.,
1992). By producing a progressive decline in
cell-mediated immunity, HIV alters the pathogenesis
of TB, greatly increasing the risk of disease from TB
in HIV-coinfected individuals and leading to more
frequent extrapulmonary involvement, atypical
radiographic manifestations, and paucibacillary
disease, which can impede timely diagnosis. Although
HIV-related TB is both treatable and preventable,
its incidence continues to climb in developing nations
where HIV infection and TB are endemic and
resources are limited. Interactions between HIV and
TB medications, overlapping medication toxicities,
and immune reconstitution inflammatory syndrome
(IRIS) complicate the cotreatment of HIV and TB.
Association rules that capture such correlations
between diseases can help biomedical scientists and
medical practitioners treat these diseases more
effectively.
It was observed that the use of only support and
confidence measures to mine associations results in