Association Rules for Diagnosis of Hiv-Aids
represented by a Boolean vector of values (indicating
whether a particular virus infection was checked)
assigned to these variables. The Boolean vectors can be
analyzed for symptoms that are frequently associated
together and that help diagnose a particular virus
infection. These patterns can be represented in the
form of association rules. For example, patients who
report certain symptoms to medical practitioners also
tend to be referred for certain tests based on those
symptoms, and an association rule can be generated
from this information. In association rule mining,
support and confidence are two measures of rule
interestingness. They respectively reflect the
usefulness and certainty of discovered rules. A support
of 2% for Association Rule (1) means that 2% of all
the symptom records under analysis show that the
patient is suffering from a disease that might be
caused by certain viruses. A confidence of 60% means
that 60% of the patients who took a specific virus test
have HIV-AIDS. Typically, association rules are
considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold.
Such thresholds can be set by users or domain experts.
Additional analysis can be performed to uncover
interesting statistical correlations between associated
items.
In the case of HIV, association rule (1) can be written as:
Virus → Disease [support = 2%, confidence = 60%] (1)
A support of 2% for association rule (1) means that
2% of all virus analyses indicate that the virus might
be HIV, which causes AIDS. A confidence of 60%
means that 60% of the patients who were recommended
for a virus analysis test might test positive for HIV,
which causes AIDS.
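As a minimal sketch of how the support and confidence of a rule such as Virus → Disease could be computed from Boolean vectors, the Python snippet below uses ten invented patient records. The data, the variable names `virus` and `disease`, and the helper `rule_measures` are assumptions made for illustration and do not come from the study.

```python
# Hypothetical patient records as Boolean vectors (invented, not study data).
# Each tuple is (virus, disease): True means the variable was observed.
records = [
    (True, True),
    (True, True),
    (True, True),
    (True, False),
    (True, False),
    (False, True),
    (False, False),
    (False, False),
    (False, False),
    (False, False),
]


def rule_measures(records):
    """Return (support, confidence) of the rule virus -> disease as fractions."""
    total = len(records)
    # Records in which both the antecedent and the consequent are True.
    both = sum(1 for virus, disease in records if virus and disease)
    # Records in which the antecedent alone is True.
    antecedent = sum(1 for virus, _ in records if virus)
    support = both / total
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence


support, confidence = rule_measures(records)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# prints: support = 30%, confidence = 60%
```

With these invented records, 3 of 10 patients have both variables set (support 30%), and 3 of the 5 patients with `virus` set also have `disease` set (confidence 60%).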
Let $I = \{I_1, I_2, \ldots, I_m\}$ be the set of HIV
symptoms. Let $D$ be the set of HIV stages, where each
stage record $T$ is a set of symptoms such that
$T \subseteq I$. Each record is associated with an
identifier, called a TID. Let $A$ be a set of symptoms.
A record $T$ is said to contain $A$ if and only if
$A \subseteq T$. An association rule is an implication
of the form $A \Rightarrow B$, where $A \subset I$,
$B \subset I$, and $A \cap B = \emptyset$. The rule
$A \Rightarrow B$ holds in the stage set $D$ with
support $s$, where $s$ is the percentage of records in
$D$ that contain $A \cup B$ (i.e., the union of set $A$
and set $B$). This is taken to be the probability
$P(A \cup B)$. The rule $A \Rightarrow B$ has
confidence $c$ in the stage set $D$, where $c$ is the
percentage of records in $D$ containing $A$ that also
contain $B$. This is taken to be the conditional
probability $P(B \mid A)$.
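That is,
\[ \text{support}(A \Rightarrow B) = P(A \cup B) \tag{2} \]
\[ \text{confidence}(A \Rightarrow B) = P(B \mid A) \tag{3} \]
From equation 3 we have:
\[ \text{confidence}(A \Rightarrow B) = P(B \mid A) = \frac{\text{support}(A \cup B)}{\text{support}(A)} = \frac{\text{support\_count}(A \cup B)}{\text{support\_count}(A)} \tag{4} \]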
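The relation in equation 4 can be checked numerically. The sketch below assumes a small invented stage set `D` in which each record is a Python set of symptom labels. The labels and the helper names `support_count`, `support` and `confidence` are hypothetical. Exact fractions are used so that the two forms of equation 4 can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical stage set D: each record T is a set of symptoms (invented labels).
D = [
    {"fever", "fatigue", "rash"},
    {"fever", "fatigue"},
    {"fever", "weight_loss"},
    {"fatigue", "rash"},
    {"fever", "fatigue", "weight_loss"},
]


def support_count(itemset, records):
    """Number of records T that contain every item of itemset (itemset <= T)."""
    return sum(1 for t in records if itemset <= t)


def support(itemset, records):
    """Relative support: fraction of records that contain the itemset."""
    return Fraction(support_count(itemset, records), len(records))


def confidence(a, b, records):
    """Equation (4): support_count(A u B) / support_count(A).

    Raises ZeroDivisionError if no record contains A.
    """
    return Fraction(support_count(a | b, records), support_count(a, records))


A, B = {"fever"}, {"fatigue"}
print(support(A | B, D))      # 3/5
print(confidence(A, B, D))    # 3/4
# Both forms of equation (4) agree on this data.
print(support(A | B, D) / support(A, D) == confidence(A, B, D))  # True
```

Here `A | B` is the set union, so a record "contains A ∪ B" when it contains every symptom of both A and B.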
Equation 4 shows that the confidence of rule $A \Rightarrow B$
can be easily derived from the support counts of $A$ and
$A \cup B$. That is, once the support counts of $A$, $B$,
and $A \cup B$ are found, it is straightforward to derive
the corresponding association rules $A \Rightarrow B$ and
$B \Rightarrow A$ and check whether they are strong.
Rules that satisfy both a minimum support threshold
(min_sup) and a minimum confidence threshold
(min_conf) are called strong. By convention, support
and confidence values are written so as to occur
between 0% and 100%, rather than 0 to 1.0
(Han and Kamber, 2011).
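To illustrate the strong-rule check, the following sketch derives both rule directions from three support counts and compares them with the thresholds. The counts, the threshold values and the function names `rule_stats` and `is_strong` are assumptions chosen for illustration. Values are expressed as percentages, following the convention above.

```python
def rule_stats(count_antecedent, count_union, total):
    """Return (support %, confidence %) of a rule from its support counts."""
    support_pct = 100.0 * count_union / total
    confidence_pct = 100.0 * count_union / count_antecedent
    return support_pct, confidence_pct


def is_strong(count_antecedent, count_union, total, min_sup, min_conf):
    """A rule is strong if it meets both min_sup and min_conf (in percent)."""
    support_pct, confidence_pct = rule_stats(count_antecedent, count_union, total)
    return support_pct >= min_sup and confidence_pct >= min_conf


# Hypothetical support counts over 10 records (invented for illustration).
total = 10
count_a, count_b, count_ab = 4, 8, 4
min_sup, min_conf = 30.0, 60.0  # assumed thresholds, in percent

print(rule_stats(count_a, count_ab, total))   # A => B: (40.0, 100.0)
print(rule_stats(count_b, count_ab, total))   # B => A: (40.0, 50.0)
print(is_strong(count_a, count_ab, total, min_sup, min_conf))  # True
print(is_strong(count_b, count_ab, total, min_sup, min_conf))  # False
```

Both directions share the same support because they involve the same union A ∪ B. Only the confidence differs, since it is divided by the support count of a different antecedent.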
2.2 Frequent itemsets for association rule mining
A set of items is referred to as an itemset. An itemset
that contains k items is a k-itemset. The set {virus,
disease} is a 2-itemset. The occurrence frequency of
an itemset is the number of transactions that contain
the itemset. This is also known, simply, as the
frequency, support count, or count of the itemset. Note
that the itemset support defined in Equation (2) is
sometimes referred to as relative support, whereas the
occurrence frequency is called the absolute support. If
the relative support of an itemset I satisfies a
prespecified minimum support threshold (i.e., the
absolute support of I satisfies the corresponding
minimum support count threshold), then I is a frequent
itemset. In general, association rule mining can be
viewed as a two-step process:
(i) Find all frequent itemsets: by definition, each of
these itemsets will occur at least as frequently as a
predetermined minimum support count, min_sup.
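The frequent-itemset step can be sketched with a plain exhaustive enumeration, shown below. This is not the Apriori algorithm or the procedure used in the study. It simply counts every candidate k-itemset and keeps those whose support count reaches the threshold. The records, the threshold and the function name `frequent_itemsets` are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(records, min_sup_count):
    """Return {itemset: support count} for all itemsets meeting min_sup_count.

    Exhaustive enumeration, suitable only for small item universes.
    """
    items = sorted(set().union(*records))
    result = {}
    for k in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, k):
            # Absolute support: number of records containing the candidate.
            count = sum(1 for t in records if set(candidate) <= t)
            if count >= min_sup_count:
                result[frozenset(candidate)] = count
                found_any = True
        if not found_any:
            # Every subset of a frequent itemset is frequent, so if no
            # k-itemset is frequent, no larger itemset can be frequent.
            break
    return result


# Hypothetical records (invented for illustration).
records = [
    {"virus", "disease"},
    {"virus", "disease", "fever"},
    {"virus"},
    {"fever"},
]

for itemset, count in frequent_itemsets(records, min_sup_count=2).items():
    print(sorted(itemset), count)
```

With a minimum support count of 2, the 2-itemset {virus, disease} is frequent in this invented data because it appears in two of the four records.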
Computational
Molecular Biology