Apriori Algorithm

1. The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent itemsets and the association rules derived from them.

2. It is designed to operate on a database containing a large number of transactions, for instance, the items bought by customers in a store.

3. The algorithm, introduced by R. Agrawal and R. Srikant in 1994, is of great significance in data mining.

4. The algorithm is named Apriori because it uses prior knowledge of the properties of frequent itemsets.

5. We apply an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets.


Apriori Property – "All non-empty subsets of a frequent itemset must also be frequent."
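This property can be checked directly in Python. The following is a minimal sketch (the helper name support_count and the three-transaction toy list are illustrative, not from the article): any transaction that contains the whole itemset also contains each of its subsets, so every non-empty subset is at least as frequent.

    from itertools import combinations

    def support_count(itemset, transactions):
        """Number of transactions containing every item of `itemset`."""
        return sum(1 for t in transactions if itemset <= t)

    transactions = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I1", "I2", "I4"}]
    itemset = frozenset({"I1", "I2"})

    # Every non-empty proper subset is at least as frequent as the set itself.
    for r in range(1, len(itemset)):
        for subset in combinations(itemset, r):
            assert support_count(frozenset(subset), transactions) >= \
                   support_count(itemset, transactions)

The contrapositive is what the algorithm actually uses: if any subset of a candidate is infrequent, the candidate can be pruned without counting it.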


The Apriori algorithm has three significant components. They are as follows:

  • Support
  • Confidence
  • Lift

Consider the following dataset of 9 transactions; we will find its frequent itemsets and generate association rules from them.

TID     Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
The minimum support is 0.2, i.e. 0.2 × number of transactions = 0.2 × 9 = 1.8, rounded up to a minimum support count of 2 transactions.
The minimum confidence is 60%.
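For the code sketches that follow, the table above can be encoded as a list of Python sets. The representation and the variable names (transactions, min_sup_count) are illustrative choices, not part of the original article:

    import math

    # Rows T100..T900 of the table above, one set of items per transaction.
    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"},             {"I2", "I3"},
        {"I1", "I2", "I4"}, {"I1", "I3"},             {"I2", "I3"},
        {"I1", "I3"},       {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
    ]

    min_support = 0.2
    # 0.2 * 9 = 1.8, rounded up to 2 transactions.
    min_sup_count = math.ceil(min_support * len(transactions))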

Step-1: K=1

(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set):

Itemset   Support count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

(II) Compare each candidate item's support count with the minimum support count (here min_support = 2; items whose support count is less than min_support are removed). This gives us the itemset L1. Here every item meets the threshold, so L1 contains all five items.
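A sketch of this first pass, re-declaring the transactions from the earlier snippet so the block runs on its own:

    from collections import Counter

    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
        {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
        {"I1", "I2", "I3"},
    ]
    min_sup_count = 2

    # C1: support count of every individual item.
    c1 = Counter(item for t in transactions for item in t)
    # -> I1: 6, I2: 7, I3: 6, I4: 2, I5: 2

    # L1: items whose support count meets the minimum support count.
    l1 = {frozenset([item]) for item, count in c1.items() if count >= min_sup_count}
    # Every item survives, so L1 = {{I1}, {I2}, {I3}, {I4}, {I5}}.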


Step-2: K=2

(I) Generate candidate set C2 using L1 (this is called the join step). The condition for joining two itemsets from Lk-1 is that they have (k-2) items in common; for k = 2 this means no items need to match, so C2 consists of all 2-item combinations of L1.

Check whether all subsets of each itemset are frequent, and if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)

Now find the support count of these itemsets by searching the dataset:

Itemset      Support count
{I1, I2}     4
{I1, I3}     4
{I1, I4}     1
{I1, I5}     2
{I2, I3}     4
{I2, I4}     2
{I2, I5}     2
{I3, I4}     0
{I3, I5}     1
{I4, I5}     0
(II) Compare each candidate's (C2) support count with the minimum support count (here min_support = 2; itemsets whose support count is less than min_support are removed). This gives us the itemset L2:

Itemset      Support count
{I1, I2}     4
{I1, I3}     4
{I1, I5}     2
{I2, I3}     4
{I2, I4}     2
{I2, I5}     2
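A sketch of the k = 2 round: since (k-2) = 0 items must match, the join is simply every pair of frequent items, followed by the support-count filter:

    from itertools import combinations

    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
        {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
        {"I1", "I2", "I3"},
    ]
    min_sup_count = 2
    l1_items = ["I1", "I2", "I3", "I4", "I5"]  # items of L1 from step 1

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Join step: C2 is every 2-item combination of L1's items.
    c2 = [frozenset(pair) for pair in combinations(l1_items, 2)]

    # Compare with the minimum support count to obtain L2.
    l2 = {s: support_count(s) for s in c2 if support_count(s) >= min_sup_count}
    # -> {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2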



Step-3: K=3

(I) Generate candidate set C3 using L2 (the join step). The condition for joining two itemsets from Lk-1 is that they have (k-2) items in common, so here, for L2, the first item should match.

The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.

Check whether all subsets of these itemsets are frequent, and if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.) Only {I1, I2, I3} and {I1, I2, I5} survive the pruning.

Find the support count of these remaining itemsets by searching the dataset.


(II) Compare each candidate's (C3) support count with the minimum support count (here min_support = 2; itemsets whose support count is less than min_support are removed). This gives us the itemset L3:

Itemset          Support count
{I1, I2, I3}     2
{I1, I2, I5}     2
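The pruning step is where the Apriori property does its work. A sketch of the k = 3 round follows; for brevity the join is written as a set union with a size check rather than the prefix join described above, which yields the same result after pruning:

    from itertools import combinations

    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
        {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
        {"I1", "I2", "I3"},
    ]
    min_sup_count = 2
    l2 = {frozenset(s) for s in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                                 ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def has_infrequent_subset(candidate, prev_level, k):
        """Apriori prune: every (k-1)-subset of a candidate must be frequent."""
        return any(frozenset(s) not in prev_level
                   for s in combinations(candidate, k - 1))

    # Join step: 3-item unions of pairs of L2 itemsets.
    c3 = {a | b for a in l2 for b in l2 if len(a | b) == 3}

    # Prune, then compare with the minimum support count to obtain L3.
    l3 = {s: support_count(s) for s in c3
          if not has_infrequent_subset(s, l2, 3)
          and support_count(s) >= min_sup_count}
    # -> {I1,I2,I3}: 2, {I1,I2,I5}: 2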


Step-4: K=4

Generate candidate set C4 using L3 (the join step). The condition for joining two itemsets from Lk-1 (here K = 4) is that they have (K-2) = 2 items in common, so, for L3, the first two items should match.

Check whether all subsets of these itemsets are frequent. Here the only itemset formed by joining L3 is {I1, I2, I3, I5}, and one of its subsets, {I1, I3, I5}, is not frequent, so it is pruned. C4 is therefore empty.

We stop here because no further frequent itemsets can be found.
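Putting the passes together, a compact level-wise loop in the same sketch style stops exactly as described, when a round yields no frequent itemsets (again an illustrative sketch, not an optimized implementation):

    from itertools import combinations

    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
        {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
        {"I1", "I2", "I3"},
    ]
    min_sup_count = 2

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def apriori():
        # L1: frequent individual items.
        items = {i for t in transactions for i in t}
        level = {frozenset([i]) for i in items
                 if support_count(frozenset([i])) >= min_sup_count}
        k, all_frequent = 1, {}
        while level:
            all_frequent.update({s: support_count(s) for s in level})
            k += 1
            # Join: k-item unions of current frequent itemsets.
            candidates = {a | b for a in level for b in level if len(a | b) == k}
            # Prune candidates with an infrequent (k-1)-subset, then count.
            level = {c for c in candidates
                     if all(frozenset(s) in level for s in combinations(c, k - 1))
                     and support_count(c) >= min_sup_count}
        return all_frequent

    frequent = apriori()
    # Yields L1, L2 and L3 from the steps above; the loop ends at k = 4,
    # where no candidate survives pruning.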

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.

  • Confidence
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)
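In code this formula is a one-line ratio of two support counts. A sketch, reusing the support_count helper from the earlier snippets:

    def support_count(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t)

    def confidence(a, b, transactions):
        """Confidence of the rule A => B: sup(A ∪ B) / sup(A)."""
        return support_count(a | b, transactions) / support_count(a, transactions)

For example, with the 9-transaction dataset above, confidence(frozenset({"I1", "I2"}), frozenset({"I3"}), transactions) evaluates to 2/4 = 0.5.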

So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
So rules can be,

[I1 ^ I2] => [I3]  // confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I2) = 2/4 × 100 = 50%
[I1 ^ I3] => [I2]  // confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I3) = 2/4 × 100 = 50%
[I2 ^ I3] => [I1]  // confidence = sup(I1 ^ I2 ^ I3) / sup(I2 ^ I3) = 2/4 × 100 = 50%

[I1] => [I2 ^ I3]  // confidence = sup(I1 ^ I2 ^ I3) / sup(I1) = 2/6 × 100 ≈ 33%
[I2] => [I1 ^ I3]  // confidence = sup(I1 ^ I2 ^ I3) / sup(I2) = 2/7 × 100 ≈ 29%
[I3] => [I1 ^ I2]  // confidence = sup(I1 ^ I2 ^ I3) / sup(I3) = 2/6 × 100 ≈ 33%

Note that none of these rules meets the 60% minimum confidence stated earlier; if the minimum confidence is instead set to 50%, the first three rules can be considered strong association rules.
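These six rules can also be enumerated mechanically: every non-empty proper subset of the frequent itemset serves as an antecedent, with the remaining items as the consequent. A sketch (percentages here are rounded rather than truncated):

    from itertools import combinations

    transactions = [
        {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
        {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
        {"I1", "I2", "I3"},
    ]

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    itemset = frozenset({"I1", "I2", "I3"})
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            a = frozenset(antecedent)
            b = itemset - a
            conf = support_count(itemset) / support_count(a)
            print(f"{sorted(a)} => {sorted(b)}  confidence = {conf:.0%}")
    # Reproduces the six rules above, e.g. [I1, I2] => [I3] at 50%.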

  • The entire algorithm can be divided into two steps:

Step 1: Apply the minimum support threshold to find all the frequent itemsets with k items in the database.

Step 2: Use the self-join rule to find the frequent itemsets with k+1 items from the frequent k-itemsets. Repeat this process from k = 1 until the self-join rule can no longer be applied.

This approach of extending frequent itemsets one item at a time is called the "bottom-up" approach.









  • Advantages:
  1.  Easy to understand and implement
  2.  Can be used on large itemsets

  • Disadvantages
  1.  At times, a large number of candidate rules is generated, which can become computationally expensive.
  2.  Calculating support is also expensive because the calculation has to go through the entire database.
  • Limitations
  1.  The process can sometimes be very tedious.

How to Improve the Efficiency of the Apriori Algorithm?

    Use the following methods to improve the efficiency of the Apriori algorithm.
  1. Transaction Reduction – A transaction that contains no frequent k-itemset becomes useless in subsequent scans and can be removed.
  2. Hash-based Itemset Counting – A k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent and can be excluded.
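As a sketch of the first idea, transaction reduction is a one-line filter applied between passes; here `level` stands for the set of frequent k-itemsets found in the current pass (the function name is illustrative):

    def reduce_transactions(transactions, level):
        """Keep only transactions containing at least one frequent itemset
        from the current pass; the rest cannot contain any larger frequent
        itemset, since all its k-subsets would have to be frequent."""
        return [t for t in transactions if any(s <= t for s in level)]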






