Case Study – Apriori algorithm
The Apriori algorithm is an Association Rule Mining (ARM) algorithm for Boolean association rules. It exploits prior knowledge of the frequent itemset property (the Apriori property), which states that all nonempty subsets of a frequent itemset must also be frequent. At every iteration the algorithm uses two functions: candidate generation and pruning.
In general, an association rule is an expression of the form X⇒Y, where X, Y ⊆ I. Here, X is called the antecedent and Y the consequent. An association rule shows how often Y occurs given that X has occurred, subject to the minimum support (s) and minimum confidence (c) thresholds.
ARM Measures
Support: The support of the rule X⇒Y in the transaction database D is the support of the itemset X ∪ Y in D:
support(X⇒Y) = count(X ∪ Y) / N –––> (1)
where ‘N’ is the total number of transactions in the database and count(X ∪ Y) is the number of transactions that contain X ∪ Y.
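Equation (1) can be sketched directly in Python. The four-transaction list below is a hypothetical toy example, not the case-study dataset:

```python
# Toy transaction list (hypothetical data, for illustration only)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Butter', 'Milk'},
    {'Chips'},
    {'Bread', 'Milk', 'Banana'},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)  # itemset ⊆ t
    return count / len(transactions)

# support(X ⇒ Y) = count(X ∪ Y) / N
print(support({'Bread'} | {'Milk'}, transactions))  # 3 of 4 transactions -> 0.75
```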
Confidence: The confidence of the rule X⇒Y in the transaction database D is the ratio of the number of transactions in D that contain X ∪ Y to the number of transactions that contain X in D:
confidence(X⇒Y) = count(X ∪ Y) / count(X) = support(X ∪ Y) / support(X) –––> (2)
It basically denotes the conditional probability P(Y|X).
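Equation (2) can be sketched the same way, again over a hypothetical toy transaction list:

```python
# Toy transaction list (hypothetical data, for illustration only)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Butter', 'Milk'},
    {'Chips'},
    {'Bread', 'Milk', 'Banana'},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(X, Y, transactions):
    # confidence(X ⇒ Y) = count(X ∪ Y) / count(X), i.e. P(Y|X)
    return count(X | Y, transactions) / count(X, transactions)

print(confidence({'Bread'}, {'Milk'}, transactions))    # every Bread basket has Milk -> 1.0
print(confidence({'Bread'}, {'Butter'}, transactions))  # 1 of 3 Bread baskets has Butter
```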
Lift: The lift of the rule X⇒Y, often referred to as an interestingness measure, takes the prior probability of the rule consequent into account as follows:
lift(X⇒Y) = support(X ∪ Y) / (support(X) ∗ support(Y)) –––> (3)
The measure ‘lift‘ is newly introduced here. Its significance in ARM is given below:
- lift(X⇒Y) = 1 means that there is no correlation between X and Y,
- lift(X⇒Y) > 1 means that there is a positive correlation between X and Y, and
- lift(X⇒Y) < 1 means that there is a negative correlation between X and Y.
A greater lift value indicates a stronger association. We will use this measure in our experiment.
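The three lift cases above can be sketched on the same hypothetical toy transactions:

```python
# Toy transaction list (hypothetical data, for illustration only)
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Butter', 'Milk'},
    {'Chips'},
    {'Bread', 'Milk', 'Banana'},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(X, Y, transactions):
    # lift(X ⇒ Y) = support(X ∪ Y) / (support(X) * support(Y))
    return support(X | Y, transactions) / (
        support(X, transactions) * support(Y, transactions))

print(lift({'Bread'}, {'Milk'}, transactions))   # > 1: positive correlation
print(lift({'Bread'}, {'Chips'}, transactions))  # < 1 (here 0): negative correlation
```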
Dataset Description
The following dataset (transaction.csv) contains the transactional records of a departmental store on a particular day. The dataset has 30 records and six items: Juice, Chips, Bread, Butter, Milk, and Banana. A snapshot of the dataset, viewed in MS Excel, is given below.
transaction.csv
Juice | Chips | Bread | Butter | Milk | Banana |
Juice | Bread | Butter | Milk | ||
Bread | Butter | Milk | |||
Chips | Banana | ||||
Juice | Chips | Bread | Butter | Milk | Banana |
Juice | Chips | Milk | |||
Juice | Chips | Bread | Butter | Banana | |
Juice | Chips | Milk | |||
Juice | Bread | Banana | |||
Juice | Bread | Butter | Milk | ||
Chips | Bread | Butter | Banana | ||
Juice | Butter | Milk | Banana | ||
Juice | Chips | Bread | Butter | Milk | |
Juice | Bread | Butter | Milk | Banana | |
Juice | Bread | Butter | Milk | Banana | |
Juice | Chips | Bread | Butter | Milk | Banana |
Chips | Bread | Butter | Milk | Banana | |
Chips | Butter | Milk | Banana | ||
Juice | Chips | Bread | Butter | Milk | Banana |
Juice | Bread | Butter | Milk | Banana | |
Juice | Chips | Bread | Milk | Banana | |
Juice | Chips | ||||
Bread | Butter | Banana | |||
Bread | Butter | Milk | Banana | ||
Juice | Chips | ||||
Bread | Butter | Banana | |||
Chips | Bread | Butter | Milk | Banana | |
Juice | Bread | Butter | Banana | ||
Chips | Bread | Butter | Milk | Banana | |
Chips | Bread | Butter | Banana |
Python Environment Setup
Before we start coding, we need to install the ‘apyori’ module first.
pip install apyori
This step is mandatory because the ‘apriori’ function we use below is a member of the ‘apyori’ module.
Implementation of Apriori algorithm
We provide here an implementation of the Apriori algorithm in Python. The objective is to discover association rules whose support, confidence, and lift are greater than or equal to min_support, min_confidence, and min_lift, respectively. See the code below.
arm.py
# Step 1: Import the libraries
import pandas as pd
from apyori import apriori
# Step 2: Load the dataset
df = pd.read_csv('transaction.csv', header=None)
# Step 3: Display statistics of records
print("Display statistics: ")
print("===================")
print(df.describe())
# Step 4: Display shape of the dataset
print("\nShape:",df.shape)
# Step 5: Convert dataframe into a nested list
database = []
for i in range(0, 30):
    # missing cells are NaN in the dataframe; str() turns them into 'nan'
    database.append([str(df.values[i, j]) for j in range(0, 6)])
# Step 6: Develop the Apriori model
arm_rules = apriori(database, min_support=0.5, min_confidence=0.7, min_lift=1.2)
arm_results = list(arm_rules)
# Step 7: Display the number of rule(s)
print("\nNo. of rule(s):",len(arm_results))
# Step 8: Display the rule(s)
print("\nResults: ")
print("========")
print(arm_results)
Output:
Display statistics:
===================
0 1 2 3 4 5
count 19 18 23 23 20 22
unique 1 1 1 1 1 1
top Juice Chips Bread Butter Milk Banana
freq 19 18 23 23 20 22
Shape: (30, 6)
No. of rule(s): 1
Results:
========
[RelationRecord(items=frozenset({'Butter', 'Bread', 'Milk'}), support=0.5,
ordered_statistics=[OrderedStatistic(items_base=frozenset({'Bread', 'Milk'}),
items_add=frozenset({'Butter'}), confidence=0.9375, lift=1.2228260869565217)])]
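The raw list output above is dense, so a small formatting loop helps. This is a minimal sketch; the namedtuple definitions below only mock the record fields shown in the output (items, support, ordered_statistics, items_base, items_add, confidence, lift) so the snippet runs even without apyori installed:

```python
from collections import namedtuple

# Mock of the apyori result structure, mirroring the fields printed above
RelationRecord = namedtuple('RelationRecord', ['items', 'support', 'ordered_statistics'])
OrderedStatistic = namedtuple('OrderedStatistic',
                              ['items_base', 'items_add', 'confidence', 'lift'])

arm_results = [RelationRecord(
    items=frozenset({'Butter', 'Bread', 'Milk'}), support=0.5,
    ordered_statistics=[OrderedStatistic(
        items_base=frozenset({'Bread', 'Milk'}),
        items_add=frozenset({'Butter'}),
        confidence=0.9375, lift=1.2228260869565217)])]

# Print each rule on one readable line
for record in arm_results:
    for stat in record.ordered_statistics:
        rule = f"{sorted(stat.items_base)} => {sorted(stat.items_add)}"
        print(f"{rule}: support={record.support:.2f}, "
              f"confidence={stat.confidence:.4f}, lift={stat.lift:.4f}")
```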
Explanation
The program generates only one rule, based on the user-specified input measures: min_support = 0.5, min_confidence = 0.7, and min_lift = 1.2.
The support value for the rule is 0.5, calculated by dividing the number of transactions containing ‘Butter’, ‘Bread’, and ‘Milk’ by the total number of transactions.
The confidence level for the rule is 0.9375, which shows that out of all the transactions that contain both ‘Bread’ and ‘Milk’, 93.75 % contain ‘Butter’ too.
The lift of 1.22 tells us that ‘Butter’ is 1.22 times more likely to be bought by customers who buy both ‘Bread’ and ‘Milk’ than by an average customer.
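These three numbers can be checked by hand. From the dataset, 15 of the 30 transactions contain {Bread, Butter, Milk} and 16 contain {Bread, Milk}, while the describe() output above shows ‘Butter’ occurring 23 times:

```python
n = 30                        # total transactions
count_bread_butter_milk = 15  # transactions containing Bread, Butter, and Milk
count_bread_milk = 16         # transactions containing Bread and Milk
count_butter = 23             # transactions containing Butter (freq from describe())

support_rule = count_bread_butter_milk / n                    # equation (1) -> 0.5
confidence_rule = count_bread_butter_milk / count_bread_milk  # equation (2) -> 0.9375
lift_rule = confidence_rule / (count_butter / n)              # equation (3) -> ~1.2228

print(support_rule, confidence_rule, lift_rule)
```

All three values agree with the RelationRecord printed by the program.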