Association Rule Learning – Affinity Analysis

Association Rule Learning is a rule-based machine learning approach for finding relationships between items in large datasets. The technique is intended to find strong rules based on customers' past purchase behaviour. For example: when arranging inventory, our objective is to place together items that are most likely to be purchased as a group.

Affinity analysis is a data mining technique used to discover co-occurrences of items bought in groups or individually. In retail, affinity analysis concepts are often applied to perform Market Basket Analysis, which identifies the behaviour of customers who are likely to purchase products in retail outlets or online stores. Market Basket Analysis, also known as MBA, is often used in applications such as finding similar patterns for product segmentation, marketing campaigns, sales promotions, and deals and discounts on products.

As another example, consider the rule {Pizza, Ketchup} => {Coke}: if you are ordering pizza online, you would get a recommendation to add ketchup and Coke to your cart. This kind of information supports strong decisions when building a marketing strategy for product placement or discounts on products to maximize profit.

Main Concepts in Association Rule Learning:

For simplicity of discussion, we use the following assumptions:

X, Y = Itemsets, e.g. {Milk, Bread, Butter, Beer, Diapers}
T = Total number of transactions in the dataset (here, 5 orders); the set of transactions can also be denoted as the database.

  1. Support
  2. Confidence
  3. Lift

Example Database:

Order ID | Milk | Bread | Butter | Beer | Diapers
---------|------|-------|--------|------|--------
1        | 1    | 1     | 0      | 0    | 0
2        | 0    | 0     | 1      | 0    | 0
3        | 0    | 0     | 0      | 1    | 1
4        | 1    | 1     | 1      | 0    | 0
5        | 0    | 1     | 0      | 0    | 0
Fig: Simple table of items bought, associated with each Order ID

Support: Support is an indication of how frequently the items are bought, whether together or in combination with other items.

Support Formula: Support(X) = Number of transactions containing X / Total number of transactions (T)

For the itemset {Beer, Diapers}, Support = 1 / 5 (20%), since the pair occurs together in 1 of the 5 transactions. This helps to put some restrictions on the rule search: we can set a minimum support threshold so that only sufficiently frequent itemsets are considered when deciding which items to place together in association rule learning.
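
To make the calculation concrete, here is a minimal Python sketch of support over the example database above (the `transactions` list and `support` function are illustrative names, not a library API):

```python
# The example database from the table above, one set of items per order
transactions = [
    {"Milk", "Bread"},            # Order 1
    {"Butter"},                   # Order 2
    {"Beer", "Diapers"},          # Order 3
    {"Milk", "Bread", "Butter"},  # Order 4
    {"Bread"},                    # Order 5
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)  # <= is the subset test
    return count / len(transactions)

print(support({"Beer", "Diapers"}, transactions))  # 0.2, i.e. 20%
```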

Confidence: Confidence is an indication of how often items A and B are bought together, given how frequently item A is bought. It is the conditional probability that a randomly selected transaction will include all the items in the consequent, given that it includes all the items in the antecedent. The confidence score plays an important part in association rule learning when deciding how reliable a rule is.

Confidence Formula: Confidence(A => B) = Frequency(A, B) / Frequency(A)

The higher the confidence, the more likely it is that the items in the consequent are bought along with the antecedent.

Example: {Bread, Butter} => {Milk} has a confidence of 0.2 / 0.2 = 1, which means that whenever bread and butter were bought together, milk was bought with them 100% of the time.
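
Reusing the `transactions` list and `support()` function from the support sketch above, confidence is a short extension:

```python
def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent:
    Support(A and B together) / Support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(confidence({"Bread", "Butter"}, {"Milk"}, transactions))  # 1.0
```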

Lift: Lift is an indication of how much more likely item B is to be bought when item A is purchased, compared with how often item B is bought on its own. This corrects the confidence score for the baseline popularity of the consequent.

Lift Formula: Lift(A => B) = Support(A, B) / (Support(A) × Support(B))

Lift Score Threshold values in Association Rule Learning:

  • If lift = 1, A and B are independent: buying item A tells us nothing about item B.
  • If lift > 1, there is a positive relationship between A and B: item B is more likely to be bought when item A is bought.
  • If lift < 1, there is a negative relationship between A and B: item B is less likely to be bought when item A is bought.
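
Again building on the `support()` function from the sketch above, lift follows directly from the formula; milk appears in 2 of the 5 orders, so Support({Milk}) = 0.4:

```python
def lift(antecedent, consequent, transactions):
    """Support(A and B together) / (Support(A) * Support(B))."""
    return (support(antecedent | consequent, transactions)
            / (support(antecedent, transactions)
               * support(consequent, transactions)))

# 0.2 / (0.2 * 0.4) = 2.5 > 1, a positive relationship
print(round(lift({"Bread", "Butter"}, {"Milk"}, transactions), 2))  # 2.5
```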

Types of Association Rule Learning

  1. Apriori Algorithm: Apriori is the most widely used algorithm for association rule learning. It performs frequent itemset mining and association rule learning over transactional databases. The algorithm works bottom-up, extending frequent itemsets one item at a time and pruning candidates whose subsets are infrequent. Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently (see the sketch after this list).
  2. Eclat Algorithm: Eclat stands for Equivalence Class Transformation. Eclat uses a depth-first search for discovering frequent itemsets instead of a breadth-first search. It is faster than the Apriori algorithm and well suited for small to medium sized datasets.
  3. FP-Growth Algorithm: FP-Growth stands for Frequent Pattern Growth. FP-Growth is an improvement over the Apriori algorithm: it represents the transactions in a compact frequent pattern tree (FP-tree) and mines frequent itemsets from it without repeated database scans.
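
In practice these algorithms are rarely coded from scratch. As a minimal sketch, the open-source mlxtend library implements both Apriori and FP-Growth over a one-hot encoded pandas DataFrame; the DataFrame below is the example database from this article, and the parameter names follow mlxtend's documented API:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules  # fpgrowth also available

# One-hot encoded version of the example database above
df = pd.DataFrame(
    [
        [1, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 1],
        [1, 1, 1, 0, 0],
        [0, 1, 0, 0, 0],
    ],
    columns=["Milk", "Bread", "Butter", "Beer", "Diapers"],
).astype(bool)

# Frequent itemsets with support >= 20%
frequent = apriori(df, min_support=0.2, use_colnames=True)

# Rules filtered by a minimum confidence; the output includes
# support, confidence, and lift columns for each rule
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```

Swapping `apriori` for `fpgrowth` in the call above yields the same itemsets, usually faster on larger datasets.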

