An association rules add-on for Microsoft Access
Implementation of Apriori Algorithm
Final Report
By
S C Sandur
KiranChand Bobba
Shekar Vemuri
Date: Dec 2nd, 03
Instructor: Dr. Lin
University of Memphis
Objective: The goal of this project is to build a tool that can be used to mine association rules on data stored on a relational database system in MS Access.
Goals:
• At the end of the project we plan to develop a piece of software written in java to mine association rules using Apriori algorithm.
• The system will allow the user to specify which table to use and which attribute to use and if the attributes are not binary/categorical, how they should be grouped.
• The user will be able to submit the minimum support and confidence.
• Mid-term goals- the user will be able to mine association rules which are binary.
Apriori Algorithm:
The goal of apriori algorithm is to mine association rules by finding large itemsets and translating them to the corresponding association rules
A => B, or A1^ A2^Am=> B1^B2^ Bn, where A B=
Some of the terminologies used are confidence, support, k-itemset – a set of k items, large itemsets – the large itemset {A, B} corresponds to the following rules (implications)
A =>B and B=> A.
Step 1 – generate a candidate set of 1-itemsets C1–Every possible 1-element set from the database is potentially a large itemset, because we don’t know the number of its appearances in the database in advance. The task adds up to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set, Now, to scan the entire database, to count the number of appearances for each one of these elements (i.e. one-element sets).
Step 2- •Step 2 – generate a set of large 1-itemsets L1–Each element in C1 with support that exceeds some adopted minimum support becomes a member of L1.
Step 3 – generate a candidate set of large 2-itemsets, C2
C2 = L1 (cross-product) Count the corresponding appearances .
Step 4 – generate a set of large 2-itemsets, L2, Eliminate the candidates
without minimum support by scanning the database.
Step 5- •Do this until some L , Ln (cross-product) Ln=null.
Implementation of Apriori:
Program Classes
Candidate -to store the Candidate
CandidateSet- to store set of Candidates at some level l
FrequentItemSetArray- to store the frequent Item Sets. It stores all frequent ItemSets which satisfy min support at various all Levels.
DataAccesor- to make connection to database and query the database.
GenSets- to generate higher level CandidateSet and verify that generated CandidateSet’s subsets exist in the lower level CandidateSet.
Pruner- to prune all CandidateSets whose support is less than min support.
RuleGenerator- to generate rules from a CandidateSet
Buckets- to create buckets of columns based on users and creates a new table
CategoricalData- This will handle the user inputs for categorizing data and will call buckets.java
InputReader- Class to read configuration file specified by the user
Column Data- Class to store the ranges and buckets specified by the user for each non-categorical data
Database used
The database used is called a “Transactions”. It has sample transactions of supermarket data. Each Transaction is identified by a TransId which is unique. The database consists of binary data which represents yes or no fields.
Handling of non-categorical data
It has also non-categorical data such as customer age, customer income and total cost.
Performance Testing of Apriori
The first testing shows time taken for increasing number of binary columns (the no of columns excluding TransID) from 1 to 12 with fixed support of 0.50 and fixed confidence of 0.25. Here the confidence does not matter
The second testing shows time taken for increasing number of binary columns (the no of columns excluding TransID) when Support is 0.25 and Confidence 0.25.
The third testing shows time taken for increasing support from 0.0 to 1.0 in steps of 0.1( the no of columns excluding TransID) The number of columns are at a fixed number of 8 and Confidence is 0.25. An interesting fact to be noted is that for support 0.0 the execution time rapidly climbs up to 157797 milliseconds but for other support values between than 0.1 to 1.0 the execution time is below 2000 milliseconds.
Extension of the project:
We plan to anlayse and implement AprioriTid as an extension to Apriori. In ApriorTid
Database D is not used for counting support after the first pass. The set Ck is used, for this purpose. Elements of Ck are in the form where each Xk is a potentially large k-itemset present in the transaction with identifier TID. The member of Ck corresponding to transaction t is
Benefit: database gets fewer rows at each stage. Cost: databases generated may be big.
Tasks in progress:
Implementation of Apriori TID
Extensive Testing for Apriori TID
FUP Algorithm for incrementing Database
References:
o "Data Mining: Concepts and Techniques", First edition, Morgan Kaufmann Publishers
o www.almaden.ibm.com/cs/quest/