Make your own free website on Tripod.com
An association rules add-on for Microsoft Access Implementation of Apriori Algorithm Final Report By S C Sandur KiranChand Bobba Shekar Vemuri Date: Dec 2nd, 03 Instructor: Dr. Lin University of Memphis Objective: The goal of this project is to build a tool that can be used to mine association rules on data stored on a relational database system in MS Access. Goals: • At the end of the project we plan to develop a piece of software written in java to mine association rules using Apriori algorithm. • The system will allow the user to specify which table to use and which attribute to use and if the attributes are not binary/categorical, how they should be grouped. • The user will be able to submit the minimum support and confidence. • Mid-term goals- the user will be able to mine association rules which are binary. Apriori Algorithm: The goal of apriori algorithm is to mine association rules by finding large itemsets and translating them to the corresponding association rules A => B, or A1^ A2^Am=> B1^B2^ Bn, where A B= Some of the terminologies used are confidence, support, k-itemset – a set of k items, large itemsets – the large itemset {A, B} corresponds to the following rules (implications) A =>B and B=> A. Step 1 – generate a candidate set of 1-itemsets C1–Every possible 1-element set from the database is potentially a large itemset, because we don’t know the number of its appearances in the database in advance. The task adds up to identifying (counting) all the different elements in the database; every such element forms a 1-element candidate set, Now, to scan the entire database, to count the number of appearances for each one of these elements (i.e. one-element sets). Step 2- •Step 2 – generate a set of large 1-itemsets L1–Each element in C1 with support that exceeds some adopted minimum support becomes a member of L1. Step 3 – generate a candidate set of large 2-itemsets, C2 C2 = L1 (cross-product) Count the corresponding appearances . Step 4 – generate a set of large 2-itemsets, L2, Eliminate the candidates without minimum support by scanning the database. Step 5- •Do this until some L , Ln (cross-product) Ln=null. Implementation of Apriori: Program Classes Candidate -to store the Candidate CandidateSet- to store set of Candidates at some level l FrequentItemSetArray- to store the frequent Item Sets. It stores all frequent ItemSets which satisfy min support at various all Levels. DataAccesor- to make connection to database and query the database. GenSets- to generate higher level CandidateSet and verify that generated CandidateSet’s subsets exist in the lower level CandidateSet. Pruner- to prune all CandidateSets whose support is less than min support. RuleGenerator- to generate rules from a CandidateSet Buckets- to create buckets of columns based on users and creates a new table CategoricalData- This will handle the user inputs for categorizing data and will call buckets.java InputReader- Class to read configuration file specified by the user Column Data- Class to store the ranges and buckets specified by the user for each non-categorical data Database used The database used is called a “Transactions”. It has sample transactions of supermarket data. Each Transaction is identified by a TransId which is unique. The database consists of binary data which represents yes or no fields. Handling of non-categorical data It has also non-categorical data such as customer age, customer income and total cost. Performance Testing of Apriori The first testing shows time taken for increasing number of binary columns (the no of columns excluding TransID) from 1 to 12 with fixed support of 0.50 and fixed confidence of 0.25. Here the confidence does not matter The second testing shows time taken for increasing number of binary columns (the no of columns excluding TransID) when Support is 0.25 and Confidence 0.25. The third testing shows time taken for increasing support from 0.0 to 1.0 in steps of 0.1( the no of columns excluding TransID) The number of columns are at a fixed number of 8 and Confidence is 0.25. An interesting fact to be noted is that for support 0.0 the execution time rapidly climbs up to 157797 milliseconds but for other support values between than 0.1 to 1.0 the execution time is below 2000 milliseconds. Extension of the project: We plan to anlayse and implement AprioriTid as an extension to Apriori. In ApriorTid Database D is not used for counting support after the first pass. The set Ck is used, for this purpose. Elements of Ck are in the form where each Xk is a potentially large k-itemset present in the transaction with identifier TID. The member of Ck corresponding to transaction t is Benefit: database gets fewer rows at each stage. Cost: databases generated may be big. Tasks in progress: Implementation of Apriori TID Extensive Testing for Apriori TID FUP Algorithm for incrementing Database References: o "Data Mining: Concepts and Techniques", First edition, Morgan Kaufmann Publishers o www.almaden.ibm.com/cs/quest/