Analysis Process MILAR: Mining Indirect Least Association Rule Algorithm

Indirect association rules are among the interesting and meaningful pieces of information hidden in transactional databases. An indirect association rule captures a high dependency between two items that rarely occur together but are indirectly linked through other items. Because such rules are nontrivial, they can offer a new perspective on relationships that cannot be observed directly from common rules. We therefore propose an algorithm for Mining Indirect Least Association Rule (MILAR) and evaluate it on real and benchmark datasets. MILAR embeds our scalable least measure, Critical Relative Support (CRS). The experimental results show that MILAR can generate the desired rules, namely least and indirect least association rules. In addition, domain experts can use the results for further analysis and to reveal more interesting findings.


Introduction
The field of data mining is still relatively new and evolving. It stands at the convergence of machine learning and statistics. To date, data mining has been successfully employed in a variety of application domains. Put simply, data mining is the process of extracting new, nontrivial information from large data repositories. More comprehensively, it is about making analysis convenient, scaling analysis algorithms to large databases, and providing data owners with easy-to-use tools that help them navigate, visualize, summarize and model the data [1]. In summary, the ultimate goal of data mining is knowledge discovery. The core models in data mining include association rules, prediction, predictive analysis, data reduction, data exploration and data visualization [2]. Association rules were first introduced by Agrawal et al. [3] in 1993 and have been studied extensively since then [4][5][6][7][8][9][10][11][12].
Association rule mining is an unsupervised learning technique, also known as affinity analysis. In marketing, it is widely known as market basket analysis, which attempts to extract groups of products that tend to be purchased together. Generally, it aims at discovering interesting relationships among sets of items that frequently occur together in a transactional database [13]. Apriori is one of the earliest and most popular algorithms for mining rules and judging their strength. The drawback of this approach, however, is that infrequent or least items are automatically considered unimportant and pruned out during rule generation. Nevertheless, in certain application domains least items may provide new and useful insight into the data, for example in competitive product analysis [14], text mining [15], web recommendation [16] and biomedical analysis [17]. The classical association rule, which is derived from frequent items, is also known as a direct association rule. Its counterpart is the indirect association rule [18], which refers to a pair of items that rarely occur together but whose existence depends highly on the presence of mediator itemsets.
Indirect association was first proposed by Tan et al. [14] for interpreting the value of infrequent patterns and effectively pruning out uninteresting ones. Recently, the problem of indirect association mining has become increasingly important because of its unique contributions in various application domains [18][19][20][21][22]. In the literature, studies on indirect association mining fall into two categories: those proposing more efficient mining algorithms [15,18,22] and those extending the definition of indirect association to different application domains [6,18,19]. Discovering indirect association rules is nontrivial and usually relies on the existing interestingness measures detailed in [14]. However, most of these measures have not been properly evaluated with respect to least association rules. Therefore, in this paper we propose the Mining Indirect Least Association Rule (MILAR) algorithm, which utilizes the strength of the Least Pattern Tree (LP-Tree) data structure [11]. In addition, the Critical Relative Support (CRS) measure [23] is embedded in the algorithm to mine the indirect least association rules among the least rules. Indeed, CRS has been widely employed in measuring least association rules [23][27][28][29][30][31][32][33][34][35].
The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 explains the proposed method. This is followed by a performance analysis through two experiments in Section 4. Finally, the conclusion and future directions are reported in Section 5.

Research Method
Generally, indirect association is closely related to negative association: both deal with itemsets that do not have sufficiently high support. Negative association rules were first pointed out by Brin et al. [24]. The focus of mining negative associations is on finding itemsets that have a very low probability of occurring together. Indirect associations provide an effective way to detect interesting negative associations by discovering only infrequent item pairs that are highly expected to be frequent. To date, the importance of indirect associations between items has been discussed widely in the literature. Tan et al. [14] proposed the INDIRECT algorithm to extract indirect associations between item pairs using the well-known Apriori technique.
There are two main steps involved. First, extract all frequent itemsets using a standard frequent pattern mining algorithm. Second, find the valid indirect associations among the candidate indirect associations generated from the candidate itemsets. Wan et al. [15] introduced the HI-Mine algorithm to mine a complete set of indirect associations. HI-Mine generates the indirect item pair set (IIS) and the mediator support set (MSS) by recursively building the HI-struct from the database. The performance of this algorithm is significantly better than that of the previously developed algorithm on both synthetic and real datasets. The IS measure [25] is used as the dependence measure. Lin et al. [26] proposed GIAMS, an algorithm for mining indirect associations over data streams rather than in a static database environment. GIAMS contains two concurrent processes called PA-Monitoring and IA-Generation. The first process is set off when the user specifies the required window parameters. The second process is activated once the user issues a query about the current indirect associations. The IS measure [25] is again adopted as the dependence measure. Chen et al. [18] proposed MG-Growth, an indirect association algorithm similar to HI-Mine. MG-Growth aims to discover indirect association patterns, and its extended version extracts temporal indirect association patterns. The difference between the two algorithms is that MG-Growth uses a directed graph and bitmaps to construct the indirect item pair set. The corresponding mediator graphs are then generated to derive a complete set of indirect associations. Temporal support and temporal dependence are used as measures in this algorithm.
Kazienko [16] presented the IDARM* algorithm to extract complete indirect association rules. In this algorithm, direct and indirect rules are joined together to form a useful set of indirect rules. Two types of indirect association are proposed: partial and complete. The main idea of IDARM* is to capture transitive pages from user sessions as part of a web recommendation system. A simple measure called Confidence [3] is employed as the dependence measure. Lin et al. [36] introduced the EMIA-LM algorithm for mining indirect association rules over web data streams. EMIA-LM uses a mediator-exploiting search strategy when generating rules. It also adopts a compact data structure, avoids unnecessary data transformation and minimizes memory usage. Preliminary experiments showed that EMIA-LM outperforms HI-mine* on static data in terms of computational speed and memory consumption. Liu et al. [37] suggested the FIARM (Filtering-Based Indirect Association Rule Mining) algorithm to analyze gene microarray data. It is an Apriori-based algorithm that can determine indirect gene associations to assist biologists in finding new insights. The FIARM-Measure is also introduced to help discover indirect association rules among rules that have a negative correlation. In the analysis, Gene Ontology is employed to verify the accuracy of the relationships. Hajian and Domingo-Ferrer [38] proposed a new technique for direct and indirect discrimination prevention in data mining. Two phases are involved: discrimination measurement and data transformation. Discrimination measurement determines alpha-discriminatory rules as well as redlining rules. In data transformation, the original data is transformed by removing direct and/or indirect discriminatory biases. This process ensures that no unfair decision rule can be mined from the transformed data.

Indirect Association Rule
The CRS value lies between 0 and 1. It is obtained by multiplying the larger of the two support ratios (the support of the antecedent divided by that of the consequence, or vice versa) by their Jaccard similarity coefficient. It measures the level of CRS between combinations of Least Items and Frequent Items, appearing either as antecedent or as consequence. Here, Critical Relative Support (CRS) is employed as the dependence measure for 2(a) in order to mine the desired indirect association rules.
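The verbal description of CRS above can be made concrete with a short sketch. The formula below is reconstructed from that description only, so it should be read as an assumption rather than the authors' exact definition; the function name `crs` and the use of raw support counts are illustrative choices. Because the co-occurrence count can never exceed the smaller of the two individual counts, the Jaccard coefficient is at most min/max, which is why the product stays within 0 and 1.

```python
def crs(count_a: int, count_b: int, count_ab: int) -> float:
    """Sketch of Critical Relative Support for a rule A => B.

    count_a  : transactions containing the antecedent A
    count_b  : transactions containing the consequence B
    count_ab : transactions containing both A and B
    (Hypothetical signature; reconstructed from the prose description.)
    """
    if count_a <= 0 or count_b <= 0:
        raise ValueError("A and B must each occur in at least one transaction")
    if not 0 <= count_ab <= min(count_a, count_b):
        raise ValueError("count_ab must lie between 0 and min(count_a, count_b)")
    # Larger of the two support ratios (A over B, or B over A).
    ratio = max(count_a / count_b, count_b / count_a)
    # Jaccard similarity: transactions with both, over transactions with either.
    jaccard = count_ab / (count_a + count_b - count_ab)
    return ratio * jaccard


if __name__ == "__main__":
    # A frequent item (50 occurrences) and a least item (10 occurrences)
    # that always appears together with it.
    print(crs(50, 10, 10))
    # The same pair when they co-occur in only 5 transactions.
    print(crs(50, 10, 5))
```

Working from counts rather than relative supports avoids dividing by the number of transactions, which cancels out in both factors.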

Algorithm Development
Determine Minimum Support. An itemset is a set of items. A k-itemset is an itemset that contains k items. From Definition 6, an itemset is said to be least (infrequent) if its support count is less than the minimum support threshold.
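As a rough illustration of this step, the sketch below splits single items into least and frequent groups. It assumes that support is expressed as a fraction of transactions and that an item is least when its support is strictly below the threshold; the function name `split_items_by_support` and the parameter `min_supp` are hypothetical.

```python
from collections import Counter


def split_items_by_support(transactions, min_supp):
    """Return (least_items, frequent_items) for a list of transactions.

    min_supp is a fraction in [0, 1]; an item is treated as least when
    its relative support is strictly below min_supp (an assumption here).
    """
    n = len(transactions)
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    least = {item for item, c in counts.items() if c / n < min_supp}
    frequent = set(counts) - least
    return least, frequent


if __name__ == "__main__":
    data = [["a", "b"], ["a", "c"], ["a", "b"], ["a"]]
    least, frequent = split_items_by_support(data, min_supp=0.5)
    print(sorted(least), sorted(frequent))
```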
Construct LP-Tree. A Least Pattern Tree (LP-Tree) is a compressed representation of the least itemsets. It is constructed by scanning the dataset one transaction at a time and mapping each transaction onto a new or existing path in the LP-Tree. Only items that satisfy the support threshold (Definitions 6 and 7) are captured and used in constructing the LP-Tree.
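The path-mapping idea can be sketched with a generic prefix tree. This is not the authors' LP-Tree implementation: it is a minimal illustration that assumes the set of retained items is supplied by the caller (the hypothetical `keep` argument) and that items within a transaction are ordered by descending support, with ties broken by item name.

```python
from collections import Counter


class Node:
    """One node of the prefix tree: an item, a count, and child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_prefix_tree(transactions, keep):
    """Insert each transaction as a path, sharing common prefixes."""
    counts = Counter(i for t in transactions for i in set(t) if i in keep)
    root = Node()
    for t in transactions:
        # Keep only retained items, ordered by descending support.
        path = sorted((i for i in set(t) if i in counts),
                      key=lambda i: (-counts[i], i))
        node = root
        for item in path:
            # Reuse an existing branch or start a new one.
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


if __name__ == "__main__":
    data = [["a", "b"], ["a", "c"], ["a", "b"], ["a"]]
    tree = build_prefix_tree(data, keep={"a", "b", "c"})
    a = tree.children["a"]
    print(a.count, {k: v.count for k, v in a.children.items()})
```

Transactions that share a prefix share nodes, which is what makes the tree a compressed representation of the dataset.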
Mining LP-Tree. Once the LP-Tree is fully constructed, the mining process begins using a bottom-up strategy. A hybrid divide-and-conquer method is employed to decompose the task of mining the desired patterns. LP-Tree utilizes the strength of a hash-based method when constructing itemsets in descending order of support.

Construct Indirect Patterns. A pattern is classified as an indirect association pattern if it fulfils two conditions. The first condition is elaborated in Definition 8, which has three sub-conditions. One of them is the mediator dependence measure. CRS from Definition 9 is employed as the mediator dependence measure between itemsets when discovering the indirect patterns.
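Since Definition 8 is not reproduced here, the following sketch falls back on the general formulation of indirect association attributed to Tan et al. [14]: the item pair itself is rare, each item co-occurs often enough with the mediator, and each item is sufficiently dependent on the mediator. Treating this as the paper's exact conditions is an assumption. The names `is_indirect`, `t_s` (itempair support threshold) and `min_crs` are illustrative, and the CRS helper repeats the formula reconstructed from the prose description.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def crs(count_a, count_b, count_ab):
    """CRS sketch: larger support ratio times the Jaccard coefficient."""
    if count_a == 0 or count_b == 0:
        return 0.0
    ratio = max(count_a / count_b, count_b / count_a)
    return ratio * count_ab / (count_a + count_b - count_ab)


def is_indirect(transactions, x, y, mediator, t_s, t_m, min_crs):
    """Check whether items x and y are indirectly associated via `mediator`."""
    n = len(transactions)
    m = set(mediator)
    # Itempair support condition: x and y rarely occur together.
    if support_count(transactions, {x, y}) / n >= t_s:
        return False
    count_m = support_count(transactions, m)
    for item in (x, y):
        count_both = support_count(transactions, m | {item})
        # Mediator support condition: the item co-occurs with the mediator.
        if count_both / n < t_m:
            return False
        # Mediator dependence condition, measured here with CRS.
        count_item = support_count(transactions, {item})
        if crs(count_item, count_m, count_both) < min_crs:
            return False
    return True


if __name__ == "__main__":
    data = [["x", "m"], ["x", "m"], ["y", "m"], ["y", "m"], ["m"], ["z"]]
    print(is_indirect(data, "x", "y", {"m"}, t_s=0.2, t_m=0.3, min_crs=0.5))
    print(is_indirect(data, "x", "z", {"m"}, t_s=0.2, t_m=0.3, min_crs=0.5))
```

In the toy data, x and y never appear in the same transaction but both always appear with m, so the pair qualifies; z never appears with m, so the pair (x, z) does not.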

Result and Discussion
In this section, the analysis compares the total number of association rules extracted by our proposed algorithm, MILAR, under predefined thresholds. In the analysis, three items form a complete association rule: two items as the antecedent and one item as the consequence. The mediator appears as part of the antecedent. We conducted our experiment using two datasets. The experiment was performed on an Intel® Core™ 2 Quad CPU at 2.33 GHz with 4 GB of main memory, running Microsoft Windows Vista. All algorithms were developed in C#.
The first dataset is the language anxiety dataset. It was taken from a survey exploring language anxiety among engineering students at University Malaysia Pahang (UMP). Each item is constructed by combining a survey dimension and its Likert scale value. For example, consider the survey dimension "Anxious in the language class" with Likert scale "1". The item "11" is constructed by combining the survey dimension id (first character) and its Likert scale value (second character). Figure 2 shows the performance analysis on the Language Anxiety dataset. The Minimum Support (minsupp) and the Mediator Support Threshold (t_m) are set to 30% and 10%, respectively. A variety of minimum CRS (min-CRS) values were employed in the experiment. During the performance analysis, 286 least association rules and 152 indirect least association rules were produced. In general, the total number of indirect least association rules decreased as min-CRS increased. However, the totals of least association rules and indirect least association rules did not change when min-CRS was in the range of 0.15 to 0.20.
Table 2 displays the mapping of the original attributes to new attribute ids. Each item is constructed by combining an attribute id and its domain. For example, consider the attribute "Clump Thickness" with domain "1". The item "101" is constructed by combining the attribute id (first two characters) and its domain (third character).
Table 3 shows the mapping for the Mushroom-edible dataset. The detailed domain names for each attribute are not shown. For example, the attribute "cap-shape" can be classified as bell, conical, convex, flat, knobbed or sunken. Since "cap-shape" falls into one of these six categories, its domain ranges from 1 to 6, where 1 represents bell, 2 corresponds to conical, and so forth.
In the Mushroom-edible dataset, each item is constructed by combining an attribute id and its domain. For example, the attribute "cap-shape" with domain "1" yields the item "11", formed from the attribute id (leading characters) and its domain (the character that follows). Figure 4 shows the performance analysis on the benchmark Mushroom-edible dataset. The Minimum Support (minsupp) and the Mediator Support Threshold (t_m) are fixed at 40% and 20%, respectively. Various minimum CRS (min-CRS) values were also used during the experiment. From the analysis, 651 least association rules and 298 indirect least association rules were generated. The total number of indirect least association rules decreased as min-CRS increased.
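The item encoding used for the datasets in this section can be sketched in a few lines. The helper names `encode_item` and `decode_item` are hypothetical, and the decoder assumes that the domain value is always the single trailing character, as in the examples "11" and "101"; an attribute with ten or more domain values would need a separator or fixed-width fields instead.

```python
def encode_item(attribute_id: int, domain_value: int) -> str:
    """Concatenate the attribute id and the domain value into one item code."""
    return f"{attribute_id}{domain_value}"


def decode_item(item: str) -> tuple[int, int]:
    """Split an item code, assuming a single-digit trailing domain value."""
    return int(item[:-1]), int(item[-1])


if __name__ == "__main__":
    # Attribute id 10 with domain value 1, as in the "Clump Thickness" example.
    print(encode_item(10, 1))
    # Attribute id 1 with domain value 1, as in the "cap-shape" example.
    print(encode_item(1, 1))
    print(decode_item("101"))
```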

Conclusion
Mining indirect least association rules from data repositories is a nontrivial and important problem. It is specifically concerned with rare cases and may contribute to discovering new knowledge that cannot easily be obtained through typical association rule approaches. By definition, an indirect least association rule represents a high dependency between two items that rarely occur together in the same transaction but are indirectly linked via another itemset. Therefore, in this paper we proposed the Mining Indirect Least Association Rule (MILAR) algorithm to extract these hidden and interesting rules from data repositories. MILAR embeds a scalable measure called Critical Relative Support (CRS) rather than the common interestingness measures used in data mining. We conducted three experiments on a real dataset and benchmark datasets. The results show that the MILAR algorithm can successfully generate least and indirect least association rules. The information obtained is expected to give domain experts new insight for further investigation and, ultimately, the discovery of new knowledge. In the near future, we plan to apply the MILAR algorithm to several more benchmark and real datasets.