PRIVACY PRESERVING MFI BASED SIMILARITY MEASURE FOR HIERARCHICAL DOCUMENT CLUSTERING P. Rajesh1, G. Narasimha2, N.Saisumanth3 1,3Department of CSE, VVIT, Nambur, Andhra Pradesh, India Email: rajesh.pleti@gmail.com Email: saisumanth.nanduri@gmail.com 2Department of CSE, JNTUH, Hyderabad, Andhra Pradesh, India Email: narasimha06@gmail.com Abstract: The increasing nature of World Wide Web has imposed great challenges for researchers in improving the search efficiency over the internet. Now days web document clustering has become an important research topic to provide most relevant documents in huge volumes of results returned in response to a simple query. In this paper, first we proposed a novel approach, to precisely define clusters based on maximal frequent item set (MFI) by Apriori algorithm. Afterwards utilizing the same maximal frequent item set (MFI) based similarity measure for Hierarchical document clustering. By considering maximal frequent item sets, the dimensionality of document set is decreased. Secondly, providing privacy preserving of open web documents is to avoiding duplicate documents. There by we can protect the privacy of individual copy rights of documents. This can be achieved using equivalence Keywords: Maximal Frequent Item set, Apriori algorithm, Hierarchical document clustering, equivalence relation.
I. INTRODUCTION Document clustering has been studied intensively because of its wide applicability in areas such as web mining, search engines, text mining and information retrieval. The rapid progress of databases in every aspect of human actions has resulted in enormous demand for efficient algorithms for spinning data into Document clustering has undergone through various methods, still document clustering is in its inefficiency state for providing the required information needed by the user exactly and approximately. Suppose the user makes an incorrect If user may not notice his mistakes until he browses into the deep portion of the hierarchy, then it decreases the efficiency of search and increases the number of navigation steps to find relevant documents. So we need a hierarchical clustering that is relatively flat that reduces the number of navigation steps. Therefore there is a great need for new document clustering algorithms, which are more efficient than conventional The increasing nature of World Wide Web has imposed great challenges for researchers to cluster the similar documents over the internet and their by improving the efficiency of search. Search engine uses the getting more confused in selecting the relevant documents among huge volumes of search results returned to a simple query. A potential solution to this problem is to cluster the similar web documents, which helps the user in identifying the relevant data easily The outline of this paper is divided into six sections. section II, briefly discusses related work. We explained our proposed algorithm description including common preprocessing steps and pseudo code of algorithm in section III. It also includes to precisely defining clusters based on maximal frequent item set (MFI) by Apriori algorithm. Section IV, describes exploiting the same maximal frequent item set (MFI) based similarity measure for Hierarchical document clustering with running example. In section V, provides privacy preserving of open web documents by using equivalence relation to protect the individual copy rights of a document.. Section VI, consists of conclusion and future scope.
collection of clusters that is not favorable to interpretation [5, 6]. To minimize the overlapping of documents, Beil, Ester [7] were proposed a method HFTC (Hierarchical Frequent Text Clustering) is another frequent item set based approach to choose the next frequent item sets. But the clustering result depends on the order of choosing next frequent item sets. The resulting hierarchy in HFTC usually contains many clusters at first level. As a result the documents in the same class are to be distributed into different branches of hierarchy, which decreases the overall C.M.Fung [8] has introduced FIHC (Frequent Item set based Hierarchical Clustering) method for document clustering. Which employed, a cluster topic tree is constructed based on the similarity among clusters. FIHC used the efficient child pruning when number of clusters is large and to apply the elaborated The experiment results FIHC actually outperforms all other algorithms (bisecting-k means, UPGMA) in The Apriori algorithm [9] is a well-known method for computing frequent item sets in a transaction database. The document under the same topic, shares more common frequent item sets (terms) than the documents of different topics. The main advantage of using frequent item sets is that it can identify the relation among the more than two documents at a time in a document collection unlike similarity measure between two documents [10, 11].By the means of maximal frequent item sets, the dimensionality of the document set is reduced. More over maximal frequent item sets captures most related document sets. On the other hand, hierarchical clustering most relevant for browsing and maps most specific documents to A conventional hierarchical clustering method constructs the hierarchy by subdividing parent cluster or merging similar children clusters. It usually suffers from its inability to perform tuning once a merge or split decision has been performed. This rigidity may lower the clustering accuracy. Furthermore, due to the fact that a parent cluster in the hierarchy always contains all objects of its Childs, this kind of hierarchy is not suitable for browsing. The user may have difficulty to locate his intention object in such a large Our hierarchical clustering method is completely different. The aim of this paper is, first we form all the clusters by assigning documents to the most similar cluster using maximal frequent item sets by Apriori algorithm and then construct the hierarchical document clustering based on their inter-cluster similarities via same maximal frequent item set (MFI) based similarity measure . The clusters in the resulting hierarchy are non-overlapping. The parent cluster contains only the general documents.
III. ALGORITHM DESCRIPTION In this section, we explained our proposed algorithm description including common preprocessing steps and pseudo code of algorithm. It also includes to precisely defining clusters based on First, we will speak about some common preprocessing steps for representing each document by item sets (terms). Second we will bring in vector space model by assigning weights to terms in all document sets. Finally, we will explain the process of initialization of clusters seeds using MFI to perform hierarchical clustering. Let Ds represents set of all documents in collection of database.
Ds= {d1, d2, d3???dM}: 1 ? i ? M A. Pre-Processing The document set Ds is converted from unstructured format into some common representation using the text preprocessing techniques, in which words or terms are extracted (tokenization). The input data set of documents in Ds are preprocessed using the techniques namely, removing HTML tags first, after a) HTML Tags: parsing of HTML Tag b) Stop words: Remove the stop words list like ?conjunctions, connectives, prepositions etc? c) Stemming algorithm: We utilize porter 2 stemmer algorithm in our approach.
B. Vector representation of document: Vector space model is the most commonly used document representation model in text mining, web mining and information retrieval areas. In this model each document is represented as n-dimensional term vector. The value of each term in the n-dimensional vector reflects the importance of corresponding document. Let N be the total number of terms and M be the number of documents and each the document ?? = (???????? , ???????? , ? ? ? ? . . ???????? ) can ) < ?????b??e????????? denoted as ????(???????????? ??1 1? i? M. Where ?? ??2 ???? ???????????? value. The document frequency is less than the threshold value is considered to avoid the problem of more times a term appears throughout all documents in the whole collection, the more poorly it discriminates between documents [12].Calculate term frequency tf is number of times a term appears in a document. Document frequency of a term df as no of documents that ????=(?????? ???? ??1, ?? , ?? , ? ? . . , ??1 ) contains term. Also construct the weights for Where ?? ???? ? ??1??2 ??(1??3) and ???? documents vectors.
IDf (j) =?????? ???????? 1?j?n.where IDf is the inverse A frequent item set is a set of words which occurs frequently together and are good candidates for document frequency. clusters and are denoted by FI. An item set X is closed such that X ? X1 and t(X) = t(X1), where t(X) defined Table 1: Table Representation of Transactional Database of if there does not exist an item set X1 such that X1, Documents Terms Doc 1 Doc 2 Doc 3 ..... Doc 4 Java 1 1 0 ..... 1 Beans 0 1 0 ..... 0 Servlets 1 0 1 ..... 1 By the representation of document as vector form, we can easily identify which documents Contains the same features .The more features documents have in common, the more related they are. Thus, it is realistic to find well related documents. Assume that each document is an item in the transactional database; each term corresponds to a transaction. Our aim is to search for highly related documents ?appearing? together with same features (the documents whose MFI features are closed). Similarly, the maximal frequent item set discovery in the transaction database serves the purpose of finding items of documents appearing together in many transactions. i.e., document sets C. Apriori for maximal frequent item sets Mining frequent item sets is a primary content of data mining that emphasizes particularly in finding the relation of different items in the large database. Mining frequent patterns is crucial problem in many data mining applications such as the discovery of association rules, correlations, multidimensional patterns, and other numerous important inferring patterns from consumer market basket analysis and web access etc. The association mining problem is formulated as follows: Given a large data base of set of items transactions, find all frequent item sets, where a frequent item set is one that occurs in at least a user- specified threshold value of the data base. Many of the proposed item set mining algorithms are a variant of Apriori, which employs a bottom-up, breadth first Apriori is a conventional algorithm that was first introduced] for mining association rules. Association can be viewed as two-step process as (1) Identifying all frequent item sets (2) Generating strong association rules from the frequent item sets At first, candidate item sets are generated and afterwards frequent item sets are mined with the help of these candidate item sets. In the proposed approach, we have used only the frequent item sets for further processing so that, we undergone only the first step (generation of maximal frequent item sets) of the Apriori algorithm.
as the set of transactions that contain item set X and it is denoted by FCI(frequently closed items).If X is frequent and no superset of X is frequent among the set of items I in transactional databases. Then we say MFI. Then MFI? FCI ? FI Whenever there are very that X is maximal frequent item set and denoted by long patterns are present in the data it is often impractical to generate the entire set if frequent item sets or closed item sets [16]. In that case, maximal We employed maximal frequent item set algorithm from [17] using apriori. These maximal frequent item sets are initial seeds for hierarchical document D. Pseudo code Algorithm For MFI Based Similarity Measure for Hierarchical Document Clustering (tf) Term frequency and (df) document frequency Step 1. For each document in Ds, Remove the HTML Step 2. Calculate the term frequency (tf) and document ???? = (??????????1, ??????????2, ? ? ? ? . . ????????????) Where df???????????? < Threshold value 1?i?M Step 3. Also construct the weighted document vectors ???? = (????1, ??12, ??13, ? ? . . , ??1????) Where ?? = ???? ? for all the documents ???? ???? ??????(??).Idf (j) =?????? ???????? Step 4. Now represent each documents by keywords whose tf>support Calculate the Maximal Frequent Item set(MFI) of ?????? = {??1, ??2, ??3, ? ? ? ? . . ????} terms using Apriori algorithm Where each ???? = {??1, ??2, ??3, ? ? ? ????} Step 5. If a document ???? choose ?? is in more than on??e maximal containing document t??h??enThen Assign?? a=s?? a set frequent item set consisting of such maximal frequent??item sets ??0 the document ???? .For .
Then assign ???? = ??????.Assign the document ????to ???? and discard ???? Repeat this process for all documents that occurs in more than one maximal frequent item set these maximal frequent item sets ?? Step 6. Apply hierarchical document clustering to make and combine the documents in ???? ?? as clusters into a single new document and represent it by centers of the maximal frequent item sets. These are obtained by combining the features of maximal frequent item set of terms that grouping the documents Step 7. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels in hierarchy and stop if total number of documents equals to one else go to step 4.
IV. HIERARCHICAL CLUSTERS BASED ON MAXIMAL FREQUENT ITEM SETS After finding maximal frequent item sets (MFI) by using Apriori algorithm. We turn to describing the creation of hierarchical document clustering using same similarity measure by MFI. A simple instance case of example is also provided to demonstrate the documents ?? entire process. The set of maximal frequent it}em?? sets are ?????? = {?? , ?? , ?? ? . . ?? among the whole collection of by 123 ?? apriorialgorithm ? . . ????} .Where by?? = {?? , ?? , ?? each MFI consist of set of documents represented ?? 1 2 3 .Then consider total number of documents which occurs in maximal frequent item ?????? = ????1, ??2, ??3,??4, ??5, ??6, ??7, ??8, sets in MFI as follows. 9, ??10, ??11, ??12, ??13, ??14, ??15 ??1 = {??2, ??4, ??6} ??23 = {??3, ??4, ??8} = {??1, ??5, ??7} ??4 = {??4, ??2, ??14} ??56 = {??10, ??12, ??15} = {??9, ??11, ??13} The clusters in the resulting hierarchy are non- overlapping. This can be achieved through the Case1: If ????, ???? are same then choose one in random Case2: If ????, ???? in?? , ?? to form cluster.
are diff??ere??nt then form clusters of in ??3, ??5 and ??6 independently. In our documents contain??e??d?? different. So we form a clusters example, the maximal frequent item set of documents according to the documents contained in ???? ???????? ??3 = {??1, ??5, ??7} as one cluster in hierarchy If ???? , ???? consider t3h:e case of document ??2 Case contains some same documents among the documents list obtained from MFI. Let us th4an one maximal fre{q??uent item sets{?? ?? 1, ??2, ??4} ?? 1 4}.Similarly is repeatedin more {????1, ??2, ??4} = {????0, ????1, ????2} documencth??oose???? = ??=????is0 =re??p1eated in doc.ument ?? 4 Then ??in ????????????2 for .Assign For each the maximal frequent4item sets .
containing the from ??0 ????[????????????????(???????????? ( ????, ??4)) ))] calculate the measure > ????????????????(???????????? ( ??????, ??4 document ??4 By using this jaccards measure, we can identify the closest to which maximal frequent item document ??4.Then assign ???? = ?????? set among maximal frequent item sets containing the that ??4 document?? to?? = set ?? ?? L=et'??s and discard4 ?? closed 4 ?? suppose is to the maximal ???? 4 4 for other maximal frequent frequent item . Assign exactly one cluster. Similarly ?? belongs to?? item sets. After this step, each 2document belongs to 1.Repeat this process for all documents that occurs in more than ?? , ?? are repeated in?? , ?? one maximal frequent item set. Since the documents 2 4 1 4. The clusters that will form at the first level of hierarchy by applying step5 and ??1 = {?? , ?? } ??23 = {??23, , ??8} 6 ??4 = {??1, ??5, ??7} = {??4, , ??14} ??56 = {??10, ??12, ??15} 9, ??11, ??13} The hierarchical diagram for the above form of maximal frequent item set clusters can be representing as follows. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels in hierarchy and stop if total number of documents equals to one else go to step 4.
Represent each new document ?????? in hierarchy by maximal frequent item set of terms as centers (as in step 6).These maximal frequent item sets are obtained by combining the features of maximal frequent item set of terms that grouping the documents. Each new document also consisting of corresponding updated ?? weights of maximal frequent item set of terms. Where hie????rarchy?? {?? = ?? } represents that jth document in the level of ?? 12 21 level ?? . In the figure means that the maxim1al frequent item set of terms in 2nd document of level?? are not1matched with other documents MFI set in s}ame .So it is repeated same for th{e??1n3ext ??le2v2el and it is also s{a??me, ??for the document} = . The documents 11 15} and{??14, ??16 in first level are combined using MFI based hierarchical level as ?? , ?? clustering and represent these documents in the second 23 24.
V. PRIVACY PRESERVING OF WEB DOCUMENTS USING EQUIVALENCE RELATION Most internet web documents are publicly available for providing services required by the user. In such documents there is no confidential or sensitive data (open to all). Then how can we provide privacy of such documents. Now a days, same information will be exists in more than one document in duplicate forms. The way of providing privacy preserving of documents is by avoiding duplicate documents. There by we can protect the privacy of individual copy rights of documents. Many duplicate document detection techniques are available such as syntactic, URL based, semantic approaches. In each technique, a processing overhead of maintaining shingling's, signatures, fingerprints [13, 14, 15, 18]. In this paper, we proposed a new technique for avoiding duplicate documents using equivalence relation. Let Ds be the input duplicate document set is subset to web document collection. First find the jaccard similarity measure for every pair of documents in Ds using weighted feature representation of maximal frequent item sets discussed in step 2 and step 3 in algorithm. If the similarity measure of two documents is equal to 1, then the two documents are most similar. If the measure is 0, then they are not duplicates. The Jaccard index or the Jaccard similarity coefficient is a For two sets, it is denoted as the cardinality of their Mathematically??(??1, ??2) = |??1 ? ??2| For every pair of two documents calculate jaccard measure of d1, d2.All the diagonal elements in matrix are ones, because every document mostly related to itself. When we are classifying the documents into equivalence classes, we are not considering these ones and put zeros. Jaccard similarity coefficient matrix for d1 d2 d3 d4 d1 ? 1 0.4 0.8 0.5? ?? R? = d ?0.4 1 0.8 0.4? 2
d 3 ?0.8 0.8 1 0.9? ?? d 4 ?0.5 0.4 0.9 1 ? D = {d , d , d , d } s Wher1e al2pha3 is threshold. Let define a relation R on 4 as the collectionof document pairs value. i.e ?? = {(????, ????)/ ?? (????, ????) ? ???????????????? } whose similarity measure is above some threshold 1. R is reflexive on Ds iff ?? (????, ????) = 1. i.e Every R is symmetric on Ds iff?? ????, ???? = ?? ????, ????i.e if the document ???? is similar to ???? 2. document ???? is also similar to???? then the
3. R is transitive on Ds iff Then R is an equivalence relation on Ds, which partitions the input document set Ds into set of equivalence classes. Equivalence relation seems a natural technique for duplicate document categorization. Any two documents in same equivalence class are related and are different if they are coming from two equivalence classes. The set of High syntactic similarity pairs of documents typically referred to as duplicates or near duplicates except diagonal elements. By using equivalence relation, easily we can identify the duplicate documents or we Apart from the representation of feature document vector by MFI, we also need to consider that who is the author of document, when the document was created, where it is available, helps in effectively finding the duplicate documents. Each document in input Ds must belong to unique equivalence class. If R Then number of equivalence relations on Ds is always lies between n ? | R|? n2. i.e the time complexity of .i.e?? ????, ???? ? 0.8. Since the matrix is symmetric, the Choose the threshold ? in equivalence relation as 0.8 sets {(??3, ??1), (??3, ??2), (??4, ??3)} documents are mostly related. Hence the documents are near duplicates and grouping the documents into clusters thereby providing privacy of individual copy rights of documents.
?0 0 1 0? ?? ?0 0 1 0? R0.8 = ?1 1 0 1? ?? ?0 0 1 0?
VI. CONCLUSION AND FUTURE SCOPE Cluster analysis can be used as powerful ,stranded alone data mining concept that gains insight information of knowledge from huge unstructured databases. Most conventional clustering methods do not satisfy the document clustering requirements such as high dimensionality, huge volumes and easy of accessing meaningful clusters labels. In this paper, we presented novel approach; Maximal frequent item set (MFI) Based Similarity Measure for Hierarchical Dimensionality reduction can be achieved through MFI. By using the same MFI similarity measure in hierarchal document clustering, the number of levels will be decreased. It is easy for browsing. Clustering has its paths in many areas, by applying MFI based techniques to clusters, including data mining, statistics, biology, and machine learning we can get the high quality of clusters. Moreover, by means of maximal frequent item sets, we can predict the most influenced objects of clusters in the entire dataset of applications like business, marketing, world wide web, social networking analysis.
VII. REFEERENCES [1] Ruxixu, Donald Wunsch., ?A Survey of Clustering Algorithms?. In the Proceedings of IEEE Transactions [2] Jain, A.K., Murty, M.N., Flynn, P.J., ?Data Clustering: A Review?. In the Proceedings of ACM Computing [3] Kleinberg, J.M., ?Authoritative Sources in a Hyperlinked Environment?. In the Journal of the ACM, [4] Ling Zhuang, Honghua Dai. (2004). ?A Maximal Frequent Item Set Approach for Web Document Clustering?. In Proceedings of the IEEE Fourth International Conference on Computer and Information Michael, W., Trosset. (2008). ?Representing Clusters: k-Means Clustering, Self-Organizing Maps and Multidimensional Scaling?. Technical Report, Department of Statistics, Indian University, (2000). ?A Comparison of Document Clustering Techniques?. In Proceedings of the Workshop on Text Beil, F., Ester, M., Xu, X. (2002). ?Frequent Term- Based Text Clustering?. In Proceedings of 8th International Conference on Knowledge Discovery and Data mining 2002 (KDD-2002), Edmonton, Alberta, ?Hierarchical Document Clustering using Frequent Item Sets?. In Proceedings SIAM International Conference [9] Agrawal, R., Srikant, R. (1994). ?Fast Algorithms for Mining Association Rules?. In the Proceedings of 20th International Conference on Very Large Data Bases, [10] Liu, W.L., and Zeng, X.S. (2005). ?Document Clustering Based on Frequent Term Sets?. Proceedings [11] Zamir, O., Etzioni, O. (1998). ?Web Document Clustering: A Feasibility Demonstration?. In the [12] Kjersti, (1997). ?A Survey on Personalized Information Filtering Systems for the World Wide Web?. Technical [13] Prasannakumar, J., Govindarajulu, P., ?Duplicate and European Journal of Scientific Research ISSN 1450- 216X Vol.32 No.4 ,2009, pp:514-527 [14] Syed Mudhasir,Y., Deepika,J., ?Near Duplicate Detection and Elimination Based on Web Provenance for Efficient Web Search?. In the Proceedings of International Journal on Internet and Distributed [15] Alsulami, B.S., Abulkhair, F., Essa, E., ?Near Duplicate Document Detection Survey?. In the Proceedings of International Journal of Computer Science and (2001). ?A Maximal Frequent Itemset Algorithm for Transactional Databases?. In the Proceedings of ICDE, 17th International Conference on Data Engineering [17] Murali Krishna, S., Durga Bhavani, S., ?An Efficient Approach for Text Clustering Based On Frequent Item Sets?. European Journal of Scientific Research ISSN [18] Lopresti, D.P. (1999). "Models and Algorithms for Duplicate Document Detection". In the Proceedings of Fifth International Conference on Document Analysis and Recognition 1999 (ICDAR-1999), 20th-22th Sep, pp:297-300.