Discovering Web Document Clustering Using Weighted Score Matrix and Fuzzy Logic

In computer forensic analysis, hundreds of thousands of files are typically examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In particular, clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. Previously used clustering algorithms struggle with issues such as handling outliers and data preparation. The main theme of the clustering technique proposed here is to extract features from web documents using conditional random field methods and to build a fuzzy linguistic topological space based on the associations among those features.

I. INTRODUCTION

The basic motivation for web document clustering is that hundreds of thousands of files are usually examined before a conclusion can be drawn. There is therefore a pressing need for a fast method that can group the required documents. Web documents are complex and heterogeneous.

The need to apply clustering techniques to web documents arises mainly from the fact that the huge number of web pages makes it difficult for users to group them semantically. This need serves many different purposes, such as forensic analysis, mood detection of web users, and many other activities. Experiments performed with different combinations of parameters result in many different instantiations of the algorithms.

Web mining has fuzzy characteristics, so fuzzy clustering is sometimes better suited to it than conventional clustering. There are two basic methods of fuzzy clustering: one, based on fuzzy c-partitions, is called Fuzzy C-Means (FCM) clustering; the other, based on fuzzy equivalence relations, is called fuzzy equivalence clustering. The data mining technique called association analysis, which is useful for discovering interesting relationships hidden in large data sets, is also useful for clustering. Two broad principles are used for association analysis [1]: one is Apriori and the other is the Frequent Pattern (FP) growth principle. FP-growth is a divide-and-conquer strategy that mines the complete set of frequent itemsets without candidate generation. FP-growth outperforms Apriori because Apriori incurs considerable I/O overhead, since it requires several passes over the transaction data set. In this paper a method of web document clustering based on FP-growth and FCM is proposed that helps a search engine retrieve the relevant web documents needed by a user. Documents within an FCM cluster are strongly correlated; however, traditional FCM is sensitive to the initialization of the membership matrix and the cluster centers, and it also needs the number of clusters as an initial parameter. Our approach handles all of this by using FP-growth to initialize FCM.
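
As a concrete illustration, the following is a minimal sketch of the standard FCM update loop in Python with NumPy. The fuzzifier m, the tolerance, and the random initialization are illustrative choices of our own; in the proposed approach, the number of clusters and the initial memberships would instead be seeded by the FP-growth step.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal Fuzzy C-Means; X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random fuzzy membership matrix U; each row sums to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Cluster centers: membership-weighted means of the points.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)  # guard against division by zero
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centers, U_new
        U = U_new
    return centers, U
```

Because the memberships are soft, a document can belong partially to several clusters, which is exactly the fuzzy behavior that motivates using FCM on web documents.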

Text representation is the step of selecting features to represent the text that will be clustered. Feature selection is the process of identifying the most effective subset of the original features to be used in clustering. Feature extraction is the process of applying linear or non-linear transformations to the original features to generate projected features to be used in clustering [3].
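
A common text representation is a TF-IDF term-weight matrix. The sketch below uses scikit-learn's TfidfVectorizer (our choice of library, not one prescribed by the paper) to turn raw documents into feature vectors suitable for clustering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Forensic analysis of seized hard drives",
    "Clustering web documents with fuzzy logic",
    "Fuzzy c-means groups documents by similarity",
]

# Each row of X is one document; each column is one term's weight.
vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix
print(X.shape, vectorizer.get_feature_names_out()[:5])
```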

A number of algorithms, such as k-means and agglomerative clustering, are used for clustering. Previously used algorithms struggle with issues such as handling outliers and data preparation. In this paper, the unstructured documents are first preprocessed into structured data; the idea is then to extract four features: title sentences, numeric words, nouns, and term weights. This makes the method much simpler than other methods. The system then neglects unwanted file extensions. Grouping these score values yields the most accurately clustered documents. Other applications include social media, data mining, trend analysis, the banking sector, market analysis, and so forth.

This paper is organized as follows: Section I is dedicated to the introduction, Section II covers related work, Section III describes the system, and Section IV concludes the paper.

II. RELATED WORK

To put forward the idea of “Discovering Web Document Clustering Using Weighted Score Matrix and Fuzzy Logic”, this paper analyzes concepts from several authors, as summarized below:

N. L. Beebe introduces an approach to overcome traceability issues in the digital forensic investigation process. Digital crime inflicts immense damage on users and systems, and it has now reached a level of sophistication that makes it difficult to track its sources or origins, especially with advancements in networks and the availability of diverse digital devices. Forensics plays a great role in facilitating investigations of illegal activities and inappropriate behavior using scientific investigation frameworks, techniques, and methodologies. Digital forensics was invented to investigate digital devices in the detection of crime. The paper focuses on traceability aspects of the digital forensic investigation process, which consists of exploring complex and huge volumes of evidence and connecting meaningful relationships among them. Its objective is to derive a traceability index as a useful indicator for measuring the accuracy and completeness of discovering the evidence. This index is realized in a model (TraceMap) that helps the investigator trace and map the evidence in order to identify the origin of the crime. Mapping rate, tracing rate, and offender identification rate are used to express the levels of mapping ability, tracing ability, and offender identification ability, respectively. This research has high potential to be extended into other research areas [1].

S. Decherchi analyzed how authorship can be verified using stylometric techniques through the analysis of the linguistic styles and writing characteristics of authors. Stylometry captures behavioral features that a person exhibits during writing, which can be extracted and used to check the identity of the author of online data. Stylometric techniques can achieve high accuracy rates for large documents, yet it remains challenging to identify the author of short documents, particularly when dealing with large author populations. These obstacles must be addressed for stylometry to be usable for checking the authorship of online messages such as text messages, emails, or tweets. The paper takes a few steps toward that goal by proposing a supervised learning technique combined with n-gram analysis for authorship verification of short texts. Experimental evaluation based on the Enron email dataset, involving 87 authors, yields very promising results, with an Equal Error Rate (EER) of 14.35% for message blocks of 500 characters [2].
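
To make the n-gram idea concrete, the sketch below (our illustration, not the authors' code) extracts character n-gram frequencies from a short text; such frequency profiles are typical stylometric features fed to a supervised classifier.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Frequencies of overlapping character n-grams in a text."""
    text = text.lower()
    grams = (text[i:i + n] for i in range(len(text) - n + 1))
    return Counter(grams)

profile = char_ngrams("Please review the attached contract by Friday.")
print(profile.most_common(5))
```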

Dr. T. Nalini explains that clustering is the process of grouping information, where the grouping is done by finding semantic relationships between data items based on their characteristics; the resulting groups are called clusters. A comparative study of clustering algorithms across two different data sets is performed. The results of the clustering algorithms are compared based on the time taken to form the estimated clusters, and the experimental results are depicted as a graph. The conclusion is that the time taken to form the clusters increases as the number of clusters increases. The earlier clustering algorithms take only a few seconds to cluster the data items, whereas simple k-means takes the longest time to perform clustering [3].

George Forman describes how much of the research on speeding up text mining involves algorithmic improvements to induction algorithms, yet for many large-scale applications, such as classifying large document repositories, the time spent extracting word features from the texts can itself greatly exceed the initial training time. The paper presents a fast method for text feature extraction that folds together Unicode conversion, word boundary detection, forced lowercasing, and string hash computation. It shows empirically that the resulting integer hash features yield classifiers with statistical performance equivalent to those built using string word features, while requiring less computation and less memory [4].
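
The sketch below is a minimal illustration of this hashing idea: word boundary detection, lowercasing, and hashing are fused into one pass, and each word is mapped to an integer feature index instead of being stored as a string.

```python
import re

def hashed_features(text, n_buckets=2**20):
    """One pass: find word boundaries, lowercase, hash into a fixed space."""
    counts = {}
    for match in re.finditer(r"[A-Za-z]+", text):
        idx = hash(match.group().lower()) % n_buckets  # integer feature id
        counts[idx] = counts.get(idx, 0) + 1
    return counts

print(hashed_features("The Cat and the cat sat."))
```

Note that Python's built-in hash is randomized between processes, so a stable hash (for example, one built on hashlib) would be needed if the feature indices must persist across runs.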

Giridhar N S explains that Information Retrieval (IR) is essentially a matter of deciding which documents in a collection should be retrieved to satisfy a user's information need, expressed as a query or profile containing search terms, possibly with additional data such as word weights. The retrieval decision is made by comparing the terms of the query with the index terms (important words or phrases) appearing in the document itself. The decision may be binary, or it may consist of estimating the degree of relevance that the document has to the query. The words that appear in documents and in queries often have many morphological variants, so before information is retrieved from the documents, stemming techniques are applied to the target data set to reduce its size, which increases the effectiveness of the IR system. The paper presents a survey of stemming techniques [5].
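
As a quick illustration of stemming (our example, using NLTK's Porter stemmer rather than any particular technique from the survey), morphological variants collapse to a shared stem:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["connect", "connected", "connecting", "connections"]:
    # All four variants reduce to the stem "connect".
    print(word, "->", stemmer.stem(word))
```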

III. SYSTEM DESCRIPTION

The volume of data in the digital world increases exponentially, which directly affects forensic analysis, so there is a pressing need for a fast method that can group the required documents. A number of algorithms, such as k-means and agglomerative clustering, are used for clustering, but these struggle with issues such as handling outliers and data preparation. The system therefore preprocesses the unstructured documents into structured data, and the idea is to extract four features from each document: title sentences, proper nouns, numeric words, and term weights. This makes the method much simpler than other methods. The system then considers only extensions that are rich in text, such as .pdf, .doc, and .txt. Finally, during clustering, the system creates a score matrix by comparing all the documents with one another, yielding a matrix that contains an aggregate feature score for each pair. Grouping these score values yields the most accurately clustered documents.
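
A minimal sketch of the four-feature extraction step might look as follows; the heuristics (the first line as the title, capitalized tokens as proper nouns, raw term frequencies as term weights) are simplifying assumptions of our own, not the paper's exact rules.

```python
import re
from collections import Counter

def extract_features(text):
    """Extract the four document features: title sentence, proper nouns,
    numeric words, and term weights (all heuristics are illustrative)."""
    lines = text.strip().splitlines()
    title = lines[0] if lines else ""
    tokens = re.findall(r"\w+", text)
    # Heuristic: capitalized, non-all-caps tokens approximate proper nouns.
    nouns = {t for t in tokens if t[0].isupper() and not t.isupper()}
    numeric = {t for t in tokens if t.isdigit()}
    weights = Counter(t.lower() for t in tokens)  # term frequencies
    return {"title": title, "nouns": nouns, "numeric": numeric,
            "weights": weights}
```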

The system first runs an interactive web crawler, which parses web pages, collects their data, and saves it in .txt format. The folder in which this web data is stored is given as input to the system, which preprocesses the data to extract features such as term weights, numeric data, title sentences, and nouns from the collected web pages. Fuzzy logic is then applied to obtain a classification pattern over the feature scores, and this is fed to the weighted-matrix method to create semantic clusters of the web documents.
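
The crawler step in this pipeline could be sketched as below, using the requests and BeautifulSoup libraries (our choice; the paper does not specify the crawler's implementation) to fetch a page and save its visible text as a .txt file.

```python
import requests
from bs4 import BeautifulSoup

def fetch_page_text(url, out_path):
    """Download one page, strip the markup, save the plain text."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)

fetch_page_text("https://example.com", "page0.txt")
```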

Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning step and, in some cases, leading to better human interpretation. Feature extraction is related to dimensionality reduction: when the input data to an algorithm are too large to be processed and are suspected to be redundant, they can be transformed into a reduced set of features, also called a feature vector. This process is known as feature extraction. The extracted features are supposed to contain the relevant information from the input data, so that the expected task can be performed using this reduced representation instead of the complete initial data.
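
To illustrate the dimensionality-reduction view, the sketch below projects a TF-IDF matrix down to a small number of latent dimensions with truncated SVD (scikit-learn again being our illustrative choice), so that clustering can run on a compact feature vector instead of the full vocabulary.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fuzzy clustering of forensic text",
        "weighted score matrix for web documents",
        "fuzzy logic scores document similarity"]

X = TfidfVectorizer().fit_transform(docs)   # (3, vocab) sparse matrix
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)                    # (3, 2) dense feature vectors
print(Z.shape)
```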

Fuzzy logic can be used as an interpretation model for the properties of neural networks, as well as for giving a more accurate description of their performance. Fuzzy operators can be conceived of as generalized output functions of computing units. Fuzzy logic can also be used without having to train a specific network with a learning algorithm: an expert in a certain field can sometimes produce a simple set of control rules for a dynamical system with less effort than the work involved in training a neural network.
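
In this system, fuzzy logic maps raw feature scores to linguistic labels. A minimal sketch, assuming triangular membership functions and breakpoints of our own choosing, is shown below.

```python
def triangular(x, a, b, c):
    """Triangular membership: rises from a to peak at b, falls to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(score):
    """Map a normalized feature score in [0, 1] to linguistic degrees."""
    return {
        "low":    triangular(score, -0.01, 0.0, 0.5),
        "medium": triangular(score, 0.0, 0.5, 1.0),
        "high":   triangular(score, 0.5, 1.0, 1.01),
    }

print(fuzzify(0.7))  # degree 0.6 "medium" and 0.4 "high"
```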

A weighted score matrix is used to specify the level of importance of each criterion. Assigning meaning to the weighting factors is subjective; for this reason, the number of weighting factors is kept small.
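
A minimal sketch of the weighted score matrix, assuming per-feature Jaccard similarities and weights of our own choosing, could combine the four feature similarities into one aggregate score per document pair (the feature dictionaries are those produced by the extraction sketch above).

```python
import numpy as np

# Illustrative weights for the four criteria (assumed, not from the paper).
WEIGHTS = {"title": 0.3, "nouns": 0.3, "numeric": 0.1, "weights": 0.3}

def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 1.0 when both sets are empty."""
    return len(a & b) / len(a | b) if a | b else 1.0

def pair_score(f1, f2):
    """Aggregate weighted feature score for one document pair."""
    sims = {
        "title":   jaccard(set(f1["title"].lower().split()),
                           set(f2["title"].lower().split())),
        "nouns":   jaccard(f1["nouns"], f2["nouns"]),
        "numeric": jaccard(f1["numeric"], f2["numeric"]),
        "weights": jaccard(set(f1["weights"]), set(f2["weights"])),
    }
    return sum(WEIGHTS[k] * sims[k] for k in WEIGHTS)

def score_matrix(features):
    """Symmetric matrix of aggregate scores over all document pairs."""
    n = len(features)
    S = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = pair_score(features[i], features[j])
    return S
```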

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). It is a core task of exploratory data mining and a common technique for statistical data analysis, used in many fields including pattern recognition, image analysis, machine learning, IR, and bioinformatics. Cluster analysis is not one particular algorithm but a general task to be solved. It can be achieved by many algorithms that differ significantly in their notion of what constitutes a cluster and in how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem.
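
For the grouping step in this system, one simple option, sketched below with a similarity threshold of our own choosing, is to link every document pair whose aggregate score exceeds the threshold and take the connected components as clusters.

```python
def threshold_clusters(S, threshold=0.6):
    """Connected components over pairs scoring above the threshold."""
    n = len(S)
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]  # depth-first search from each unlabeled seed
        while stack:
            i = stack.pop()
            if labels[i] != -1:
                continue
            labels[i] = cluster
            stack.extend(j for j in range(n)
                         if labels[j] == -1 and S[i][j] >= threshold)
        cluster += 1
    return labels

# Documents 0 and 1 score 0.8 together, so they share a cluster.
print(threshold_clusters([[1.0, 0.8, 0.2],
                          [0.8, 1.0, 0.1],
                          [0.2, 0.1, 1.0]]))  # -> [0, 0, 1]
```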

IV. CONCLUSION

This paper has accumulated most of the techniques of the authors described in Section II on related work. Analyzing all of these methods, it seems that no single method is perfect at providing a solution for “Discovering Web Document Clustering Using Weighted Score Matrix and Fuzzy Logic”.

As an effort toward this, this paper attempts to improve on “Discovering Web Document Clustering Using Weighted Score Matrix and Fuzzy Logic” by introducing a clustering-based technique that extracts features from web documents using conditional random field methods and builds a fuzzy linguistic topological space based on the associations among those features.
