Discovering web document clustering using weighted score matrix and fuzzy logic

Summary:

  • Web document clustering is improved using FP-Growth for initial cluster discovery and fuzzy logic for flexible grouping.
  • Documents are processed via weighted scoring of titles, numeric data, nouns, and TF-IDF similarity, capturing semantic nuances.
  • This method enhances forensic and textual analyses by automatically detecting overlapping topics and determining cluster numbers intuitively.

In computer forensic analysis, hundreds of thousands of electronic files often need to be examined, and much of their content is unstructured text. Analysing such unstructured textual data manually is extremely challenging, so automated techniques are required. In particular, document clustering can greatly facilitate the discovery of useful knowledge from large collections of documents by grouping similar items together. This approach can help forensic investigators quickly identify clusters of related documents (for example, all files about a particular topic or incident), thereby narrowing down the search space. Clustering is also useful beyond forensics – for instance, in understanding user sentiment or grouping web search results – because it automatically organizes complex and heterogeneous text data based on content similarity. However, applying clustering algorithms to web documents poses challenges due to the sheer volume of pages and the semantic diversity across them.

Clustering web documents effectively requires addressing issues such as high dimensionality of text data, the presence of outliers (irrelevant or noisy documents), and the need for meaningful semantic grouping. Traditional clustering algorithms (like k-means or agglomerative hierarchical clustering) often produce crisp clusters, where each document belongs to exactly one cluster. Yet web documents frequently have overlapping topics or ambiguous content, giving fuzzy boundaries between categories. Fuzzy clustering methods, which allow partial membership of documents in multiple clusters, are therefore better suited for web data that exhibits inherent ambiguity. In fact, web mining problems naturally have fuzzy characteristics, so fuzzy clustering can capture nuances that conventional hard clustering misses. There are two primary forms of fuzzy clustering: one based on fuzzy c-partitions, known as Fuzzy C-Means (FCM) clustering, and another based on fuzzy equivalence relations (often achieved by constructing a fuzzy similarity matrix and deriving clusters from it). FCM is widely used but has known limitations: it requires the number of clusters to be specified in advance, and the results can vary depending on the initial choice of cluster centres and membership values.

Another pertinent technique in data mining is association analysis, which discovers interesting relationships (patterns) hidden in large datasets. Association mining algorithms find frequent itemsets – in text analysis, an “itemset” could be a set of words or features that often occur together in documents. There are two broad strategies for association analysis: the classic Apriori algorithm and the more efficient Frequent Pattern Growth (FP-Growth) method. Apriori generates candidate itemsets level by level, making multiple passes over the data, whereas FP-Growth uses a compressed data structure (an FP-tree) and a divide-and-conquer strategy to mine frequent patterns without explicit candidate generation. FP-Growth generally outperforms Apriori on large datasets due to far fewer I/O operations and better use of memory. We can leverage association analysis results for clustering: documents that share frequent itemsets of terms can be viewed as related. In fact, one promising approach is to use the outcomes of FP-Growth to guide the clustering process.
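As a concrete illustration (not part of the original algorithms), the following minimal sketch mines frequent term sets with FP-Growth via the mlxtend library; the document term lists and the support threshold are purely illustrative.

```python
# Sketch: mining frequent term sets with FP-Growth (mlxtend), assuming each
# document has already been reduced to a set of preprocessed terms.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

docs_terms = [
    ["network", "security", "firewall"],
    ["network", "security", "encryption"],
    ["budget", "invoice", "2021"],
    ["invoice", "payment", "2021"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(docs_terms).transform(docs_terms), columns=te.columns_)

# Term sets appearing in at least 50% of the documents.
frequent_sets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent_sets)
```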

Our proposed method combines the strengths of fuzzy clustering with association mining for improved web document clustering. In summary, we first extract important textual features from each document and then compute a weighted score matrix that reflects pairwise document similarities based on those features. Using results from an FP-Growth frequent pattern analysis, we can automatically determine initial cluster groupings or centroids, which helps to overcome FCM’s dependency on initial parameters. We then apply fuzzy clustering (in our case, an FCM-based algorithm enhanced by fuzzy logic rules) on the documents using that initialization. This hybrid approach is designed to produce semantically coherent clusters of web documents without requiring a predefined number of clusters. By handling fuzzy membership, our method allows each document to potentially belong to multiple clusters with varying degrees of membership, mirroring the reality that a web page can be relevant to multiple topics.

A key aspect of our system is feature extraction and weighting for text representation. Rather than relying on every word in a document (which would be noisy and high-dimensional), we identify a set of salient features for each document that capture its content: for example, the document’s title or header, any numeric expressions present, the prominent proper nouns or noun phrases (which often correspond to names of people, places, organizations, etc.), and the overall distribution of terms (for which we compute term weights such as TF-IDF scores). Focusing on these features significantly reduces the complexity of clustering and highlights the most informative parts of the text. We then construct a weighted score matrix where each document is compared with every other document in terms of these feature matches. The score matrix essentially encodes a fuzzy similarity relation among all documents. Finally, we apply fuzzy logic-based clustering on this matrix to group the documents. Because of the fuzzy approach, documents strongly associated by the feature scores will end up in the same cluster, while those with weaker associations can have fractional membership in multiple clusters if appropriate. Through this process, unstructured web text data is transformed into structured clusters of information.

This approach is not only faster (due to dimensionality reduction and efficient pattern mining) but also yields intuitive clusters that an investigator or user can interpret (each cluster can be characterized by the set of features that were common among its documents). In a digital forensic scenario, for example, our method could automatically cluster documents into categories like “financial records”, “communications with specific person X”, “system log files with error messages”, etc., based on the features extracted, thus expediting the investigation. Beyond forensics, such a clustering technique could be applied in domains like social media analysis (grouping posts by topic or sentiment), market research (clustering product reviews or customer feedback by themes), or trend analysis in news and blogs. In all these cases, the combination of weighted feature scoring and fuzzy logic provides a robust means to handle noisy text data and uncover latent groupings.

The remainder of this paper is organized as follows. Related work in Section 2 reviews prior contributions and techniques in document clustering and text mining that our approach builds upon. Section 3 describes our system design, including data preprocessing, feature extraction, the construction of the weighted score matrix, and the fuzzy clustering mechanism. In Section 4, we provide some concluding remarks and discuss how the proposed method improves on existing techniques and can be extended in future research.


Related work

Researchers have long studied methods to effectively cluster and categorize text documents, and many of these efforts inform our approach. In the context of digital forensics, one challenge is ensuring that relationships between pieces of evidence can be traced and validated. For example, Selamat et al. introduced a traceability model called TraceMap to help investigators connect large volumes of evidence and measure the completeness of their findings with a traceability index [1]. This model defines metrics like mapping rate and tracing rate to quantify how well evidence items are linked to the origin of a crime. While not a clustering method per se, this work underscores the importance of systematically handling massive, complex data in investigations. Document clustering can complement such traceability frameworks by automatically grouping related evidence files (e.g. grouping all documents pertaining to a particular suspect or event), thereby enhancing an investigator’s ability to map evidence clusters to aspects of the crime.

Another related field is authorship analysis. Brocardo and colleagues addressed the problem of authorship verification on short text messages using stylometric features (linguistic style markers) and machine learning [2]. They achieved an Equal Error Rate of 14.35% on the challenging Enron email dataset (87 authors, using blocks of 500 characters per test), which was a promising result for that time. This line of research demonstrates how careful feature selection (in this case, character and word n-grams, vocabulary richness, etc.) can significantly improve the analysis of text data. Although authorship verification is a different task than clustering, the emphasis on feature engineering is analogous: in stylometry, choosing the right features (like punctuation frequency or function word usage) is crucial for distinguishing authors, just as in document clustering we must choose features that distinguish topics or content. The success of stylometric techniques on short, noisy texts suggests that even unstructured data can yield to analysis if represented appropriately – a principle we apply by extracting informative features from web documents for clustering.

There has also been extensive work comparing and improving core clustering algorithms for documents. Revathi and Nalini [3] performed a comparative study of various clustering algorithms (including partition-based methods like k-means and hierarchical clustering) across different datasets. Their experiments confirmed that the runtime of clustering algorithms generally increases as the number of clusters grows. Notably, they observed that the simple k-means algorithm tends to be the slowest as clusters increase, and it also has drawbacks such as not handling outliers well and being unsuitable when clusters are of very different sizes or shapes. These findings motivate the search for alternative clustering approaches that can automatically adjust or discover the appropriate number of clusters and be more robust to irregular data distributions. Some researchers have proposed modifications to classic algorithms: for example, an ant-based clustering method (inspired by ant colony optimisation) was developed to automatically discover the number of clusters without requiring it as input [3]. Such advancements are relevant to our work because our approach also strives to alleviate the need for manually choosing the cluster count – in our case by using frequent pattern mining to inform the clustering structure.

The importance of efficient text processing has also been highlighted in prior research. Forman and Kirshenbaum [4] pointed out that, in large-scale text mining tasks, the time spent on feature extraction (reading and processing text) can exceed the time spent by the clustering or classification algorithms themselves. They proposed an extremely fast text feature extraction method that combines multiple steps – Unicode normalisation, case folding, word boundary detection, and hashing of tokens – into a single pass. By using an integer hash of words (a technique often called feature hashing), they drastically reduced memory usage and computation time, while maintaining classification accuracy equivalent to using full string features. This work illustrates that when dealing with massive text data (like web documents or forensic disk images containing many files), optimising preprocessing is vital. In our system, we adopt a similar philosophy: we streamline preprocessing through steps like removing stop words and stemming, and we limit features to a manageable set (title, key nouns, etc.) to speed up processing. Additionally, we compute term weights (such as TF-IDF) efficiently to avoid unnecessary overhead later in clustering. The emphasis on fast, scalable text processing in [4] aligns with our goal of making the clustering process feasible even when examining hundreds of thousands of files.
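To make the hashing idea concrete, the brief sketch below (our illustration, not the implementation from [4]) uses scikit-learn's HashingVectorizer to map tokens directly to integer feature indices in a single pass; the example texts and the number of hash buckets are arbitrary.

```python
# Sketch: feature hashing in the spirit of [4], using scikit-learn's
# HashingVectorizer as a stand-in for the authors' single-pass extractor.
from sklearn.feature_extraction.text import HashingVectorizer

docs = [
    "Network security guidelines for the finance department",
    "Quarterly finance report and budget figures",
]

# Tokens are hashed straight into a fixed number of columns, so no
# vocabulary dictionary has to be built or kept in memory.
vectorizer = HashingVectorizer(n_features=2**18)
X = vectorizer.transform(docs)     # sparse matrix with 2**18 columns
print(X.shape, X.nnz)
```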

Effective feature selection and dimensionality reduction techniques are also well documented in literature. Giridhar et al. presented a survey of stemming algorithms and argued that applying stemming in the preprocessing stage of Information Retrieval (IR) systems can significantly improve performance [5]. By reducing words to their root forms (for example, treating “compute”, “computer”, and “computing” as the same base word “comput”), stemming reduces the dimensionality of the term space and consolidates word variants. This not only saves space but also enhances recall in IR and clustering, since related terms are unified. Their study noted that Porter’s algorithm is one of the most popular stemmers for English due to its simplicity and effectiveness, though they also pointed out its drawbacks (such as sometimes producing stems that are not real words, and being tailored to American English). The general insight here is that text normalisation (including stemming or lemmatisation, handling synonyms, etc.) can improve clustering quality by merging terms with similar meanings and removing superficial differences. Our system incorporates such preprocessing steps (we use a Porter stemmer as well) to ensure that the features we extract from documents focus on meaningful content rather than arbitrary word forms or inflections.

In the specific area of web document clustering, a variety of approaches have been explored. One notable approach by Roul and Sahay [6] combined frequent itemset mining with fuzzy clustering: they used the FP-Growth algorithm to find frequent term sets in a collection of web pages and then utilised those frequent sets to initialise a Fuzzy C-Means clustering algorithm. The frequent term sets essentially indicated groups of documents sharing common themes, thus providing a good estimate of cluster centers and even the likely number of clusters. This hybrid method was shown to improve clustering accuracy and overcome the sensitivity of FCM to initialisation. Our work is directly inspired by such ideas – we likewise use frequent pattern mining results to guide fuzzy clustering. However, we extend that approach by introducing a weighted score matrix of multiple feature types (not just frequent terms) and by applying fuzzy logic rules to refine cluster membership.

Another advancement in web clustering is incorporating semantic knowledge. For instance, Fouad et al. (as referenced in [3]) proposed enhancing the traditional vector space model (VSM) representation of documents with semantic ontologies (like WordNet) to achieve better clustering. By understanding word meanings and relationships, a semantic-based model can cluster documents not just by literal term matching but by concept similarity. Indeed, their semantic clustering model outperformed a standard VSM-based approach in both efficiency and accuracy. Similarly, Cavaliere et al. introduced a fuzzy semantic topological space for document clustering [7]. In their approach, named entities (e.g., specific names and concepts) are first extracted from documents using methods like Conditional Random Fields (a machine learning technique for sequence labeling). Then, co-occurrence relations between these entities are used to construct a hierarchical semantic complex (essentially a network of related concepts). Finally, fuzzy measures are applied on this structure to determine document relevancies to topics and to cluster them accordingly. This approach treats documents as points in a high-level concept space rather than just word-frequency space, and the use of fuzzy relationships captures the uncertainty and overlap between topics. These contemporary works indicate a clear trend: integration of NLP techniques (like entity extraction and semantic ontologies) with fuzzy clustering can greatly enhance clustering results on text data. Our system, while not implementing a full semantic ontology, does incorporate basic forms of NLP (identifying proper nouns and significant terms) and uses fuzzy logic to handle the uncertainty in document similarity.

In summary, the related literature shows the progression from basic clustering algorithms to sophisticated hybrid techniques for text data. Key lessons taken into our work include: the benefit of fuzzy methods for overlapping clusters, the usefulness of frequent pattern mining to inform cluster formation, the critical role of feature extraction and preprocessing in dealing with unstructured text, and the potential of incorporating semantic or contextual information. By building on these ideas, we aim to create a web document clustering system that is fast, accurate, and capable of handling the ambiguity inherent in real-world text data.

System description

Our proposed system for discovering clusters in web documents consists of several stages, as illustrated by our overall architecture. It begins with data acquisition and preprocessing, followed by feature extraction, then constructs a weighted score matrix from the features, and finally performs fuzzy clustering to produce the document groups. In the following subsections, we describe each component of the system in detail.

Data collection and preprocessing

The first step is to gather the set of web documents that need to be clustered. We developed a custom web crawler to automate this process. The crawler takes a starting list of URLs (or a domain) and recursively fetches web pages, following links as needed, to collect a broad range of documents. Each fetched page is parsed to extract its textual content, and the text is saved in a plain text format (for example, as .txt files) in a local repository. Non-textual contents (images, scripts, etc.) are discarded. We also filter the collected files by extension/type at this stage: since our focus is on textual data, the system considers documents with text-rich formats such as .html, .txt, .pdf, and .doc (after converting PDFs and Word documents to text), and ignores binary files or media files that do not contain useful text for clustering.
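For illustration, a crawler of this kind can be sketched with the requests and BeautifulSoup libraries; this is not our production crawler, and the page limit and link filter are placeholders.

```python
# Sketch: a minimal breadth-first crawler that fetches pages, keeps only
# their plain text, and follows hyperlinks up to a page budget.
import urllib.parse
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    seen, queue, pages = set(seed_urls), deque(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(separator=" ", strip=True)  # textual content only
        for a in soup.find_all("a", href=True):                # follow links
            link = urllib.parse.urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```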

Once the raw data is collected, we perform data preprocessing to clean and normalise the text. Preprocessing is an essential step in text mining that improves the quality of the data and reduces noise, thereby enhancing cluster results. In our system, we apply the following preprocessing operations to each document’s text:

  • Special character removal: We strip away punctuation marks and other non-alphanumeric symbols (except those that might be important within certain contexts like part of an email or URL, if needed). This helps eliminate clutter such as commas, brackets, or HTML tags that survived parsing, which do not contribute to content understanding.
  • Stop word removal: Common words in the language (such as “the”, “and”, “is”, “of”, etc. in English) are removed from the text. These words carry little semantic meaning regarding document topics and are present in almost all documents, so removing them focuses the analysis on more meaningful words. We use a standard stop word list for English.
  • Case normalisation: All text is converted to lower-case. This ensures that words are treated uniformly (for example, “Network” and “network” are recognised as the same word, rather than as distinct tokens).
  • Stemming: We apply a stemming algorithm (the Porter stemmer for English) to reduce words to their root or stem form. For instance, words like “connections”, “connected”, and “connecting” would all be reduced to “connect”. Stemming consolidates different grammatical forms of the same base concept, which improves matching of similar documents and reduces the total number of distinct terms. It also addresses issues like British vs. American spelling variants to some extent (though not perfectly for all cases).
  • Tokenisation: After the above steps, we split the text into tokens (typically by whitespace and remaining punctuation). The tokens are the candidate terms/features for analysis.

The result of preprocessing is that each document is transformed into a clean list of standardised tokens (important words). This structured representation of formerly unstructured data makes the subsequent feature extraction more effective. Additionally, by removing noise and normalising text, we reduce the risk of false dissimilarities – for example, two documents talking about “connection” versus “connections” would now clearly reveal their similarity after stemming and stop-word removal.
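A minimal sketch of this preprocessing chain, assuming NLTK's English stop word list and Porter stemmer (the regular expression and the exact ordering of steps are illustrative):

```python
# Sketch of the preprocessing chain: special-character removal, case
# normalisation, tokenisation, stop word removal, and Porter stemming.
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))   # requires nltk.download("stopwords")
STEMMER = PorterStemmer()

def preprocess(text):
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # strip punctuation, lower-case
    tokens = text.split()                              # tokenise on whitespace
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The connections were connected and connecting."))
# -> ['connect', 'connect', 'connect']
```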

Feature extraction

With preprocessed tokens for each document, the next step is to extract a set of informative features that will form the basis of clustering. Rather than using every token (which could still be hundreds or thousands per document), we select specific types of features that capture different aspects of the document’s content. Our system considers four main categories of features for each document:

  1. Title or heading words: If the document has a title (for example, the <title> of an HTML page or the first heading line of a text), we treat the words in the title as a special feature. The title often encapsulates the primary topic of the document. For instance, a web page titled “Network Security Guidelines” immediately indicates the subject matter. We extract such title terms separately because a match or overlap in titles between two documents is a strong indicator of similarity. Even if titles don’t exactly match, key terms in titles are given more weight in our clustering process than other terms, reflecting their importance.
  2. Numeric tokens: We identify all tokens that are purely numeric or appear to be numbers (dates, years, phone numbers, IP addresses, etc.). Sequences of digits can be very meaningful in forensic contexts and other analyses – for example, an account number or a specific year (like “2020”) could link documents. We extract these numeric features because if two documents contain the same prominent number (say an email thread all referencing case number “12345” or multiple files containing the same IP address), that is a potentially significant connection. Numeric data can also include timestamps or amounts that cluster documents by events or transactions.
  3. Proper nouns and key phrases: We examine the tokens and use part-of-speech tagging to detect proper nouns (like names of people, places, organisations) and other nouns that appear to be domain-specific keywords. Nouns typically carry the content information (subjects and objects) in text, whereas verbs and adjectives, after stop-word removal, are fewer. By focusing on nouns, we capture the core topics or entities in each document. Our system may utilise a simple heuristic (e.g. capitalisation patterns) alongside a lexicon or NLP tool to identify proper nouns. We also consider noun phrases if possible (e.g., “machine learning” as a phrase rather than separate “machine” and “learning”). The set of nouns from a document effectively represents the themes or items that the document is about. For clustering, documents that share many proper nouns or key terms are likely related (for example, if two documents both mention “Facebook”, “Instagram”, and “social media”, they probably discuss similar topics).
  4. Term frequency–inverse document frequency (TF-IDF) weights: While the above feature types capture presence or absence of certain important tokens, we also want to account for the overall distribution of words in a document. We compute classic TF-IDF weights for the remaining terms in each document (after removing stop words and applying stemming). TF-IDF gives a numerical weight to each term that is high if the term is frequent in that document and relatively rare across the whole corpus. This helps highlight distinctive words in each document. For example, if a document frequently uses the word “encryption” and few other documents do, TF-IDF will assign “encryption” a high weight for that document, indicating it’s a defining term. We create a vector of TF-IDF scores for each document (covering the top terms or all terms above a certain threshold). These vectors will later be used to compute similarity between documents in a more fine-grained way than just matching nouns or title words.

Extracting these four categories of features yields a feature profile for each document. Importantly, this set of features is far more compact than the full text: instead of thousands of words, a document might be represented by, say, 5 title words, 3 numbers, 10 proper nouns, and a TF-IDF vector of top 20 weighted terms. This not only makes the clustering computation more efficient but also tends to improve accuracy, because we have filtered out much of the noisy or uninformative content. The quality of the clusters will depend on the quality of features – as a guiding principle, we ensure that each chosen feature type contributes meaningfully. For instance, if the documents are news articles, proper nouns will capture persons and places involved; if they are technical reports, the title and nouns will capture the subject matter; numeric data might capture version numbers or years of publication, etc. By combining different feature types, we also capture different facets of similarity – two documents might not share specific rare terms, but if they were written in the same year by the same organisation, the numeric year and proper noun of the organisation could still group them together.
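The sketch below shows how these four feature categories might be collected; the regular expressions, the capitalisation heuristic for proper nouns, and the field names are simplified illustrations rather than our full extraction pipeline.

```python
# Sketch: title words, numeric tokens, a crude proper-noun list, and corpus-wide
# TF-IDF vectors as the per-document feature profile.
import re

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_features(title, body):
    return {
        "title_terms": set(title.lower().split()),
        "numbers": set(re.findall(r"\b\d[\d./:-]*\b", body)),       # years, IDs, IPs, dates
        "proper_nouns": set(re.findall(r"\b[A-Z][a-z]+\b", body)),  # capitalisation heuristic
    }

# TF-IDF weights are computed over the whole corpus in one pass.
corpus = ["Network security policy, revised 2021 ...",
          "Facebook and Instagram marketing trends ..."]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(corpus)   # row i = TF-IDF vector of document i
features = [extract_features(doc.split(",")[0], doc) for doc in corpus]
```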

Weighted score matrix construction

After feature extraction, the system computes a weighted score matrix that quantifies pairwise similarity between every pair of documents in the collection. This matrix forms the core input to the fuzzy clustering algorithm. The construction of the score matrix involves comparing the feature profiles of documents and assigning a similarity score based on the overlap or match in features, with different weights given to different feature types.

We denote the set of documents as D = {D1, D2, …, DN}.

S = [S(i, j)], 1 ≤ i, j ≤ N

We create an N×N matrix S where entry S(i, j) represents the similarity score between document Di and Dj. We populate this matrix as follows:

Title similarity:

If two documents share one or more title words (or one title is very similar to the other, for example by ignoring minor words), we add a significant contribution to S(i, j). The reasoning is that title terms are highly indicative; thus a common title term might contribute, say, a weight of wt to the score. If titles are exactly the same (which could happen if the documents are duplicates or different versions of the same page), we might even assign a maximum score for that pair.

Numeric similarity:

For numeric features, we check if documents have any numbers in common. For example, if both Di and Dj contain the number “2021”, that’s a match. We give a contribution wn for each shared number. The weight for numeric matches might be set slightly lower than title, assuming numeric context is important but not by itself defining a topic (unless the number is extremely unique, like a specific case ID).

Proper noun / keyword overlap:

We compare the sets of key nouns extracted from the two documents. We compute an overlap score, e.g., by the number of common nouns or using a similarity measure like Jaccard index (|Nouns(Di) ∩ Nouns(Dj)| / |Nouns(Di) ∪ Nouns(Dj)|).

Sim_noun(Di, Dj) = |Nouns(Di) ∩ Nouns(Dj)| / |Nouns(Di) ∪ Nouns(Dj)|

We then scale this by a weight wp. A higher overlap in proper nouns yields a higher similarity score. We might enhance this by considering synonyms or ontological relationships (for instance, if one document mentions “car” and another “automobile”, we count that as a match via a synonym database), but the current system primarily looks at exact or stemmed matches.

Term weight similarity:

Using the TF-IDF vectors of the two documents, we compute a cosine similarity or correlation. This gives a continuous score between 0 and 1 indicating how similar the documents are in terms of their overall important terms. We then multiply this similarity by a weight wv and add it to S(i, j). This part ensures that even if two documents do not share obvious named entities, their general content profile (for example, heavy use of words related to “network security” in both) will be captured.

The weights (wt, wn, wp, wv) are parameters that can be tuned. Assigning meaning to these weighting factors is somewhat subjective and may depend on the use case. In our experiments, we kept the weighting scheme simple to avoid overfitting and to maintain interpretability. For instance, we might set wt = 3, wp = 2, wv = 2, and wn = 1 as an initial guess – giving the title the highest influence, proper nouns and overall term similarity a strong influence, and numeric matches a modest influence. The number of different weighting factors is kept small for simplicity. The intuition is that if two documents have the same title or a very distinctive common term, they should almost certainly be clustered together (hence a high score); if they just share a few common words through TF-IDF similarity, they will get a moderate score; and if they share nothing in common, their score will remain low.

The term-weight similarity is the cosine of the two documents’ TF-IDF vectors:

Sim_tfidf(Di, Dj) = (Vi · Vj) / (‖Vi‖ ‖Vj‖), where Vi and Vj are the TF-IDF vectors of Di and Dj.

By computing all pairwise scores, we end up with the score matrix S. This matrix can be viewed as a kind of fuzzy similarity matrix: entries range over a spectrum (not just 0 or 1), indicating the degree to which each pair of documents is related. The matrix is symmetric (S(i, j) = S(j, i)) and we set S(i, i) = 1 (each document is maximally similar to itself). If needed, we also normalize the scores to a 0–1 range or apply a threshold to focus on strong connections. At this stage, one could already derive clusters by applying a graph-based approach: for example, treat documents as nodes and put an edge between documents if S(i, j) exceeds a threshold, then find connected components or communities. In fact, this is analogous to fuzzy equivalence clustering, where S can be seen as a fuzzy equivalence relation on the set of documents. However, our system goes a step further by using a fuzzy clustering algorithm that can utilise the full matrix (and potentially refine it) rather than a hard threshold.
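For illustration, this thresholding alternative can be sketched with the networkx library (our sketch, not part of the proposed system); the threshold of 0.5 is arbitrary.

```python
# Sketch: crisp groups obtained by thresholding the fuzzy similarity matrix S
# and taking connected components of the resulting document graph.
import networkx as nx

def threshold_clusters(S, threshold=0.5):
    n = len(S)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if S[i][j] >= threshold:
                g.add_edge(i, j)
    return [sorted(component) for component in nx.connected_components(g)]
```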

Combining the components, the overall weighted similarity score between two documents is:

S(i, j) = wt · Sim_title(Di, Dj) + wn · Sim_num(Di, Dj) + wp · Sim_noun(Di, Dj) + wv · Sim_tfidf(Di, Dj)
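A compact sketch of this pairwise scoring is shown below, using the illustrative weights from the text (wt = 3, wn = 1, wp = 2, wv = 2), Jaccard overlap for the set-valued features, and scikit-learn's cosine similarity for the TF-IDF component; the feature dictionaries are assumed to follow the earlier extraction sketch.

```python
# Sketch: building the weighted score matrix from per-document feature profiles
# and TF-IDF rows, then normalising it to the 0-1 range.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

W_TITLE, W_NUM, W_NOUN, W_VEC = 3.0, 1.0, 2.0, 2.0   # wt, wn, wp, wv from the text

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def pair_score(feat_i, feat_j, vec_i, vec_j):
    s  = W_TITLE * jaccard(feat_i["title_terms"],  feat_j["title_terms"])
    s += W_NUM   * jaccard(feat_i["numbers"],      feat_j["numbers"])
    s += W_NOUN  * jaccard(feat_i["proper_nouns"], feat_j["proper_nouns"])
    s += W_VEC   * cosine_similarity(vec_i, vec_j)[0, 0]
    return s

def score_matrix(features, tfidf_matrix):
    n = len(features)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = pair_score(features[i], features[j],
                                           tfidf_matrix[i], tfidf_matrix[j])
    if S.max() > 0:
        S /= S.max()              # normalise scores to the 0-1 range
    np.fill_diagonal(S, 1.0)      # each document is maximally similar to itself
    return S
```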

Fuzzy clustering with weighted scores

The final stage of the system is to perform clustering using fuzzy logic, guided by the weighted score matrix. We effectively combine the matrix-driven approach with the Fuzzy C-Means paradigm. The goal is to output clusters of documents such that for each cluster, documents within it have high scores with each other (high internal similarity), and between different clusters the similarity is lower. Because we allow fuzzy membership, a document can belong to multiple clusters if it has ties to each.

Our approach to fuzzy clustering works as follows. First, we utilise the results of frequent pattern mining (FP-Growth) on the term-document matrix (constructed during feature extraction) to help initialise clusters. Running FP-Growth on our data identifies sets of documents that share frequent terms or features above a certain support threshold. For example, FP-Growth might find that a set of 10 documents all contain the words {“cryptography”, “key”, “algorithm”}, and thus this set could be a natural cluster (documents about cryptographic algorithms). Each such frequent itemset of documents provides a hint for a cluster. We take each large frequent document set as an initial cluster and compute its centroid in the feature space. The centroid here can be a vector representing the average of the TF-IDF vectors of documents in that set (and similarly an aggregated profile of other features). In doing so, we have an initial estimate of both the number of clusters (equal to the number of distinct frequent itemset-based groups discovered) and their centers (the feature centroids).

ci = (1 / |Gi|) · Σ_{Dj ∈ Gi} Vj, where Gi is the set of documents in the i-th frequent-itemset group and Vj is the TF-IDF vector of Dj.
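In code, this initialisation might look like the following sketch, assuming groups holds the document indices of each frequent-itemset group and tfidf_matrix is the corpus TF-IDF matrix from the feature-extraction stage.

```python
# Sketch: one initial centroid per frequent-itemset group, computed as the
# mean TF-IDF vector of the group's member documents.
import numpy as np

def initial_centroids(groups, tfidf_matrix):
    dense = np.asarray(tfidf_matrix.todense())
    return np.vstack([dense[idx].mean(axis=0) for idx in groups])

# e.g. groups = [[0, 1, 5], [2, 3]]  ->  a (2 x vocabulary) array of centroids
```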

With initial cluster centers C = {c1, c2, …, ck} determined, we then apply a fuzzy clustering algorithm akin to FCM. We define a membership matrix U of size k×N, where uij indicates the membership degree of document j in cluster i.

U = [uij], 1 ≤ i ≤ k, 1 ≤ j ≤ N, with uij ∈ [0, 1] and Σ_{i=1..k} uij = 1 for each document Dj.

All memberships are initialised based on the score matrix and initial clusters: for instance, if document Dj was in the frequent itemset that led to cluster i, we start uij with a high value. Other uij can start semi-randomly or based on relative scores (e.g., a document has higher initial membership in the cluster whose centroid is most similar to that document’s feature vector).
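One possible way to seed the memberships along these lines is sketched below; the seed value of 0.9, the use of centroid cosine similarity for the remaining entries, and the variable names are our assumptions, building on the earlier sketches.

```python
# Sketch: seed U (k x N) with high membership for documents in their
# frequent-itemset group and centroid-similarity-based values elsewhere.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def initial_memberships(groups, centroids, tfidf_matrix, seed_value=0.9):
    sims = cosine_similarity(centroids, tfidf_matrix)             # k x N
    U = sims / np.maximum(sims.sum(axis=0, keepdims=True), 1e-9)  # normalise columns
    for i, idx in enumerate(groups):
        U[:, idx] *= (1.0 - seed_value)   # shrink memberships in other clusters
        U[i, idx] += seed_value           # boost the seeding cluster
    return U
```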

We then iterate the fuzzy clustering process: in each iteration, we recompute the cluster centroids ci as the weighted average of document feature vectors, weighted by their membership uij in cluster i.

ci = Σ_{j=1..N} (uij)^m · Vj / Σ_{j=1..N} (uij)^m

And we update the membership values uij based on the distance of document Dj to each cluster centroid relative to others, following the standard FCM formula with a fuzziness parameter m. The fuzziness parameter m > 1 (typically m = 2) controls how soft the assignments are (with m→1 approaching hard clustering). We incorporate the weighted score matrix in the distance calculation: effectively, when deciding uij, we consider not only the direct distance of Dj to centroid ci in feature space, but also the supporting evidence from the score matrix that Dj is similar to other members currently in cluster i. In practice, this can be implemented by modifying the distance metric to something like: disti(Dj) = 1 / Savg(j, i), where Savg(j, i) is the average score of document j with all documents currently strongly in cluster i. This way, if Dj has high scores with many documents in cluster i, the distance is low (so membership becomes higher). Conversely, if j is unrelated to that group, the distance remains higher.

uij = 1 / Σ_{l=1..k} ( disti(Dj) / distl(Dj) )^(2 / (m − 1)), with disti(Dj) = 1 / Savg(j, i)
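A sketch of one membership-update pass using this score-based distance is given below; the cut-off for deciding which documents count as "strongly" in a cluster (strong = 0.5) and the epsilon guard are our own assumptions.

```python
# Sketch: one fuzzy update pass with the modified distance disti(Dj) = 1 / Savg(j, i).
# U is k x N, S is the normalised score matrix, m is the fuzziness parameter.
import numpy as np

def update_memberships(U, S, m=2.0, strong=0.5, eps=1e-9):
    k, n = U.shape
    new_U = np.zeros_like(U)
    for j in range(n):
        dists = np.empty(k)
        for i in range(k):
            members = np.where(U[i] >= strong)[0]             # documents strongly in cluster i
            s_avg = S[j, members].mean() if members.size else eps
            dists[i] = 1.0 / max(s_avg, eps)
        for i in range(k):
            # standard FCM update with fuzziness exponent 2 / (m - 1)
            new_U[i, j] = 1.0 / np.sum((dists[i] / dists) ** (2.0 / (m - 1.0)))
    return new_U
```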

After a number of iterations, the algorithm converges to a stable set of membership values and cluster centroids (or reaches a maximum iteration count). The output is a fuzzy partition of the document set. We can interpret this result in two ways: (a) by assigning each document to the cluster where it has the highest membership (yielding a crisp clustering if needed), or (b) by retaining the membership degrees, which is often more informative.

The fuzzy memberships allow an investigator to see, for example, that a document is 70% related to Cluster A and 30% related to Cluster B, which might indicate it is a bridge between topics or contains content relevant to both (for instance, an email discussing two separate matters). This is a significant advantage over traditional hard clustering when dealing with text data.

To illustrate the clustering outcome: suppose we processed a collection of web pages from a corporate intranet. Our system might output clusters such as “Company Policies and Guidelines” (cluster of documents with title words like “Policy”, containing dates and references to departments), “Project X Reports” (cluster sharing a project name in their content, with technical terms overlapping), “Employee Communications” (cluster of emails or memos identified by names and informal language), etc. Each cluster would be characterised by the features that were common among its documents (we could extract top nouns or sample a representative title from each cluster centroid). And if some documents were multi-topic – say an email that discusses a policy update and a project status – the fuzzy result might show it belongs partly to both the policy cluster and the project cluster.

Implementation and efficiency considerations

The system as described involves several computational components, but each is designed with efficiency in mind. Preprocessing and feature extraction are linear in the size of the text and can be optimised using streaming and hashing techniques (similar to Forman’s approach [4]) if needed. FP-Growth for frequent patterns is highly efficient for text mining when the minimum support is set reasonably, and it compresses the data to find global patterns. Constructing the score matrix has worst-case O(N²) complexity for N documents, which is feasible for moderate N (and if N is very large, one could switch to clustering algorithms that do not require full pairwise comparisons, or use approximate nearest-neighbor methods to sparsify the matrix). The fuzzy clustering (FCM) algorithm typically converges in a few tens of iterations and each iteration is O(N·k·f) where f is the feature dimension (number of features per document) – in our case f is much smaller than the total vocabulary due to feature selection. Thus, our approach can scale to thousands or tens of thousands of documents on a single machine, and further scaling strategies can be employed for bigger data (like parallelising the score computation or using mini-batch clustering).

Importantly, by using the FP-Growth output to initialise clusters, we reduce the number of iterations needed and improve the quality of the final solution. Traditional FCM would potentially converge to a poor local minimum if initialised randomly, or would require trying multiple values of k (cluster count) to find a suitable partition. Our method essentially discovers an approximate k automatically (through pattern mining) and starts from a sensible point. This addresses one of the serious drawbacks of classical clustering algorithms: the necessity to know the number of clusters beforehand or to rely on trial-and-error or costly cluster validity indices. While our cluster count is determined by a support threshold (which the user can choose based on how granular they want clusters to be), this is more intuitive to set (e.g., “find groups of documents that share at least 3 uncommon terms”) than guessing the exact number of clusters.

Conclusion

We have presented a comprehensive approach for clustering web documents by using a weighted score matrix of extracted features in conjunction with fuzzy logic techniques. This method effectively converts unstructured textual data into structured clusters, which is highly valuable in domains like digital forensics and information retrieval where analysts deal with overwhelming amounts of data. By integrating frequent pattern mining with fuzzy clustering, our system addresses several challenges that earlier algorithms faced. First, it alleviates the need to predefine the number of clusters or rely on random initialisation – the use of FP-Growth to find initial cluster structures makes the clustering process more data-driven and robust. Second, through careful feature extraction (identifying title keywords, numeric data, proper nouns, and significant terms), we ensure that clustering decisions are based on the most meaningful aspects of the documents, thereby improving accuracy and interpretability of clusters. Third, the incorporation of fuzzy logic means our clustering results acknowledge the often-gradual boundaries between topics: documents can be members of multiple clusters to varying degrees, which mirrors real-world content relationships much better than a hard partitioning.

Our experimental observations indicate that the weighted score matrix approach yields clusters that align well with human intuition. For instance, documents that were known to be related (say, multiple reports about the same incident) ended up with high similarity scores and were placed in the same cluster, while outlier documents (with very unique content) were identified as separate clusters or had low membership across clusters. Furthermore, the system’s design allows for explainability: one can trace why two documents were clustered together by looking at their shared features (e.g., a common keyword or an overlapping entity), which is crucial in forensic analysis for presenting evidence. This transparency is a benefit of using an explicit feature-weighting scheme as opposed to treating documents as opaque vectors in a high-dimensional space.

In updating this approach to the current state of the art, we recognise that there are even more advanced tools available now. For example, modern language models and embeddings (such as BERT-based sentence vectors) can capture semantic similarities between documents beyond exact word matches, and these could be integrated as additional features or used to enhance the score matrix. One could envision a system where each document’s embedding provides a baseline similarity, and our fuzzy logic then refines it with specific feature matches (like names or numbers) that are especially relevant in forensic scenarios. Additionally, techniques from topological data analysis could be employed to understand the shape of the document similarity space and perhaps detect clusters at multiple scales or detect anomalies. Our current method lays a solid foundation using well-established algorithms (FP-Growth, FCM) and simple linguistic preprocessing, but it is open to incorporating such improvements.

In conclusion, the approach of discovering web document clusters using a weighted score matrix and fuzzy logic has proven to be effective in handling large, unstructured text datasets. It combines the speed and determinism of pattern mining with the flexibility of fuzzy clustering. This synergy results in clusters that are accurate, meaningful, and adaptable. We believe this method can serve as a practical tool for investigators and researchers dealing with big text data. Future work can extend this framework by introducing more sophisticated NLP for feature extraction (such as using conditional random fields to extract entity relationships, or employing clustering validation indices to automatically fine-tune parameters), and by testing the system on varied corpora (e.g., social media posts, open web crawl data) to further demonstrate its utility. As data volumes continue to grow, such intelligent clustering techniques will be indispensable for turning raw text into organized knowledge.

References

[1] S. R. Selamat, S. Sahib, N. H. Hassan, and R. Yusof, "A Forensic Traceability Index in Digital Forensic Investigation," Journal of Information Security, vol. 4, no. 1, pp. 19–32, 2013.

[2] M. L. Brocardo, I. Traore, S. Saad, and I. Woungang, "Authorship verification for short messages using stylometry," in IEEE International Conference on Computer, Information and Telecommunication Systems, 2013.

[3] S. Revathi and T. Nalini, "Performance comparison of various clustering algorithms," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 6, pp. 522–528, 2013.

[4] G. Forman and E. Kirshenbaum, "Extremely fast text feature extraction for classification and indexing," in Proceedings of the 17th ACM CIKM, pp. 1221–1230, 2008.

[5] N. S. Giridhar, K. V. Prema, and N. V. Subba Reddy, "A prospective study of stemming algorithms for web text mining," Ganpat University Journal of Engineering & Technology, vol. 1, pp. 28–34, 2011.

[6] R. K. Roul and S. K. Sahay, "An effective web document clustering for information retrieval," arXiv:1211.1107 [cs.IR], 2012.

[7] D. Cavaliere, S. Senatore, and V. Loia, "Context-aware profiling of concepts from a semantic topological space," Knowledge-Based Systems, vol. 130, pp. 102–115, 2017.

[8] P. R. Raut and N. Khochare, "Web Document Clustering System Using Fuzzy Logic and Feature Extraction," International Research Journal of Engineering and Technology, vol. 3, no. 6, pp. 2058–2061, 2016.

[9] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
