Summary:
- Jaccard similarity measures textual plagiarism by comparing sets of k-grams or shingles between documents.
- Effective at detecting exact copying, simple, scalable, language-independent, but struggles with paraphrasing.
- Often combined with Winnowing fingerprinting for efficient, large-scale detection.
- Accuracy depends heavily on chosen k-gram size and window parameters.
Plagiarism detection – identifying copied or closely imitated text between documents – is a critical task in academic and technical fields. Various techniques have been developed to tackle this problem. One widely used method is based on the Jaccard similarity coefficient – a simple yet powerful set-based approach. In this method, documents are represented as sets of features – typically unique words or contiguous word sequences (shingles). The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the two sets. This ratio quantifies the degree of overlap between the two sets. Jaccard operates purely on set overlap, making it inherently language-independent. It can be applied to texts in any language after appropriate tokenisation. This has made it especially appealing for large-scale and multilingual plagiarism detection systems. The approach is widely used in conjunction with document fingerprinting algorithms (such as the Winnowing algorithm) to efficiently detect text reuse. In the following sections, we explain how Jaccard similarity is applied to plagiarism detection and examine its technical underpinnings. We then evaluate the method’s effectiveness, discussing its strengths and weaknesses in detail.
Set-based similarity detection with Jaccard coefficient
The Jaccard similarity coefficient $J(A,B)$ for two sets of textual features $A$ and $B$ is given by:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|},$$
where $|A \cap B|$ is the number of elements common to both documents (the intersection) and $|A \cup B|$ is the total number of unique elements across both documents (the union). A higher Jaccard value indicates a greater proportion of shared features. For example, a Jaccard similarity of 0.20 (20%) indicates that 20% of the combined unique shingles appear in both documents. This interpretability makes it easy for instructors or researchers to set practical thresholds. For example, one might flag any document pair with a Jaccard score above a chosen percentage as potentially plagiarised. It also allows a quick understanding of the extent of overlap between two documents.
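As a minimal illustration (assuming simple lowercasing and whitespace tokenisation; all names here are illustrative, not taken from any cited system), the coefficient can be computed directly on word sets in Python:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two feature sets."""
    if not a and not b:
        return 0.0  # convention for two empty feature sets
    return len(a & b) / len(a | b)

def word_set(text: str) -> set:
    """Naive feature extraction: lowercase, split on whitespace, keep unique words."""
    return set(text.lower().split())

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over a sleepy dog"
score = jaccard(word_set(doc_a), word_set(doc_b))
print(f"Jaccard similarity: {score:.2f}")  # flag the pair if score exceeds a chosen threshold
```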
Shingling and k-gram features
However, using individual words alone can be too naive – common words may inflate similarity, and word order or context is ignored. Instead, a more robust approach is to use shingling. This involves breaking each document into overlapping sequences of k words (called k-grams or shingles). Each distinct k-gram is then treated as an element of the set. Comparing sets of k-grams (rather than individual words) allows the Jaccard metric to capture longer matching phrases and in-order text overlaps. This property is crucial for detecting passages copied verbatim.
To illustrate, suppose Document A and Document B are converted into sets of 5-word shingles. If Document B contains a paragraph copied verbatim from Document A, many of those 5-word shingles will appear in both sets. These matching shingles contribute to a large intersection. Conversely, if two documents share little or no textual content, their shingle sets will intersect on very few elements (perhaps only common short words). In that case, the Jaccard value will be near 0. This indicates only minimal similarity.
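A minimal sketch of this shingling step (whitespace tokenisation assumed; the two short texts below are invented for illustration):

```python
def shingles(text: str, k: int = 5) -> set:
    """Overlapping k-word shingles, each represented as a tuple of tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

# A passage copied verbatim from the source produces many shared 5-word shingles.
source = "set based methods reduce textual similarity to a simple comparison of shingle sets"
suspect = "as noted set based methods reduce textual similarity to a simple comparison of shingle sets today"
a, b = shingles(source), shingles(suspect)
print(len(a & b), "shared shingles; Jaccard =", round(len(a & b) / len(a | b), 2))
```

Because the copied passage survives intact in the suspect text, every one of its 5-word shingles appears in both sets, so the intersection stays large despite the added surrounding words.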
This clear interpretation carries over from word sets to shingle sets: the score still directly reflects the fraction of shared features between two texts, so the same thresholding practice applies.
Document fingerprinting and Winnowing
Applying Jaccard similarity naively to entire sets of all k-grams can be computationally expensive for long documents and large databases. Plagiarism detection systems therefore often employ document fingerprinting techniques to condense the set representation. One popular fingerprinting method is the Winnowing algorithm, which was designed for efficient local text similarity detection.
Winnowing fingerprint algorithm
Winnowing works by hashing all k-grams in a document. It then selects a subset of these hash values (the “fingerprint”) according to a sliding window strategy. This drastically reduces the storage and comparison costs while retaining the most informative substrings. After fingerprinting, each document is represented by a set of fingerprints. Plagiarism detection then proceeds by computing the Jaccard similarity between the fingerprint sets of a suspicious document and a source document. If the Jaccard score exceeds a chosen threshold, the documents are deemed significantly similar (and potentially plagiarised).
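The following is a simplified sketch of the winnowing idea, not a faithful reimplementation of any cited system: production implementations use a rolling hash (e.g. Karp–Rabin) over character k-grams for speed, whereas the md5-based hashing here is purely illustrative.

```python
import hashlib

def kgram_hashes(text: str, k: int = 5) -> list:
    """Hash every overlapping character k-gram (whitespace stripped first)."""
    s = "".join(text.lower().split())
    return [int(hashlib.md5(s[i:i + k].encode()).hexdigest(), 16) % (1 << 32)
            for i in range(len(s) - k + 1)]

def winnow(hashes: list, w: int = 4) -> set:
    """Fingerprint: the minimum hash from every window of w consecutive hashes."""
    if len(hashes) <= w:
        return set(hashes)  # document too short to winnow; keep all hashes
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def fingerprint_similarity(doc1: str, doc2: str, k: int = 5, w: int = 4) -> float:
    """Jaccard similarity of the two winnowed fingerprint sets."""
    f1, f2 = winnow(kgram_hashes(doc1, k), w), winnow(kgram_hashes(doc2, k), w)
    return len(f1 & f2) / len(f1 | f2) if f1 | f2 else 0.0
```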
A 2020 study by Puspaningrum et al. used k-gram based winnowing fingerprints in conjunction with the Jaccard coefficient to measure text similarity. Their experiments confirmed that tuning the fingerprint parameters can markedly affect detection results.
Tuning fingerprint parameters
Notably, they found that using smaller k-gram lengths yields higher similarity percentages for the same plagiarised text segment. A smaller k makes shingles more fine-grained, increasing the chance that copied content will produce matching shingles in both documents and thus boosting the Jaccard overlap. However, extremely small k-grams (e.g. 1 or 2 words) are generally avoided because they may match innocuous common phrases and generate false positives. In practice, moderate shingle sizes (such as 5–10 words) are chosen to balance sensitivity and specificity. The winnowing fingerprint window size is similarly tuned to capture contiguous runs of plagiarised text while filtering out noise.
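The effect of shingle size can be seen in a toy experiment (the two sentences below are invented, so the exact numbers are illustrative only; real corpora will differ):

```python
def shingle_set(text: str, k: int) -> set:
    """Overlapping k-word shingles of a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}

source = "plagiarism detection systems often rely on overlapping shingles to find reused text"
suspect = "many plagiarism detection systems often rely on overlapping shingles when scanning new text"

for k in (2, 3, 5, 8):
    a, b = shingle_set(source, k), shingle_set(suspect, k)
    print(f"k={k}: Jaccard = {len(a & b) / len(a | b):.2f}")  # smaller k, higher score
```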
By using fingerprinting, Jaccard-based detectors achieve both scalability and speed. Instead of comparing every possible shingle between every pair of documents, the system compares relatively small fingerprint sets. This approach scales well even to very large repositories. The Jaccard calculation itself is straightforward (essentially counting intersecting hashes). It can also be accelerated with efficient data structures or parallel processing. As a result, Jaccard similarity is routinely employed in large-scale plagiarism platforms that must compare thousands of documents across different languages.
Strengths of the Jaccard similarity approach
The Jaccard set-based method offers several compelling strengths for plagiarism detection:
Simplicity and interpretability
The metric is conceptually simple – it measures the percentage of shared unique features between two documents. Results are easy to interpret and explain. Unlike more opaque machine learning models, the Jaccard score directly indicates shared content. This transparency appeals to educators and researchers who need clear evidence of copying.
Language independence
Jaccard relies on token matching rather than linguistic analysis, so it can be applied to text in any language. The method does not require language-specific resources or semantic understanding. It can even be applied to source code or other sequential data by treating them as tokens. This highlights the method’s flexibility and makes Jaccard-based detectors naturally suited to multilingual plagiarism detection.
Robust detection of exact overlaps
Jaccard excels at catching verbatim copy-paste plagiarism. When a section of text is copied exactly or with only minor superficial changes, the overlapping shingles cause a high Jaccard score. In one comparative evaluation, Jaccard’s performance on detecting copy-paste plagiarism was superior to that of other metrics. It yielded the highest accuracy when identical passages were present. In practice, this translates to very few false negatives for straightforward plagiarism. If a student or author copies whole sentences or paragraphs, the Jaccard coefficient will almost certainly flag the documents as similar.
High precision (low false-positive rate)
Since the Jaccard index only counts explicit overlaps in content, it tends to be very precise in what it flags. Random or incidental resemblances between texts usually result in low Jaccard values. This is especially true when sufficiently large shingles are used and common stopwords have been removed. Thus, when a high Jaccard similarity is observed between two documents, one can be reasonably confident that there is a substantial shared text segment. This precision is one reason Jaccard-based methods are trusted in plagiarism detection workflows. A flagged high similarity almost always warrants a closer human inspection. Moreover, the strictness of the measure (requiring exact matching shingles) further reduces false alarms. It very rarely misidentifies two topically related but independently written texts as plagiarised, as long as they don’t actually share identical phrasing.
Scalability and efficiency
The operations needed to compute Jaccard similarity (set intersection and union) are efficient. They can also be optimised using hashing and sorting. The method is easily distributed across multiple processors or machines, and approximate techniques like MinHash can be used to further speed up similarity estimation for very large collections. Such scalability is crucial for modern plagiarism detection systems. They may need to compare each submission against tens of millions of source documents. Major plagiarism detection services (such as Turnitin) benefit from the efficiency of the Jaccard-fingerprinting approach.
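As a hedged sketch of the MinHash idea (salted md5 hashes stand in for the random hash functions of a real implementation, and 128 hashes is an arbitrary choice):

```python
import hashlib

def minhash_signature(features: set, num_hashes: int = 128) -> list:
    """For each salted hash function, keep the minimum hash over all features."""
    # assumes a non-empty feature set
    return [min(int(hashlib.md5(f"{seed}:{feat}".encode()).hexdigest(), 16)
                for feat in features)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of positions where two signatures agree estimates Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Because the probability that two sets share the same minimum under a random hash function equals their Jaccard similarity, comparing short fixed-length signatures can replace comparing full shingle sets.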
Weaknesses and limitations
While Jaccard similarity is a powerful tool, it has notable weaknesses when used alone for plagiarism detection:
Insensitive to paraphrasing
The biggest drawback of an exact matching approach is its difficulty in detecting obfuscated plagiarism, such as paraphrasing. If a plagiarist paraphrases the source text – changing words to synonyms, altering sentence structure, or inserting extra words – the overlap in exact k-gram shingles drops dramatically. Jaccard will then output a low similarity despite the derivative nature of the content. For example, two sentences expressing the same idea in different wording may share few or no identical 5-word sequences. As a result, Jaccard-based detection has low recall for cleverly disguised plagiarism. Empirical studies confirm this limitation. For example, when tested on paraphrased plagiarism cases, Jaccard and other purely syntactic similarity measures performed poorly compared to semantic-based techniques. In practical terms, a determined plagiarist can often evade Jaccard-based detection by sufficiently rewording the copied material. Using synonyms for many words or changing the sentence order of a passage can drastically reduce the measurable overlap.
Vulnerability to document length differences
The Jaccard metric considers the size of the union of features, which means the relative length of documents can influence the similarity score. If a short document is entirely contained within a much longer document, there is substantial shared text; however, the union is dominated by the longer document’s extra content. This yields only a moderate or low Jaccard score. For instance, copying a single paragraph from a very large source might produce a Jaccard score of only about 0.1 (10%). The source’s remaining content dilutes the overlap fraction. In such cases, Jaccard may underestimate the significance of the match. Plagiarists can even exploit this effect by padding a stolen passage with lots of original text to decrease the overall overlap ratio. Thus, interpreting Jaccard requires some care. A low overall similarity does not always mean there is no plagiarism. It might simply mean that a small portion of one document was copied into a much larger one.
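One common mitigation (a standard alternative measure, not something proposed in the cited studies) is to report the containment or overlap coefficient alongside Jaccard; it normalises by the smaller set, so a short document fully contained in a long one scores near 1.0:

```python
def containment(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|).
    Unlike Jaccard, it is not diluted by the longer document's extra content."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))
```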
Does not account for semantic similarity or context
Jaccard operates purely on literal token matches and ignores meaning or context. It cannot detect cases where the same idea is expressed using different words. It also fails to capture structural or narrative similarities beyond shared shingles. Two documents could have identical structure or argument flow (indicating plagiarism of ideas), yet Jaccard would not notice without literal phrase overlap. Similarly, if one document is simply a translated version of another in a different language, a direct Jaccard comparison (without translation) would yield zero similarity. This narrow focus means that Jaccard should be complemented by other analysis methods to catch such cases.
Choice of k-gram size is critical
The effectiveness of Jaccard detection depends on selecting appropriate shingle length and fingerprinting parameters. Too large a k-gram might miss short copied fragments, reducing recall. On the other hand, too small a k increases the chance of coincidental matches and thus reduces precision. For example, using k = 3 (trigrams) on English text might yield high similarity scores even for unrelated documents. Very common short word sequences like “one of the” or “as well as” appear in most texts and inflate the intersection size. To mitigate this, detection systems typically remove stop words and use a moderate k (or otherwise down-weight extremely common shingles). Nevertheless, Jaccard by itself does not incorporate any weighting. The raw measure can thus be skewed if parameters are not well-chosen or if the corpus contains a lot of boilerplate text.
Potentially high computational cost for exhaustive comparisons
Although Jaccard is simpler than many algorithms, a naive implementation that compares every document pair (with large shingle sets) could be slow. In the worst case, the method scales quadratically with the number of documents if every pair must be checked. In practice, this concern is alleviated by indexing fingerprints and narrowing down candidate pairs using heuristics. Modern systems use efficient data structures (e.g. hashed indices for shingles) and parallel processing to handle large corpora. This makes Jaccard-based comparison feasible even at web scale. Nonetheless, the computational aspect remains a consideration in system design.
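A sketch of the usual candidate-narrowing step (the structure and thresholds here are illustrative assumptions): an inverted index from fingerprint hash to document id ensures that only documents sharing hashes with the query are ever scored in full.

```python
from collections import Counter, defaultdict

def build_index(corpus: dict) -> dict:
    """corpus maps doc_id -> fingerprint set; the index maps hash -> doc ids."""
    index = defaultdict(set)
    for doc_id, fingerprint in corpus.items():
        for h in fingerprint:
            index[h].add(doc_id)
    return index

def candidates(query_fp: set, index: dict, min_shared: int = 3) -> set:
    """Only documents sharing at least min_shared hashes get a full Jaccard score."""
    counts = Counter(d for h in query_fp for d in index.get(h, ()))
    return {doc_id for doc_id, n in counts.items() if n >= min_shared}
```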
Enhancements and complementary approaches
To overcome some of Jaccard’s limitations, researchers often combine it with other methods.
Combining Jaccard with other models
One successful strategy is to pair the high-precision Jaccard approach with a complementary high-recall approach that can catch rephrased content. A notable example is the combination of Jaccard with the Vector Space Model (VSM) based on TF–IDF vectors. The VSM represents documents in a continuous vector space of word frequencies and measures similarity via cosine similarity. It tends to be more sensitive to overall topical similarity, and it can detect when two texts share many uncommon words even if not in the same order or form. However, VSM alone can produce false positives (for example, flagging documents on the same subject that aren’t actually copied). Wang et al. (2013) illustrate the benefit of combining these approaches. They noted that Jaccard offers high precision while the VSM provides high recall. By integrating the two methods, they achieved better overall detection performance. In practical terms, a plagiarism detection workflow might first use Jaccard-based fingerprint matching to quickly and precisely identify obvious cases of copying. Then a secondary analysis using a vector-space or semantic similarity measure can inspect document pairs that Jaccard deemed dissimilar. This step helps catch more subtle cases of plagiarism, such as paraphrasing or structural similarity. Indeed, many modern systems adopt such a multi-stage strategy. An initial fingerprinting stage (using Jaccard or a similar algorithm) flags exact overlaps, followed by a detailed text-alignment stage that can detect rephrased or otherwise transformed plagiarism.
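A minimal sketch of such a two-stage workflow, with raw term-frequency cosine standing in for the TF–IDF vector space model and both thresholds chosen arbitrarily for illustration:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity over term-frequency vectors (a stand-in for TF-IDF)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def two_stage_check(doc_a: str, doc_b: str) -> str:
    wa, wb = doc_a.lower().split(), doc_b.lower().split()
    jacc = len(set(wa) & set(wb)) / len(set(wa) | set(wb))
    if jacc >= 0.2:                                   # stage 1: high-precision literal overlap
        return "flagged by Jaccard stage"
    if cosine(Counter(wa), Counter(wb)) >= 0.6:       # stage 2: high-recall topical check
        return "sent to detailed review (possible paraphrase)"
    return "passed"
```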
Alternative similarity coefficients
Another avenue to enhance recall is to experiment with alternative set similarity measures. The Jaccard coefficient is one of several ways to quantify set overlap. Other metrics like the Dice (Sørensen–Dice) coefficient or the Overlap coefficient can be used similarly. These measures weigh overlaps slightly differently. For example, the Dice coefficient is defined as $D(A,B) = \frac{2|A \cap B|}{|A| + |B|}$.

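Since $D = \frac{2J}{1+J}$, Dice is a monotone transform of Jaccard that always reports a value at least as high for the same pair, which helps explain the higher Dice scores in the comparative results below. A one-line sketch:

```python
def dice(a: set, b: set) -> float:
    """Sørensen–Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```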
In some cases, Dice can yield higher similarity scores than Jaccard for the same pair of documents. A comparative study by Purwaningrum et al. (2021) found that the Dice coefficient produced a higher similarity score than Jaccard on the same document pairs. In their tests, Dice-based similarity averaged about 71%, whereas Jaccard averaged only ~36% for the identical plagiarised pairs. A higher similarity number is not inherently “better”, since one can always adjust the threshold accordingly. Nonetheless, these differences suggest that different similarity coefficients may be more suitable depending on the application. Some plagiarism detection frameworks even allow switching or comparing multiple similarity indices to see which highlights potential matches best. For example, one study found that a winnowing fingerprint approach coupled with a set overlap metric (Dice) outperformed a traditional TF–IDF cosine similarity method for document comparison. This result underscores the advantages of set-based metrics in identifying textual duplicates.
Conclusion
The Jaccard similarity method has proven to be a robust and interpretable technique for plagiarism detection, especially in identifying exact or near-exact text overlaps. By representing documents as sets of tokens or shingles, it reduces the problem of textual similarity to a set comparison and yields a clear quantitative measure of overlap. This set-based approach shines in cases of direct copying: it is precise and unlikely to flag false positives when substantial text is truly shared. Its simplicity allows it to be scaled up to very large document collections and applied across languages with minimal adjustments.
However, Jaccard similarity is not a silver bullet. Its reliance on literal matches means that it struggles with paraphrased or otherwise obfuscated plagiarism. In such cases, additional strategies – from combining Jaccard with vector-space models to employing semantic analysis or other NLP techniques – become necessary to ensure plagiarised content does not slip through undetected. Effective plagiarism detection systems often blend multiple techniques to leverage the strengths of each approach. Jaccard provides a solid foundation for catching blatant overlaps and establishing clear evidence of copying, while complementary methods broaden the net to capture more nuanced similarities.
In summary, the Jaccard set-based approach remains a cornerstone of plagiarism detection. This is due to its clarity, efficiency, and accuracy in detecting exact textual overlaps. It offers a strong first line of defence against plagiarism. When carefully tuned and augmented with other techniques, it contributes to a comprehensive solution for maintaining originality and integrity in written work.
References
- Puspaningrum, E. Y., Nugroho, B., Setiawan, A., & Hariyanti, N. (2020). Detection of Text Similarity for Indication Plagiarism Using Winnowing Algorithm Based K-gram and Jaccard Coefficient. Journal of Physics: Conference Series, 1569, 022044. DOI: 10.1088/1742-6596/1569/2/022044. (In this study, the authors combined the winnowing fingerprinting algorithm with Jaccard similarity and showed that decreasing the k-gram size increases the reported similarity percentage.)
- Purwaningrum, S., Susanto, A., & Prasetya, N. W. A. (2021). Comparation of Dice Similarity and Jaccard Coefficient Against Winnowing Algorithm for Similarity Detection of Indonesian Text Documents. Journal of Applied Intelligent System, 6(1), 10–22. Available at: https://publikasi.dinus.ac.id/index.php/jais/article/view/4453 (This paper compares set-based similarity measures and finds that the Dice coefficient can yield higher similarity scores than Jaccard in a fingerprinting context, with Dice averaging ~71% vs Jaccard ~36% on the same plagiarised document pairs.)
- Wang, S., Qi, H., Kong, L., & Du, C. (2013). Combination of VSM and Jaccard Coefficient for External Plagiarism Detection. In Proceedings of the 2013 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 3, pp. 1274–1279. DOI: 10.1109/ICMLC.2013.6890902. (Proposes a hybrid model that integrates a vector space model with the Jaccard measure. The authors report that Jaccard offers higher precision while the VSM offers higher recall, and combining them improves overall detection performance.)
- Zouhir, A., El Ayachi, R., & Biniz, M. (2021). A Comparative Plagiarism Detection System methods between sentences. Journal of Physics: Conference Series, 1743, 012041. DOI: 10.1088/1742-6596/1743/1/012041. (A survey comparing plagiarism detection techniques. It notes that for straightforward copy-paste plagiarism, simple syntactic similarity measures like the Jaccard coefficient gave the best results, whereas they underperformed on paraphrased plagiarism where semantic measures were more effective.)
- Khuat, T. T., & Nguyen, D. H. (2015). A Comparison of Algorithms Used to Measure the Similarity Between Two Documents. In Proceedings of the 2015 International Conference on Computational Science and Engineering. (The authors compare string-based and vector-based similarity methods, including a winnowing fingerprint with Dice coefficient versus a TF–IDF cosine similarity. They found that a fingerprinting method outperformed the cosine approach in detecting document overlaps, underscoring the advantages of set-based overlap metrics.)