Summary:
- Traditional code similarity checkers use tokenisation, abstract syntax trees (ASTs), and fingerprinting to identify copied source code.
- Advanced techniques such as machine learning and behavioural analysis significantly improve accuracy and detection robustness, overcoming limitations of traditional methods.
- Despite improvements, challenges remain for code similarity checkers, including code obfuscation, cross-language plagiarism detection, computational scalability, and creating user-friendly detection tools.
Source code plagiarism – the unacknowledged reuse of someone else’s code – is a persistent issue in both education and the software industry. In academia, instructors face growing class sizes and increasingly diverse programming assignments, making manual plagiarism checking impractical. Students who plagiarise often disguise copied code using various obfuscation techniques, such as renaming variables, altering comments or formatting, or reorganising program logic, so detecting copied code is far from trivial. Unlike natural-language plagiarism, source code plagiarism demands code similarity checkers that account for syntax and semantics across different programming languages.
Therefore, specialised plagiarism detection methods have been developed for code similarity checkers to compare programs and identify undue similarities beyond simple text matches. This article provides a detailed review of current methods for source code plagiarism detection, discussing traditional approaches, advanced techniques, and ongoing challenges. It focuses on how these methods work, their effectiveness (including where they succeed or fail), and emerging research directions.
Challenges in detecting plagiarised code
Detecting plagiarised code is challenging because copied programs can be modified in many superficial ways without changing their core functionality. Obfuscation tactics range from simple renaming of identifiers and reformatting to reordering code blocks or replacing control structures with equivalents (for example, a loop turned into a recursive function). Such transformations often defeat naïve text matching. For example, a widely used tool like MOSS (Measure of Software Similarity) can report similarity as low as 40% on code that has clearly been plagiarised and then obfuscated, misled by the cosmetic changes. This demonstrates that purely textual or line-by-line comparisons are insufficient, and more robust strategies are required for code similarity checkers.
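To make this concrete, the two short Python snippets below (hypothetical examples, not taken from any study) compute the same result, yet share almost no text that a line-by-line comparison could match:

```python
# "Original" submission: iterative sum
def total(values):
    s = 0
    for v in values:          # accumulate the running total
        s += v
    return s


# "Plagiarised" copy: identifiers renamed, comments rewritten,
# and the loop replaced by an equivalent recursive formulation
def accumulate(items):
    """Add up a list of numbers."""
    if not items:
        return 0
    return items[0] + accumulate(items[1:])
```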
Another challenge is the multitude of programming languages and paradigms – a detection technique may need to be language-agnostic or support multiple languages to be broadly useful. Efficiency and scalability are also concerns: in a large programming class or a big code repository, pairwise comparison of every submission or file is computationally expensive. Therefore, methods must strike a balance between thorough analysis and feasible performance. Additionally, legitimate similarities (such as common boilerplate code or use of standard libraries) must be distinguished from suspicious similarities. This introduces a need for significance filtering – identifying which matching code fragments are unlikely to occur independently by chance. As noted by Novak et al. (2019), a systematic solution requires careful consideration of what constitutes plagiarism and which similarities are meaningful. The burden of proof is also high: instructors or auditors must be able not only to detect plagiarism but to convincingly demonstrate it, meaning the detection tools should provide human-interpretable evidence of copying. Because of these challenges, source code plagiarism detection has evolved into a rich area of research, combining insights from software engineering, information retrieval, and machine learning to improve robustness and reliability.
Traditional detection approaches
Early and traditional approaches to code similarity checkers mostly operate on the program text or its lexical structure. They aim to identify textual or structural similarity between programs while tolerating minor differences. Broadly, these methods can be categorised as fingerprinting, string matching, or parse-tree based algorithms. Traditional tools often use a combination of such techniques to balance accuracy and speed.
Lexical fingerprinting and string matching:
One classical strategy is to reduce each source code file to a simplified textual representation (such as a token sequence or a fingerprint set) and then look for overlaps between these representations. For instance, the widely used academic tools MOSS and JPlag both tokenise the code and then apply substring matching algorithms to find common segments. MOSS uses a winnowing fingerprinting algorithm that selects representative substrings (fingerprints) from the token sequence, enabling efficient comparisons across many files. JPlag, on the other hand, employs a greedy string tiling algorithm (specifically, a Running Karp–Rabin Greedy String Tiling, or RKR–GST, approach) to detect maximal matching substrings between two token streams. The RKR–GST technique, originally based on the work of Michael Wise, is effective at finding long contiguous matches even if intervening code has been changed. Tools based on string matching are generally fast and language-independent (since tokenisation can be done for any language), and they handle simple syntax changes or renamings well. Deimos, for example, is a plagiarism detector introduced by Kustanto and Liem (2009) that combines tokenisation with the RKR–GST algorithm, achieving efficient and language-agnostic detection. Such methods can readily identify cases where a student has copied large chunks of code verbatim or with slight edits. However, purely lexical approaches can be thwarted by more complex obfuscations that reorder or restructure code without large contiguous common substrings.
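As a rough illustration of the fingerprinting idea, the sketch below hashes k-grams of a token stream, keeps the minimum hash in each sliding window, and scores two files by the overlap of their fingerprint sets. This is a simplified reading of winnowing (the published algorithm also records positions and breaks ties carefully), not MOSS’s actual implementation; the function names and parameter defaults are arbitrary assumptions.

```python
import hashlib

def kgram_hashes(tokens, k=5):
    """Hash every contiguous k-gram of the token sequence."""
    grams = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return [int(hashlib.md5(g.encode()).hexdigest(), 16) & 0xFFFFFFFF for g in grams]

def winnow(hashes, w=4):
    """Keep the minimum hash from each sliding window of size w (the fingerprint set)."""
    if not hashes:
        return set()
    return {min(hashes[i:i + w]) for i in range(max(1, len(hashes) - w + 1))}

def fingerprint_similarity(tokens_a, tokens_b):
    """Jaccard overlap of the two fingerprint sets: 0.0 (nothing shared) to 1.0 (identical)."""
    fa, fb = winnow(kgram_hashes(tokens_a)), winnow(kgram_hashes(tokens_b))
    return len(fa & fb) / len(fa | fb) if (fa | fb) else 0.0
```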
Structural and syntax-aware methods:
To catch smarter plagiarism attempts, traditional systems also incorporate syntactic structure analysis. One approach is to use the program’s Abstract Syntax Tree (AST) – a tree representation of the code’s parsed structure – and compare these trees for similarity. By using ASTs, the detector can ignore superficial differences like formatting or certain renamings and focus on the shape of the code (the sequence of operations, control flow, etc.). Researchers have demonstrated that AST-based plagiarism detection can effectively reveal structural similarities even when the code has been reorganised or uses different variable names (Mei, 2011). In an AST-based method, two programs with identical underlying parse trees (or subtrees) will be flagged as similar, even if the surface text differs. This approach improves recall for cleverly disguised plagiarism, and experiments have shown it scales to languages such as C, C++ and Java by converting code to language-agnostic parse-tree representations (Mei, 2011). Another structural approach involves comparing program dependence graphs (PDGs), which represent semantic dependencies like data flows and control flows. A notable example is GPLAG by Liu et al. (2006), which detects plagiarism by analysing these dependence graphs. PDG-based detection can catch cases where code structure (in terms of flow of execution) is copied, even if the line-by-line structure is altered. The downside of AST and PDG methods is that they tend to be more computationally expensive than simple text matching, and they often require robust parsing front-ends for each programming language (which can limit language independence). Nonetheless, they marked a significant step in handling non-trivial code modifications.
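To illustrate the idea with Python’s built-in ast module (rather than the multi-language front-ends described above; the normalisation choices here are a deliberately minimal assumption), the sketch below renames all identifiers to placeholders and then compares the dumped tree shapes:

```python
import ast

class Normalise(ast.NodeTransformer):
    """Replace identifier names so that renaming does not affect the comparison."""
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_v", ctx=node.ctx), node)
    def visit_arg(self, node):
        node.arg = "_v"
        return node
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = "_f"
        return node

def ast_shape(source):
    """Dump the normalised parse tree as a comparable string."""
    return ast.dump(Normalise().visit(ast.parse(source)), annotate_fields=False)

a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def acc(values):\n    r = 0\n    for v in values:\n        r += v\n    return r\n"
print(ast_shape(a) == ast_shape(b))   # True: same structure despite different names
```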
Metric-based and attribute counting approaches:
Some traditional techniques compare programs by computing various software metrics or attributes from code (such as counts of specific tokens, depth of nesting, number of loops, etc.) and then measuring similarity in this metric space. By comparing metric vectors, one can detect similarity in the “profile” of two programs. While faster and language-flexible, metric-based approaches alone can be imprecise – different code can coincidentally share similar metric profiles. However, metrics can augment other methods; for example, early work by Parker and Hamblen (1989) explored algorithmic comparisons of code metrics, and more recent studies have included metrics like whitespace patterns or comment text similarity as features to improve detection. Overall, the traditional arsenal of plagiarism detectors combined string-based matching with structural analysis to catch most straightforward cases of copying. Surveys such as the systematic review by Novak et al. (2019) catalogued these tools and noted how they perform under various common obfuscation methods. The consensus has been that no single technique is foolproof – each has strengths and blind spots – so plagiarism detection systems began to integrate multiple analyses and also explore more intelligent approaches, as discussed next.
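Before turning to those approaches, here is a minimal sketch of attribute counting, assuming Python submissions and an arbitrary choice of four attributes: each program is reduced to a small vector of counts, and two programs are compared by the cosine similarity of their vectors. Real systems use far richer metric sets, and such a profile is only circumstantial evidence on its own.

```python
import ast
import math

ATTRIBUTES = ("loops", "conditionals", "calls", "functions")   # arbitrary choice for this sketch

def metric_vector(source):
    """Count a few structural attributes of a Python program."""
    counts = dict.fromkeys(ATTRIBUTES, 0)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.For, ast.While)):
            counts["loops"] += 1
        elif isinstance(node, ast.If):
            counts["conditionals"] += 1
        elif isinstance(node, ast.Call):
            counts["calls"] += 1
        elif isinstance(node, ast.FunctionDef):
            counts["functions"] += 1
    return [counts[a] for a in ATTRIBUTES]

def cosine(u, v):
    """Cosine similarity between two metric vectors (1.0 means identical profiles)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```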
Advanced and modern approaches
In the past decade, researchers have increasingly applied machine learning and more sophisticated analyses to improve the plagiarism detection abilities of code similarity checkers. These modern approaches aim to capture the semantics or deeper patterns of code beyond surface syntax, making detectors more robust to manipulation. We discuss three notable trends: machine learning and deep learning methods, dynamic program behaviour analysis, and hybrid enhanced tools that integrate multiple techniques (often with user-friendly features).
Machine learning and deep learning methods:
Inspired by advances in natural language processing, researchers have turned to machine learning to detect code similarity in a more generalisable way. Instead of manually specifying which textual or structural features to compare, the idea is to let algorithms learn the relevant features from data (i.e. from known cases of similar and dissimilar code). For example, Eppa and Murali (2022) explored a suite of machine learning models – including K-Nearest Neighbours, Support Vector Machines, and deep neural networks (such as recurrent neural networks and even transformer-based models) – for plagiarism detection on C programming assignments. These models were trained on representations of code and were shown to outperform traditional text-matching detectors in accuracy. A key reason for this success is that machine learning models can capture subtle similarities that rule-based methods miss; for instance, a neural network can learn to recognise two code fragments as implementing the same algorithm even if they use different syntax.
In an earlier seminal work, Yasaswi et al. (2018) introduced one of the first deep learning-based systems for code plagiarism detection. They extracted various source code metrics and also trained a character-level recurrent neural network to learn features of code structure. Their system proved robust to code obfuscation, significantly improving detection rates compared to MOSS – for instance, it achieved a recall of 92.4% on a test set, whereas MOSS reached only 63–81% recall across its usual similarity thresholds. This demonstrated that learned features can catch many plagiarised pairs that signature-based tools overlook.
More recently, Mehsen and Joshi (2024) proposed a simpler yet effective machine learning approach: they apply TF–IDF (term frequency–inverse document frequency) vectorisation to source code lines and then use K-means clustering to group similar code submissions (a minimal sketch of this pipeline appears below). This method achieved an impressive 99.2% accuracy in identifying plagiarism cases, significantly outperforming a random forest classifier and the baseline MOSS system on their evaluation dataset. Notably, the authors report that their approach outperformed MOSS whether MOSS was run with a lenient 80% or a strict 90% similarity threshold. The high accuracy is attributed to grouping similar lines of code, which helps isolate shared code segments even if they appear in different positions or contexts.
These studies exemplify a trend: by treating code like data and letting algorithms learn similarity, we can uncover plagiarism that evades rigid matching rules. However, machine learning models require carefully curated training data (including many examples of plagiarised vs. non-plagiarised code) and can be seen as black boxes, so explaining their decisions can be difficult. Nonetheless, as coding education datasets grow and code embedding techniques improve, we are likely to see even greater use of AI in plagiarism detection.
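The clustering idea can be sketched with scikit-learn in a few lines. This is a simplified pipeline in the spirit of the TF–IDF-plus-K-means approach described above, not the authors’ implementation; the parameters and the final heuristic (flagging clusters shared by exactly two submissions) are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def suspicious_clusters(submissions, n_clusters=50):
    """Cluster individual source lines across all submissions.

    submissions: dict mapping a submission id to its source code (a string).
    Returns the sets of submissions behind clusters shared by exactly two
    submissions, i.e. candidate plagiarised fragments worth reviewing.
    """
    lines, owners = [], []
    for sid, code in submissions.items():
        for line in code.splitlines():
            if line.strip():                       # ignore blank lines
                lines.append(line.strip())
                owners.append(sid)

    vectors = TfidfVectorizer(token_pattern=r"\S+").fit_transform(lines)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)

    members = {}
    for label, sid in zip(labels, owners):
        members.setdefault(label, set()).add(sid)
    # Crude heuristic: a cluster whose lines come from exactly two submissions.
    return [sids for sids in members.values() if len(sids) == 2]
```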
Dynamic and behavioural analysis:
A fundamentally different approach to plagiarism detection is to examine how the code behaves rather than how it is written. The rationale is that two programs solving the same task in a plagiaristic manner will exhibit similar runtime behaviour (for example, similar outputs or state changes for given inputs), even if their code structure has been substantially refactored. BPlag, introduced by Cheers, Lin and Smith (2021), is a leading example of behavioural plagiarism detection. Instead of relying on static code structure, BPlag uses symbolic execution to explore the program’s behaviour: it runs the code conceptually to capture its execution paths and outputs in a symbolic form. The result is a graph-based representation of program behaviour for each submission. Plagiarism is then detected by comparing these behaviour graphs – if two programs yield highly similar graphs, they likely implement the same logic in a similar way. Because behaviour is much harder to disguise than syntax (any two correct solutions to a well-defined problem must ultimately exhibit the same functional behaviour, regardless of coding style), BPlag is extremely robust to plagiarism-hiding transformations.
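BPlag’s behaviour graphs are well beyond a short snippet, but the underlying intuition, that behaviour is harder to disguise than syntax, can be shown with a far cruder dynamic check: run two submissions on the same inputs and compare what they produce. The function below is a hypothetical illustration of that intuition only, not BPlag’s symbolic-execution approach; matching outputs alone are weak evidence, since any two correct solutions agree on them.

```python
def same_observable_behaviour(func_a, func_b, test_inputs):
    """Return True if two callables produce identical results on every test input."""
    for args in test_inputs:
        try:
            if func_a(*args) != func_b(*args):
                return False
        except Exception:
            return False          # treat any exception as a mismatch in this sketch
    return True

# Hypothetical usage, assuming each submission exposes a solve() function:
# same_observable_behaviour(submission_a.solve, submission_b.solve, [(n,) for n in range(100)])
```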
Empirical evaluations showed that BPlag outperformed five popular code plagiarism tools in detection accuracy and robustness, correctly flagging copied code even in cases where others failed. The trade-off, however, is efficiency: analysing program behaviour via symbolic execution and graph matching is computationally intensive. Cheers et al. (2021) reported that while BPlag was more accurate, it was also slower (less efficient) than traditional tools. Moreover, dynamic approaches require the code to be executable (or at least symbolically executable) and correct with respect to some specification, which might not hold for all student submissions. Despite these caveats, behavioural analysis adds a powerful dimension to plagiarism detection, complementing static analysis. It can be especially useful as a secondary check for suspicious cases where static similarity is borderline – confirming plagiarism by showing that two pieces of code do the same thing in the same way. We expect future systems to integrate behavioural metrics alongside static code similarity for a more comprehensive assessment.
Hybrid systems and enhanced tooling:
Modern code similarity checkers increasingly combine multiple techniques and place emphasis on usability. A noteworthy trend is the development of educator-friendly tools that integrate advanced algorithms with intuitive interfaces for investigation. The tool Dolos (Maertens et al., 2022) exemplifies this direction. Dolos is a language-agnostic plagiarism detection platform that incorporates state-of-the-art similarity algorithms under the hood and presents results with interactive visualisations. It uses generic parsing models to support a wide range of programming languages, lowering the barrier for instructors to use it on varied assignments. In a benchmark on a standard dataset, Dolos was shown to outperform other plagiarism detection tools in identifying potential plagiarism cases. Its interactive interface then allows teachers to drill down into the results – for example, highlighting identical or similar code segments between two submissions and providing summary metrics. Such visual evidence helps in communicating and proving plagiarism cases to students or academic boards.
Another advanced system is PlaGate, proposed by Cosma and Joy (2012), which is designed not only to detect plagiarism but to aid in its investigation using latent semantic analysis (LSA). PlaGate can integrate with existing detectors like MOSS or JPlag: once those tools find candidate pairs of similar files, PlaGate applies LSA (a natural language processing technique) to the source code to identify the most significant common fragments. It then produces graphical representations that show how much each shared code fragment contributes to the overall similarity. In effect, it helps an investigator focus on the unusual coincidences (distinct fragments that appear only in the pair of suspect files and not elsewhere in the class) which constitute strong evidence of plagiarism. This addresses the earlier point about the burden of proof – by visualising the “smoking gun” sections of code, tools like PlaGate make it easier to argue a plagiarism case convincingly. A similar concern for user experience is seen in recent work by Liu et al. (2023), who designed a plagiarism detection system with a focus on teachers’ needs. Their system parses code into an intermediate form and uses the RKR–GST string tiling algorithm for similarity, effectively blending structural parsing with proven substring matching techniques. Importantly, they emphasise an easy-to-use interface and integration into classroom workflows (e.g. handling bulk submissions efficiently), acknowledging that even the best algorithm is of little use if instructors find the tool cumbersome. In summary, the cutting edge of source code plagiarism detection lies not in any single algorithm but in orchestrating multiple methods and delivering results in a human-interpretable way. Tools are becoming more language-flexible, more robust to tricky modifications, and more geared towards practical deployment at scale.
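An LSA-style comparison of code fragments can be sketched with scikit-learn. This is merely an illustration of the technique PlaGate builds on, not PlaGate itself; the fragment granularity, tokenisation, and component count are assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lsa_similarity(fragments, n_components=10):
    """Pairwise similarity of code fragments in a low-dimensional latent space.

    fragments: list of code fragment strings (e.g. every function extracted from
    the suspect pair and from the rest of the class). Fragments that score highly
    only against their counterpart in the suspect pair are the distinctive matches
    worth highlighting as evidence.
    """
    tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(fragments)
    latent = TruncatedSVD(n_components=n_components, random_state=0).fit_transform(tfidf)
    return cosine_similarity(latent)          # square matrix of fragment-to-fragment scores
```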
Discussion: effectiveness and ongoing work
The evolution of plagiarism detection techniques has substantially improved our ability to catch copied code, yet challenges remain. Traditional text-based methods are fast and catch blatant copying, but can miss cleverly disguised plagiarism. Structural and AST-based methods dig deeper but can struggle with extreme rewrites or added irrelevant code. Machine learning approaches promise high accuracy and adaptability, but they require extensive training data and can be opaque in their reasoning. Dynamic analysis is highly robust but may be too slow for large-scale use and is limited to functional code. Consequently, modern systems often blend approaches to balance these factors, as we have seen with integrated tools like Dolos and others.
One persistent issue is evasion by obfuscation. As detection improves, plagiarists find new ways to camouflage code. For example, inserting dummy code that does not affect functionality, or encoding logic in different ways (such as using different algorithms or APIs to achieve the same result) can still fool many detectors. Research continues on making detectors obfuscation-resilient. Techniques such as normalising code (e.g., sorting independent code blocks, stripping or abstracting identifiers, etc.) before comparison can mitigate some obfuscations. Machine learning models, especially those based on code embeddings or graph representations, tend to be more robust to renaming or reordering tricks because they infer a higher-level similarity. Indeed, recent detectors explicitly evaluate robustness against known obfuscation patterns as a metric for success. Future work is likely to incorporate even more semantic understanding, possibly leveraging advances in program analysis and AI (for instance, using graph neural networks on program dependence graphs, or employing large language models trained on code to judge similarity in logic).
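As a small example of the normalisation step mentioned above (for Python source, using the standard tokenize module), the sketch below drops comments and collapses identifiers and literals before comparison; it blunts renaming and re-commenting tricks but, as noted, does nothing against deeper rewrites such as swapping algorithms.

```python
import io
import keyword
import tokenize

SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def normalised_tokens(source):
    """Token stream with comments removed and identifiers/literals abstracted away."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")                  # every identifier looks the same
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")                 # every literal looks the same
        else:
            out.append(tok.string)            # keywords, operators, punctuation kept
    return out
```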
Another area of active development is cross-language plagiarism detection. Most current tools operate on one programming language at a time; however, a crafty plagiarist might translate a solution from, say, Java to Python, preserving the algorithm. Cross-language code similarity is a hard problem because syntax differs greatly, but the underlying logic might be the same. Some progress has been made using language-agnostic representations: for example, representing code as language-independent tokens, ASTs, or even universal intermediate languages (like an LLVM bytecode) to compare logic across languages. Fuzzy matching techniques and semantic similarity measures (such as cosine similarity on vector embeddings of code) have shown promise in detecting cross-language code reuse (Acampora & Cosma, 2015; Ramirez-de-la-Cruz et al., 2015). As more educational settings adopt multiple languages, we expect tools to incorporate cross-language checks.
Performance and scalability are also critical. When classes or code repositories contain thousands of programs, even an O(n²) pairwise comparison becomes untenable. Algorithms like winnowing fingerprints and clustering (as used in Mehsen & Joshi’s approach) help reduce comparisons by quickly eliminating dissimilar pairs. There is interest in employing efficient indexing or hashing of code representations so that potential matches can be found in sub-linear time, akin to how search engines index web pages. Moreover, cloud-based plagiarism detection services (such as MOSS’s online server or newer platforms) distribute the computation. With the rise of online learning and remote assessment – accelerated by the COVID-19 pandemic – efficient plagiarism detection has become even more crucial, as instructors may need to automatically screen hundreds of submissions on tight timelines. Research in this direction includes optimising algorithms for speed, as well as leveraging parallel processing and modern hardware (GPUs for neural network-based detectors, for example).
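One common way to avoid the all-pairs blow-up is to index fingerprints (such as the winnowed hashes sketched earlier) and only compare files that already share enough of them. A minimal sketch, with an arbitrary threshold:

```python
from collections import defaultdict
from itertools import combinations

def build_index(fingerprints_by_file):
    """Inverted index: fingerprint hash -> set of files containing it."""
    index = defaultdict(set)
    for path, prints in fingerprints_by_file.items():
        for fp in prints:
            index[fp].add(path)
    return index

def candidate_pairs(index, min_shared=5):
    """Only pairs sharing at least min_shared fingerprints go on to detailed comparison."""
    shared = defaultdict(int)
    for files in index.values():
        for pair in combinations(sorted(files), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}
```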
Finally, usability and integration remain important practical considerations. The best plagiarism detector is ineffective if educators or developers do not use it. Studies have noted that many teachers rarely use plagiarism detection tools, sometimes due to the effort required to interpret results or integrate the tool into their grading workflow. Addressing this, current systems provide better user interfaces, visual analytics, and integration with learning management systems. For example, Dolos’s interactive web interface and visual evidence, or Mehsen & Joshi’s development of a user-friendly GUI, aim to make these tools accessible and actionable in real-world settings. Future systems might incorporate more automation (e.g., automatic grouping of suspicious clusters of submissions) and even prevention mechanisms (like similarity checks during development, encouraging students to write their own code). There is also a growing conversation about academic honesty in the age of AI, as code generation tools (copilot-like systems or large language models) could be used by students – detecting AI-generated code or distinguishing it from student-written code might become part of the plagiarism detection landscape, though that extends beyond classic pairwise plagiarism into authenticity verification.
Wrapping up…
Source code plagiarism detection has advanced from simple text comparisons to sophisticated multi-faceted analyses combining lexical, structural, semantic, and dynamic techniques. This progress was driven by the need to stay ahead of increasingly crafty plagiarism tactics and to accommodate the diverse ways in which the same program logic can be expressed in code. Modern plagiarism detectors leverage everything from parsing and graph analysis to machine learning and clustering, achieving higher accuracy and robustness than ever before. They not only detect copied code with greater confidence, but also assist in presenting evidence – an important factor for real-world adoption. Academic research, including comprehensive reviews and innovative tools, has laid a strong foundation, and many of these ideas are transitioning into practical tools used in classrooms and industry.
Yet, the battle against code plagiarism is an ongoing one. Plagiarists will continue to find new ways to conceal copying, and detectors must continuously evolve. It is arguably a moving target that parallels the broader software engineering arms race between obfuscation and de-obfuscation techniques. The holy grail is a plagiarism detector that can truly recognise when two programs mean the same thing, regardless of how they are written – a goal aligned with advances in program synthesis and understanding. Reaching this goal will require further interdisciplinary research, combining static and dynamic analysis with machine intelligence and even insights from cognitive science on how humans recognise algorithmic similarity. The challenges of scalability, cross-language detection, and user adoption must also be met to make any technical breakthroughs widely useful.
In conclusion, detecting source code plagiarism is a technically demanding task crucial for maintaining integrity in computer science education and beyond. Thanks to the intensive research over the past decades, we now have a toolkit of effective methods: from token-based string matching to AST and graph analysis, and from ML-driven classifiers to execution-based comparison. These methods complement each other, and the most powerful systems integrate multiple approaches. By continuing to refine these techniques and addressing new forms of cheating, the community strives to ensure that honest coding and originality are appropriately rewarded. The ongoing work in this field not only deters academic dishonesty but also contributes to advances in program analysis and similarity detection that have wider applications in software engineering. The pursuit of ever more reliable code plagiarism detection will undoubtedly persist, mirroring our evolving understanding of code, language, and algorithmic equivalence.
References
- Maertens, R., Van Petegem, C., Strijbol, N., Baeyens, T., Jacobs, A., Dawyndt, P., & Mesuere, B., 2022. Dolos: Language-agnostic plagiarism detection in source code. Journal of Computer Assisted Learning, 38(4), pp.1046–1061. https://doi.org/10.1111/jcal.12662
- Novak, M., Joy, M., & Kermek, D., 2019. Source-code similarity detection and detection tools used in academia: A systematic review. ACM Transactions on Computing Education (TOCE), 19(3), Article 27. https://doi.org/10.1145/3313290
- Cheers, H., Lin, Y., & Smith, S., 2021. Academic source code plagiarism detection by measuring program behavioural similarity. IEEE Access, 9, pp.50391–50412. https://doi.org/10.1109/ACCESS.2021.3069367
- Eppa, A., & Murali, A., 2022. Source code plagiarism detection: A machine intelligence approach. Proceedings of the 2022 IEEE Fourth International Conference on Advances in Electronics, Computers and Communications (ICAECC), pp.1–7. https://doi.org/10.1109/ICAECC54045.2022.9716671
- Kustanto, C., & Liem, I., 2009. Automatic source code plagiarism detection. Proceedings of the 10th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp.481–486. https://doi.org/10.1109/SNPD.2009.62
- Liu, T., Zhao, Z., Fang, H., Huang, Q., & Zhang, W., 2023. Design and implementation of code plagiarism detection system. Proceedings of the 2023 4th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), pp.188–195. https://doi.org/10.1109/AINIT59027.2023.10212887
- Yasaswi, J., Katta, B., & Purini, S., 2018. Machine learning for source-code plagiarism detection. Master’s Thesis, International Institute of Information Technology, Hyderabad, India.
- Mehsen, R., & Joshi, H., 2024. Detection of source code plagiarism utilizing an approach based on machine learning. International Journal of Computing, 23(1), pp.78–84. https://doi.org/10.47839/ijc.23.1.3438
- Mei, Z., 2011. AST-based code plagiarism detection method. Application Research of Computers, 28(10), pp.3775–3778. (In Chinese).
- Cosma, G., & Joy, M., 2012. An approach to source-code plagiarism detection and investigation using latent semantic analysis. IEEE Transactions on Computers, 61(3), pp.379–394. https://doi.org/10.1109/TC.2011.223