Stylometric methods for plagiarism detection: an authorship attribution approach

Summary:

  • Stylometry shifts plagiarism detection from content-matching to analysing unique writing styles.
  • It can reveal ghostwriting, contract cheating, and stylistic inconsistencies missed by traditional tools.
  • While effective, it faces challenges with short texts, style variation, and deliberate obfuscation.

Plagiarism detection is a critical component of academic integrity and intellectual property protection. Traditional plagiarism detection systems rely on text-matching algorithms: these tools find identical or highly similar passages between a submitted work and existing sources. However, such methods can fail to detect more subtle forms of plagiarism, such as paraphrasing, ghostwriting, or contract cheating. In these cases, an essay or article may be original in terms of content yet actually authored by someone other than its purported author.

A different strategy is therefore necessary to expose these inconsistencies. Stylometric analysis, which examines writing style features rather than content, has emerged as a powerful approach to address this challenge. Stylometry focuses on the unique linguistic fingerprint of an author: the patterns in syntax, vocabulary, and composition that tend to remain relatively consistent across their works.

This article explores stylometric methods of plagiarism detection. It focuses specifically on authorship attribution techniques that detect variations in writing style to identify copied or misrepresented work. The discussion highlights how these approaches can uncover subtle plagiarism (for example, ghostwritten assignments). It also examines the technical foundations, applications, and limitations of stylometric analysis in an academic context.

Stylometry and authorship attribution

Stylometry is the computational study of linguistic style. Analysts use stylometric techniques to determine authorship based on measurable writing characteristics. It operates on the principle that every writer has distinctive habits. These tendencies manifest, consciously or unconsciously, in their use of words, sentence structures, punctuation, and other elements of text.

A famous example is the analysis of the Federalist Papers in the 1960s. In that case, Mosteller and Wallace applied statistical methods to determine which of the Founding Fathers wrote disputed essays by examining the frequency of certain function words (Mosteller and Wallace, 1963).

More recently, stylometric techniques were responsible for revealing novelist J. K. Rowling as the real author behind the pseudonym Robert Galbraith. Analysts achieved this by comparing the linguistic patterns of the mystery novel to Rowling’s known writing style (Juola, 2017).

These cases illustrate the core goal of authorship attribution. Given a piece of text of unknown or disputed origin, the aim is to determine the most likely author based on stylistic traits rather than content (Neal et al., 2018).

In the context of plagiarism detection, stylometry-based authorship analysis is used to identify whether the claimed author of a document is genuine. This approach can be applied in two closely related tasks.

The first is closed-set authorship attribution. In this scenario, an unknown text is compared against writing samples from a set of candidate authors to find the best match. The second is authorship verification – the task of deciding whether two documents were written by the same person (Neal et al., 2018).

For plagiarism detection in academic settings, the verification scenario is especially relevant. For example, suppose a student submits an essay that is stylistically inconsistent with their previous assignments. A stylometric verification algorithm can compare the new essay with the student’s known writing samples. If the style differs significantly, this discrepancy may indicate that someone else wrote the essay, suggesting ghostwriting or contract cheating. Therefore, stylometry provides an intrinsic way to assess authorship authenticity even when no direct copy of the content can be found elsewhere.

It is important to note that stylometric analysis essentially serves as an intrinsic plagiarism detection method. It evaluates the writing style within one document or across documents from the same author to search for anomalies. In intrinsic analysis, the system flags segments of text that deviate from the rest of the document’s style. This approach can reveal passages that appear to originate from a different author, or inconsistencies that might result from collusion (Stein et al., 2011).

Such techniques are complementary to traditional plagiarism checkers. While text-matching can catch explicit copying, stylometric methods can catch cases where the content is original or paraphrased but the writing style is suspect.

Stylometric features and linguistic fingerprints

Stylometric attribution relies on quantifying aspects of writing style. Scholars have developed a rich set of stylometric features to capture the essence of an individual’s writing. These features can be broadly categorised into lexical, syntactic, structural, and semantic measures (Neal et al., 2018).

Lexical features are based on the distribution of characters and words in the text. They include simple metrics such as average word length, sentence length, and vocabulary richness. Stylometric analysis also considers frequency counts of particular words or character n-grams. Function words (common words like “and”, “the”, “but”) are very telling.

Authors tend to use them in unconscious patterns. A classic lexical signature is the relative frequency of certain function words or pairs of words. For example, one author might consistently use “while” whereas another prefers “whilst” in the same contexts (Mosteller and Wallace, 1963; Juola, 2017). These subtle preferences become statistical markers of identity.

Lexical analysis is robust to minor spelling or grammatical errors. It often provides a foundation for stylometry, because even paraphrased or translated text will carry over many low-level linguistic habits of the original writer.
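As a concrete illustration, the basic lexical markers described above can be computed with a few lines of standard-library Python. This is a minimal sketch assuming non-empty English text; the function name and the particular function words chosen are illustrative, not a fixed standard.

```python
import re
from collections import Counter

def lexical_features(text):
    """Compute a few simple lexical style markers from raw text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
        # type-token ratio: a crude measure of vocabulary richness
        "type_token_ratio": len(counts) / len(words),
        # relative frequency of a few common function words
        **{f"freq_{fw}": counts[fw] / len(words)
           for fw in ("and", "the", "but", "of")},
    }
```

In a real system each document would be reduced to a vector of dozens of such measurements, but even this handful already distinguishes terse writers from verbose ones.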

Syntactic features add another layer by examining the structure of sentences. This category includes usage patterns of punctuation (for example, how often an author uses commas or semicolons). It also encompasses part-of-speech tag frequencies (how frequently nouns, verbs, adjectives occur) and common phrase structures.

Two authors may convey the same idea with different syntactic structures. For example, one writer might favour complex multi-clause sentences, whereas another tends to write shorter, more straightforward sentences. Such tendencies can be quantified through formal measures, such as the average sentence complexity or the frequency of subordinate clauses. Syntactic traits are considered relatively difficult for an author to consciously alter, so they provide another reliable signature.
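A few of these syntactic proxies can be measured without a full parser. The sketch below uses punctuation counts only – a real system would add part-of-speech tag frequencies from a tagger – and the comma-per-sentence ratio as a rough stand-in for clause complexity; all names here are illustrative assumptions.

```python
import re

def syntactic_features(text):
    """Punctuation-based style markers (no parser required)."""
    n_chars = max(len(text), 1)
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "comma_rate": text.count(",") / n_chars,
        "semicolon_rate": text.count(";") / n_chars,
        # commas per sentence as a crude proxy for clause complexity
        "commas_per_sentence": text.count(",") / max(len(sentences), 1),
    }
```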

Structural features refer to document-level style and formatting choices. Structural patterns in academic writing include how references are formatted and whether the author consistently includes certain sections. For instance, one student might always start an essay with a brief outline, whereas another dives straight into the introduction. Perhaps only one of them habitually writes in British English spelling while the other uses American spelling. These patterns fall under structural style and can be incorporated into a stylometric profile (Neal et al., 2018; Sarwar et al., 2018).

Beyond lexical, syntactic, and structural signals, researchers have also explored higher-level semantic and idiosyncratic features. Semantic analysis might include examining word choice preferences or topic-specific vocabulary usage. Even if two authors write about the same subject, the particular words and metaphors they choose can differ. Idiosyncratic habits (such as overusing certain phrases or asking rhetorical questions) also contribute to an author’s fingerprint.

Some modern stylometric systems incorporate psycholinguistic features as well (Athira and Thampi, 2018). For example, they might measure how frequently an author uses words from various psychological categories or whether the tone is formal or colloquial. These attributes can be derived from dictionaries like LIWC (Linguistic Inquiry and Word Count) to add another dimension to the style profile.

Crucially, effective stylometric analysis often combines many such features to build a comprehensive representation of style. Individual features might not be unique – many authors may have a similar average sentence length, for instance – but the combination of dozens or hundreds of markers yields a distinctive fingerprint.

In practice, researchers typically preprocess the texts before analysis. This may involve converting all words to lowercase, removing punctuation (for lexical analysis), and sometimes filtering out rare words or correcting obvious typos (Neal et al., 2018). Such normalisation ensures that the features reflect genuine stylistic choices and not irrelevant differences.
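A minimal version of this preprocessing step might look as follows – lowercasing, stripping punctuation, and dropping words that occur too rarely to be stylistically informative. The `min_count` cut-off is an illustrative assumption, not a standard value.

```python
import re
from collections import Counter

def normalise(text, min_count=2):
    """Lowercase, strip punctuation, and drop very rare words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [w for w in words if counts[w] >= min_count]
```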

Machine learning techniques for style analysis

Translating stylistic fingerprints into a decision about authorship involves statistical and machine learning techniques. In earlier stylometric studies, researchers used simple statistical methods. For example, they often compared word frequency vectors using cosine similarity or Pearson correlation, or applied chi-square tests on word usage (Neal et al., 2018). Modern approaches have expanded to include a variety of supervised and unsupervised learning algorithms that can handle high-dimensional style features.
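The cosine-similarity comparison mentioned above is straightforward to sketch: each text becomes a word-frequency vector, and the cosine of the angle between the two vectors measures stylistic overlap. This toy version splits on whitespace; a real pipeline would use the normalised feature vectors described earlier.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw word-frequency vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```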

One fundamental approach is to treat authorship attribution as a classification problem. Given a set of documents by known authors (the training set), we can train a classifier to recognise each author’s style based on the features described above.

Machine learning classifiers such as support vector machines (SVMs), logistic regression, and random forests have all been successfully applied to this task (Stamatatos, 2009).

Each author in the training data constitutes a class. The feature patterns from their texts form the basis to distinguish that class. When a newly submitted essay is automatically classified, if the predicted author is not the student who submitted it, this discrepancy is a red flag indicating possible ghostwriting.

Another important method is the profile-based approach to authorship attribution (Stamatatos, 2009). Instead of treating each known document separately, all known writings of a particular author are merged into a cumulative profile. The algorithm then compares the unknown document to each author’s profile using a distance or similarity metric.

One influential metric in stylometry is Burrows’s Delta, which measures the difference in word frequency distributions between texts. Burrows’s Delta and its variations have proven effective for authorship attribution, especially in literary analysis tasks (García and Martín, 2012). Essentially, the algorithm calculates how “far” an unknown text’s style is from each candidate author’s style and picks the nearest match. Profile-based methods can be robust when dealing with limited data per author, as they make use of all available writing from each candidate to form a representative style signature.
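Burrows's Delta can be sketched compactly: relative frequencies of a fixed word list are converted to z-scores against the candidate corpus, and Delta is the mean absolute z-score difference between the unknown text and each author's profile. This is a simplified version that treats one text per author as that author's merged profile; the function names are illustrative.

```python
import statistics
from collections import Counter

def rel_freqs(text, vocab):
    """Relative frequency of each vocabulary word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocab]

def burrows_delta(unknown, candidates, vocab):
    """Attribute `unknown` to the candidate with the smallest Delta.

    `candidates` maps author name -> merged profile text; corpus means and
    standard deviations per word are estimated from the candidate profiles.
    """
    profiles = {a: rel_freqs(t, vocab) for a, t in candidates.items()}
    cols = list(zip(*profiles.values()))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    u_z = [(f - m) / s for f, m, s in zip(rel_freqs(unknown, vocab), means, sds)]
    deltas = {}
    for author, prof in profiles.items():
        p_z = [(f - m) / s for f, m, s in zip(prof, means, sds)]
        deltas[author] = sum(abs(a - b) for a, b in zip(u_z, p_z)) / len(vocab)
    return min(deltas, key=deltas.get), deltas
```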

Clustering and unsupervised learning are also used in stylometric analysis. Clustering algorithms (such as hierarchical clustering) can group documents by writing style without prior labels (Ison, 2020). If one student’s submitted assignments naturally cluster together but one assignment falls into a different cluster, it suggests a different writing style for that work.

Outlier detection methods operate on a similar principle. They flag any piece of writing that lies outside the stylistic norm of a student’s work for further scrutiny (Ouriginal, 2021). Some plagiarism detection tools implement this by computing a “profile” for each student and then identifying submissions that deviate significantly from that profile. For example, a piece of coursework that is an extreme outlier compared to a student’s usual writing may be highlighted for the instructor.
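The outlier check described above reduces, in its simplest form, to a z-score test against the student's own history for a given feature (say, average sentence length). This is a deliberately minimal sketch – production tools combine many features – and the 2-standard-deviation threshold is an illustrative choice.

```python
import statistics

def flag_outlier(history, new_value, threshold=2.0):
    """Flag a new feature value that deviates > threshold SDs
    from a student's historical values for that feature."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)  # requires at least two prior samples
    z = (new_value - mean) / sd if sd else 0.0
    return abs(z) > threshold, z
```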

Research in stylometry has also embraced more advanced techniques like neural networks. These models can learn to encode writing style in high-dimensional representations, potentially capturing subtle sequential patterns. However, these models usually require a lot of data to train effectively. This can be a limitation in many practical authorship verification cases where only a few writing samples per author are available.

A hybrid approach that has shown promise is to use deep learning to extract features – for example, using embeddings or autoencoders to capture stylistic nuances. These features can then be fed into a simpler classifier to make the final attribution decision (Posadas-Durán et al., 2017).

Regardless of the technique, the trend in recent years has been to combine multiple methods in an ensemble or through stacked generalisation. In other words, analysts let different algorithms “vote” on the authorship decision.

Patrick Juola and colleagues, for instance, developed an ensemble framework implemented in their tool JGAAP (Java Graphical Authorship Attribution Program). This system tries a variety of algorithms and features for a given problem (Juola et al., 2006). Such flexibility is valuable because the optimal feature set or algorithm may differ by context (email messages, formal essays, social media posts all have different characteristics).

A noteworthy point in authorship analysis for plagiarism detection is the handling of adversarial situations. As noted earlier, if a writer knows that stylometric techniques might be used, they may attempt to disguise their style (Neal et al., 2018). This phenomenon, known as adversarial stylometry or obfuscation, can involve deliberately altering writing habits or trying to imitate someone else’s style.

For example, a student who hires a ghostwriter might ask them to introduce a few spelling mistakes. The ghostwriter might also shorten sentences deliberately to mimic the student’s style. Alternatively, a student writing their own paper might try to copy the style of a source text to avoid detection of direct copying. Some advanced detection methods address this by focusing on features that are harder to consciously manipulate and by using comparative evaluation.

One technique, called “unmasking”, gradually removes the most obvious stylistic features and then checks how differences between two texts persist. If the same author wrote both documents, removing distinguishing features will eventually make them indistinguishable. In contrast, if different authors wrote them, the differences remain significant (Koppel et al., 2007).
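The idea can be sketched in a heavily simplified, distance-based form (Koppel et al.'s original unmasking trains a classifier at each round; here we just track how much frequency difference survives as the most distinguishing words are removed). All names and parameters below are illustrative assumptions.

```python
from collections import Counter

def unmasking_curve(text_a, text_b, rounds=4, drop_per_round=2):
    """Repeatedly remove the words that differ most between two texts
    and record how much difference remains.  Same-author pairs tend to
    collapse towards zero quickly; different-author pairs stay apart."""
    fa, fb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    na, nb = sum(fa.values()), sum(fb.values())
    diffs = {w: abs(fa[w] / na - fb[w] / nb) for w in set(fa) | set(fb)}
    curve = []
    for _ in range(rounds):
        curve.append(sum(diffs.values()))
        # drop the currently most distinguishing features
        for w in sorted(diffs, key=diffs.get, reverse=True)[:drop_per_round]:
            del diffs[w]
    return curve
```

The diagnostic signal is the shape of the curve, not any single value: a steep collapse suggests the differences were superficial.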

Approaches like unmasking and the use of multiple feature types can mitigate simple attempts to evade detection. Nonetheless, adversarial plagiarism remains a cat-and-mouse game. As detection methods improve, so do the tactics for concealing true authorship.

Detecting ghostwriting and contract cheating

Perhaps the most pressing application of stylometric plagiarism detection in education today is identifying ghostwritten assignments. Ghostwriting (also known as contract cheating when a student pays a third party to complete their work) has become a widespread concern in universities.

By definition, a ghostwritten essay is original content. It will typically not trigger any alarms in standard plagiarism checkers because it has not been published elsewhere. Yet it is still academic misconduct, since the student submitting it is not the true author. Stylometry provides a way to catch such cases by detecting that the writing style of the assignment does not match the student’s known style.

In practical terms, to detect ghostwriting, you must have previous writing samples from the student for comparison. These could be earlier assignments, exam essays, or any other authentic writing by the student. The suspected document is then compared against the student’s profile.

One simple indicator can be a sudden shift in writing quality or complexity. Educators often intuitively notice when a student’s work seems far more sophisticated or polished than their usual submissions. Stylometric analysis quantifies this intuition.

For instance, suppose a student typically writes short, straightforward sentences and uses a fairly limited vocabulary. If their final paper contains numerous complex sentences and advanced vocabulary, a stylometric profile will capture that discrepancy (Crockett and Best, 2020).

Outlier detection methods will flag this new document as a stylistic outlier compared to the student’s earlier work (Ouriginal, 2021). This flag does not prove cheating by itself, but it alerts instructors to investigate further.

Recent studies have demonstrated the efficacy of stylometric methods for ghostwriting detection. In one case study, researchers analysed a portfolio of 20 assignments from a single student, and found that various contract cheating services had in fact ghostwritten eight of those assignments (Crockett and Best, 2020). By examining word and bigram frequency patterns, they were able to cluster the assignments into distinct stylistic groups. Notably, the known ghostwritten pieces grouped separately from the student-written ones.

The analysis even suggested that some of the remaining assignments – not initially known to be outsourced – likely came from the same ghostwriters. The ghostwritten papers shared stylistic hallmarks. For example, the ghostwriters used punctuation more consistently and employed a more uniform level of formal language, indicative of a professional “house style” (Crockett and Best, 2020). In contrast, the student’s own writing had more irregularities, such as inconsistent capitalisation and varying levels of formality.

This study highlighted an important finding. Professional ghostwriters, even when attempting to imitate a student, tend to write in a more correct and polished style than the average student. Stylometric evidence allowed the investigators to conclude, on the balance of probabilities, that the student could not have authored all of the submissions – a clear indication of contract cheating (Crockett and Best, 2020).

Another pilot study evaluated the use of off-the-shelf stylometry software to detect contract cheating in student papers. In that research, several stylometric tools were tested on pairs of genuine student writing and simulated ghostwritten samples (Ison, 2020). The results were promising. Depending on the tool and writing scenario, accuracy ranged from about 33% up to 88%, with the best results when ample training text from the genuine student author was available (Ison, 2020). Even though performance was variable, the top-end accuracy illustrates that with the right approach, stylometry can dramatically outperform random chance in spotting ghostwritten work.

Notably, one challenge observed was that short texts reduced accuracy. This is a common issue in stylometry, since less text provides fewer style markers. Nonetheless, the trend is clear: as algorithms improve and more linguistic features are incorporated, stylometric detection of ghostwriting is becoming increasingly feasible.

Educational technology providers have taken notice of these advances. For example, Turnitin – known for text-matching plagiarism software – launched an Authorship Investigate tool. It is aimed at flagging writing that might not come from the claimed student (Turnitin, 2019).

Similarly, another platform called Ouriginal (a merger of Urkund and PlagScan) has developed stylometry-based indicators. Their approach involves comparing a suspicious paper against a cohort of peer submissions. They found that genuine student work tends to cluster together in stylistic terms, whereas a ghostwritten paper might stand out as a clear outlier across multiple metrics (Ouriginal, 2021).

For instance, most students in a class make occasional grammar mistakes or have a modest vocabulary range. If one paper is entirely error-free and lexically rich, it will lie at the extreme end of the class distribution and thus warrant a closer look. These tools do not provide a definitive verdict; rather, they offer evidence to support a human-led investigation. By using stylometry as a screening mechanism, instructors and academic integrity officers can prioritise which submissions to scrutinise for potential contract cheating.

Stylometric methods can also detect subtler forms of plagiarism beyond purchased essays. Consider a student who attempts to hide plagiarism by heavily paraphrasing text from a source. Traditional detectors might not catch it if few exact phrases remain the same. However, if the paraphrased section is inserted into a larger document the student wrote, it might carry a different stylistic signature. Intrinsic analysis can reveal these inconsistencies. The paraphrased section might have a different readability level, a different use of function words, or other tell-tale differences that mark it as likely coming from a different author.

In one approach, a long document can be segmented and each segment’s style compared to the rest. Segments that are statistically deviant (for example, significantly higher vocabulary complexity or a sudden change in sentence rhythm) can be flagged for closer inspection (Stein et al., 2011). This is useful for catching cases of patchwriting, where a student interweaves their own writing with segments adapted from sources.

Stylometry-based segmentation algorithms are often coupled with outlier detection. This combination has shown the ability to spot internal inconsistencies that might indicate plagiarised passages.
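A toy version of this segmentation-plus-outlier approach can be sketched with a single feature. Here, average word length stands in for a fuller style vector, windows of sentences are compared to the document-wide mean, and the 1.5-SD threshold and all names are illustrative assumptions.

```python
import re
import statistics

def flag_segments(text, window=3, threshold=1.5):
    """Split a document into sentence windows and flag windows whose
    average word length deviates strongly from the document mean."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    segments = [" ".join(sentences[i:i + window])
                for i in range(0, len(sentences), window)]
    def avg_word_len(seg):
        words = seg.split()
        return sum(len(w) for w in words) / len(words)
    scores = [avg_word_len(s) for s in segments]
    mean, sd = statistics.mean(scores), statistics.pstdev(scores) or 1.0
    # return the indices of stylistically deviant windows
    return [i for i, sc in enumerate(scores) if abs(sc - mean) / sd > threshold]
```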

Challenges and limitations

While stylometric plagiarism detection is a powerful technique, it comes with several challenges and limitations.

First, the reliability of authorship attribution improves with the length of the texts under analysis. Many early stylometric methods were designed for literary works or long articles, where tens of thousands of words were available. In contrast, student assignments or essays might be only a few hundred words long, which limits the amount of stylistic evidence. This sparsity makes it harder to draw firm conclusions and increases the chance of false positives or false negatives.

Researchers are actively working on improving style detection for short texts. For example, they are investigating which features remain stable even in smaller writing samples. Another approach is to aggregate multiple short texts by the same author to build a composite profile (Crockett and Best, 2020).

Another difficulty lies in the variability of an individual’s writing. A student’s writing in a lab report might look different from their writing in a reflective essay for an English class. If stylometry does not account for these normal variations, it could mistakenly flag genuine work as suspicious. This might happen simply because the style needed to change for the task.

To mitigate this, advanced systems incorporate some degree of genre or topic awareness. One strategy is to compare writing only within similar contexts. For example, a student’s lab reports should be compared to other lab reports rather than to their creative writing assignments.

Additionally, analysts incorporate more high-level features like content-independent patterns (e.g. function word usage, which tends to remain constant regardless of topic). This can help reduce the impact of topic-induced variation (Sarwar et al., 2018).

Perhaps the most challenging aspect is dealing with intentional style obfuscation. As noted earlier, if a student or a ghostwriter actively tries to mask their style, some of the simpler features might be altered. It is worth emphasising that stylometric evidence is usually not treated as irrefutable proof, but rather as supporting evidence. In academic integrity proceedings, findings from stylometry software are typically combined with other indicators. These might include a sudden jump in grades, the student’s lack of familiarity with the submitted work when questioned, or inconsistencies in references (Rogerson, 2017).

Stylometry might show that an essay is highly inconsistent with a student’s prior writing. However, an investigator would likely seek a confession or other corroborating evidence before rendering a verdict of plagiarism or contract cheating. This cautious approach is necessary for fairness, and because stylometric conclusions are probabilistic. They operate on “the balance of probabilities” (Crockett and Best, 2020). They do not offer the absolute certainty that direct copy-paste plagiarism might provide.

Moreover, there are privacy and ethical considerations. Building stylometric profiles of students involves collecting and analysing their writing over time. Institutions must ensure that they handle such data responsibly and that students’ rights are respected.

There is also the question of consent and transparency. Should students be made aware that their writing style is being tracked? Some argue that simply knowing about the capability of stylometric checks can deter would-be cheaters. However, it also might drive contract cheating services to advertise “style-matched” ghostwriting, where the ghostwriter tries to learn and imitate the client’s style. This development would complicate detection further.

Despite these challenges, the field continues to advance. Ongoing research is focusing on multi-language stylometry, so that a student writing in a second language can still be analysed effectively. Another active area is cross-domain stylometry, which applies an author’s profile to detect their work across different genres or topics.

Researchers are also looking at ways to integrate stylometry with other types of evidence or signals. For example, if an assignment is suspected to be ghostwritten, investigators might also examine metadata clues. Document properties, formatting quirks, or typing patterns can be used in tandem with style analysis to strengthen the case (Rogerson, 2017; Crockett and Best, 2020).

There is also interest in using stylometry to detect machine-generated plagiarism, such as content produced by AI language models. This scenario introduces another layer of complexity, since the “author” in that case is not human. Nonetheless, the core principles remain applicable – every source of text, whether human or machine, has characteristic features that can potentially give it away.

Conclusion

Stylometric methods have become an indispensable part of the modern plagiarism detection arsenal. They provide a means to uncover misconduct that escapes traditional similarity checks. By focusing on how something is written rather than what is written, stylometry allows investigators to attribute authorship and detect inconsistencies. It can reveal cases of ghostwriting, contract cheating, and cleverly disguised plagiarism.

Authorship attribution techniques leverage a wide array of linguistic features – from simple word frequencies and sentence lengths to complex syntactic patterns and beyond. They use these markers to create a fingerprint of an individual’s writing style. Machine learning algorithms then compare these fingerprints. This enables the detection of anomalies where a document’s style does not match the purported author. We have seen that these methods can identify ghostwritten essays with notable success. Moreover, their effectiveness is evident from both research studies and real-world deployment in plagiarism detection tools.

At the same time, stylometric analysis is not infallible. It works best as part of a holistic approach to plagiarism detection, complementing direct text matching and human judgement. When used wisely, it provides early warnings and evidence that can prompt further investigation. The field is continually evolving, with researchers addressing current limitations such as short document lengths and intentional style masking. The goal is to improve accuracy, fairness, and reliability so that honest authors are protected and deceptive practices are exposed.

In an academic climate where contract cheating and sophisticated plagiarism are on the rise, stylometric techniques offer a robust, scientific approach to upholding integrity. They serve as a reminder that content can be faked or borrowed, but the unique signature of an author’s voice is much harder to hide.
