Let’s face it: even the most expert journal editors can’t keep up with the entire canon of their field, let alone memorize all of it. So as the rate of scholarly publishing increases, more and more editorial teams are turning to iThenticate from Turnitin and Similarity Check, Crossref’s sister service for its members, to scan incoming submissions for possible plagiarism.
Of course, while referred to as “plagiarism detection software,” these tools can’t definitively determine whether a work contains plagiarism. In fact, as noted in one of Turnitin’s help guides, “Turnitin actually does not check for plagiarism […].” Instead, the software checks submissions against its vast database of academic and general content to flag cases of similar writing. Ultimately, it’s up to editors to review similarity findings and decide whether any are of concern, which can be time-consuming if they don’t know the best ways to format and interpret similarity reports.
So how should editors approach similarity report review to ensure accuracy while maximizing efficiency? And what are the latest plagiarism software features users can leverage? To get answers to FAQs, we spoke with Fabienne Michaud, Product Manager at Crossref, who focuses on Similarity Check development. Similarity Check is built on the newest version of iThenticate, iThenticate v2, which is currently only available to Similarity Check subscribers, making it one of the most advanced plagiarism detection systems available.
Before we dive into the interview, we’ll quickly walk through how to set up and interpret similarity reports for those new to it. If you’re a seasoned pro, you can scroll straight to the FAQs for in-depth insights and recommendations.
Quick guide: How to set up and read similarity reports (with examples)
1. Sign up for iThenticate or upgrade to iThenticate v2 if you haven’t already
Of course, before you can review similarity reports, you’ll need to sign up for iThenticate. You can do this either: 1) directly via the iThenticate website OR 2) via Crossref’s sister service Similarity Check, which offers Crossref members reduced-rate access to the latest version of iThenticate (v2), currently only available through Similarity Check. If you have an existing Similarity Check account, you’ll need to check whether you’re using iThenticate v1 or v2 and upgrade to v2 if you haven’t already to get all of the latest features.
Crossref members interested in Similarity Check should know that to remain eligible for that service, they must include full-text Similarity Check URLs in the metadata of at least 90% of past and future Crossref Digital Object Identifier (DOI) deposits across their journal prefixes. That metadata is then indexed by Turnitin, helping to expand its content database and providing Similarity Check members peace of mind that any similarity between their content and manuscripts checked by other publishers will be flagged.
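If you handle your own Crossref deposits, the full-text URLs that Similarity Check relies on live inside a crawler-based collection element in each DOI’s metadata. Below is a minimal Python sketch of that fragment, using only the standard library; the example URL is a placeholder, and you should confirm the element names and placement against Crossref’s current deposit schema before adapting it.

```python
import xml.etree.ElementTree as ET

def similarity_check_fragment(fulltext_url: str) -> ET.Element:
    """Build the crawler-based collection fragment that carries the
    full-text URL Turnitin indexes for Similarity Check. This sits inside
    the <doi_data> element of a Crossref deposit; element names follow
    Crossref's documented pattern for the iParadigms crawler, but verify
    against the current schema before relying on them."""
    collection = ET.Element("collection", {"property": "crawler-based"})
    item = ET.SubElement(collection, "item", {"crawler": "iParadigms"})
    resource = ET.SubElement(item, "resource")
    resource.text = fulltext_url  # publicly crawlable full-text URL (placeholder)
    return collection

# Hypothetical example for a single article
fragment = similarity_check_fragment("https://example.com/articles/123/fulltext.pdf")
print(ET.tostring(fragment, encoding="unicode"))
```

Keep in mind that at least 90% of your registered DOIs need this full-text URL metadata for your Similarity Check membership to remain eligible.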
With the discounted access to the latest version of iThenticate and the added content indexing benefits Similarity Check offers, that option is well worth considering if you’re a Crossref member or thinking about becoming one.
Similarity Check can also be integrated with Manuscript Tracking Systems (MTS), including Scholastica’s Peer Review System, to have similarity reports auto-generated for incoming submissions. Scholastica’s OA Publishing Platform also comes with a Crossref DOI registration integration option, which includes full-text Similarity Check URLs in metadata deposits to help Crossref members meet that Similarity Check requirement. And we can help with depositing Similarity Check URLs for back-issue imports. You can learn more about Scholastica’s ready-to-go Similarity Check integration here.
2. Configure your account and set exclusions as needed
Once you’ve subscribed to iThenticate either directly or via the discounted Similarity Check service, it’s time to get your account configured (e.g., add users, choose submission indexing options, etc.) and apply similarity report exclusions as desired. You can find a complete guide to iThenticate account configuration here and Similarity Check account configuration here.
In this section, we’ll focus on the many exclusion options you can apply universally from your configuration settings or ad hoc within the similarity report viewer. Exclusions enable Similarity Check users to skip similarity-checking steps that aren’t relevant to their plagiarism detection processes or that create too much noise. To start, publishers/journals can decide whether to include or exclude content from any of the following repositories in their similarity checks:
- Crossref: Research articles, books, and conference proceedings provided by publishers of scholarly content all over the world
- Crossref posted content: Preprints, e-prints, working papers, reports, dissertations, and many other types of content that have not been formally published but are registered with Crossref
- Internet: A database of archived and live publicly-available web pages, including billions of pages of existing content, and with tens of thousands of new pages added each day
- Publications: Third-party periodical, journal, and publication content, including many major professional journals and business publications from sources other than Crossref Similarity Check members
- Your Indexed Documents: Other documents Similarity Check users have uploaded for checking (available only within the individual Crossref Similarity Check user’s account; these sources are not added to iThenticate’s main indexes)
From there, there are options to exclude any of the following from similarity matching at both the publisher/admin and individual report level:
- Quotes: Anything in quotation marks
- Bibliography: Bibliography or equivalent references/works cited sections
- Abstract: The abstract section of submissions
- Methods and materials: The methods and materials section of submissions
- Citations: Using machine learning, Similarity Check identifies and excludes inline citations in the APA, MLA, and Turabian style formats
- Preprints (new to iThenticate v2): Content posted to preprint repositories (e.g., arXiv)
- Small matches: Instances of similarity spanning eight words or fewer (with options to increase or decrease that word-count threshold)
- Custom sections: Allows administrators to exclude custom sections of manuscripts using available templates (e.g., acknowledgments) or by creating their own
- Websites: Allows administrators to exclude all content from specific websites
There is no “right” answer for what you should or should not exclude from similarity reports. It’s ultimately up to publishers and editors to decide based on their journals’ disciplinary norms and unique needs. For example, journals in the biomedical and life sciences may exclude methods and materials sections because they expect authors to reuse them across papers. We delve into advice for choosing exclusions in the FAQs below.
3. Implement text overlap policies per your plagiarism detection process
Once you’ve decided which exclusions to implement at the admin or individual report level and applied admin-level settings, it’s time to start reviewing similarity reports for incoming submissions. Each report consists of two parts: 1) a “Similarity Score,” or cumulative percentage of the amount of text in the manuscript that overlaps with existing content, and 2) a text overlap analysis that links to documents with detected similarity.
It’s essential for editors to review both report outputs in context, as a high Similarity Score does not necessarily indicate plagiarism, and not all text overlaps are cause for concern. Within the text overlap report view, editors can see a summary of all similarities found and click each source to view individual comparisons. They can then apply exclusions at the report level as desired to clear out noise and home in on any signs of possible plagiarism. Editors should look to single-source similarity percentages in the report view to spot potential areas of concern. For example, if a document has a relatively low overall Similarity Score of 10%, but it’s all attributed to a single source that is not a direct quotation, that warrants a closer look.
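To make that kind of triage concrete, here is a minimal Python sketch of one way an editorial team might screen report summaries pulled into their own workflow: it flags reports where a single source accounts for most of the overall score, mirroring the 10%-from-one-source example above. The function name, the 0.8 dominance ratio, and the data shape are illustrative assumptions, not iThenticate settings or outputs.

```python
def dominant_sources(overall_score: float, per_source: dict[str, float],
                     dominance_ratio: float = 0.8) -> list[str]:
    """Return sources whose individual match percentage makes up a dominant
    share of the overall Similarity Score. Thresholds and data shape are
    illustrative; the judgment call still belongs to the editor."""
    if overall_score <= 0:
        return []
    return [source for source, pct in per_source.items()
            if pct / overall_score >= dominance_ratio]

# Example: a 10% overall score driven almost entirely by one source
flagged = dominant_sources(10.0, {"example-thesis-repository.org": 9.0,
                                  "journal-article-a": 0.6,
                                  "journal-article-b": 0.4})
print(flagged)  # ['example-thesis-repository.org'] -- worth a closer look
```

A screen like this only narrows down which reports to open first; as noted above, the editor still has to read the highlighted passages in context before drawing any conclusions.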
All journals should establish clear definitions for what they deem plagiarism versus acceptable cases of duplicate publication (e.g., preprints, if they allow posting). For more information on how to establish plagiarism detection policies and processes for addressing possible plagiarism, check out this Scholastica blog post and webinar on plagiarism detection best practices.
Answers to FAQs: Interview with Crossref Similarity Check Product Manager Fabienne Michaud
Now, on to similarity report review FAQs! Many thanks to Fabienne for taking the time for this interview!
Q. How should publishers and journals decide what to exclude in similarity reports?
FM: Exclusions are for sections that can make a lot of noise in similarity reports and that publishers may want to disregard across all of their journals, or that editors may choose to disregard on a manuscript-by-manuscript basis at the report level. There are standard exclusion options like citations and quotes. You can also set custom exclusions with iThenticate v2 using standard section templates like acknowledgments and funding or by creating your own, which I think is a great feature to leverage. The options to exclude specified websites from all similarity reports and/or preprints are also new features available only with iThenticate v2. Whichever options you choose, there will obviously be an impact on your similarity score.
A quick note on preprints — if you choose to include them, they will be clearly marked in your similarity report, and, for each source, you’ll be able to see whether it is a preprint or not to quickly determine if there’s a concern of plagiarism or if the preprint is the source of the manuscript received. Not all preprints are currently indexed as preprints in Turnitin’s content database, but most are indexed and can be compared against.
Editors are using exclusions in different ways and for different reasons based on their needs. For instance, some want to exclude affiliations from similarity reports. On the other hand, some people want to keep affiliations in because that can help editors identify possible paper mill activity. For example, if you’re looking at the affiliations of all of the authors and they’re all from different countries and interested in completely different subjects, it raises some alarm bells. You can make these changes across the board for all your journals, but you can also work with exclusions at the report level. So the editor themself can make those changes as they see fit.
You can also decide which repositories you want to search. Most people will want to search across all repositories. As mentioned previously, there are several repositories. Turnitin has a handy document which details how each can help identify different types of plagiarism and how often internet sources are crawled. The private repository tends not to be used by large publishers, but small and medium publishers may find that comparing incoming manuscripts with previously received ones is of real benefit.
So the functionality is there for people to do what best fits their needs. It’s ultimately a publisher and editorial decision. Whatever the publisher and editors choose, the main thing is to establish clear policies and procedures for plagiarism detection.
Q. How common is paper mill activity, and are journals in some disciplines more susceptible?
FM: It’s difficult for me to say how frequent it is, but it doesn’t just affect biomedical science. There have been instances in other disciplines, such as engineering, computer science, etc. I speak to many publishers, and they all deal with these kinds of issues, but they’re all affected at different rates and in different ways.
I can say that paper mills are a recurring theme in all of the Similarity Check Advisory Group meetings. We’re regularly discussing the development of paper mills and signs of paper mill activity that we’re trying to address with iThenticate and at Crossref on a wider scale via our Integrity of the Scholarly Record initiatives.
Q. What signals should editors look for when reviewing similarity reports to spot possible plagiarism versus acceptable text overlap?
FM: Editors need to make a judgment as to whether there is plagiarism or not. The software does not detect plagiarism, it highlights similarities. I think editors should focus on the amount of overlap between the similarity report they’re reviewing and the sources. So the editors need to look at the actual highlighted blocks of text and compare them to the sources. Viewing the text in context is the best way to establish whether there is plagiarism.
In instances of overlap, you want to look at how many words are highlighted by the similarity report and then make an informed decision about whether that requires investigation. Just because something gets highlighted in a report doesn’t mean it’s plagiarism. For example, the author may have forgotten to add quotes. If a chunk of text highlighted in the similarity report is actually referenced in the bibliography, the solution may be as simple as the editor telling the author to add quotations.
Q. Is it OK for journals to set thresholds for what they deem to be concerning similarity scores, and is there an acceptable percentage of similarity?
FM: Where a section of a submission’s content is similar or identical to one or more sources, it will be flagged for review. That doesn’t automatically mean plagiarism or self-plagiarism (text recycling) has occurred, however — just similarity.
I think the first piece of advice I’d like to give is that decisions should not be based on a similarity score alone — basically, not to rely entirely on an algorithm. Manuscripts should not be systematically dismissed because the score is above a certain threshold. But I know this might be difficult for some publishers due to the large number of manuscripts they receive.
A high degree of overlap may indicate a well-researched document with many references to existing work. And as long as those sources are quoted and referenced correctly, that is acceptable. Equally, a low score can be suspicious and mean a high level of paraphrasing — often with the intent to deceive — which iThenticate cannot currently detect.
Q. Do you have any advice for ways editors can review similarity reports faster while still doing their due diligence?
FM: By using the latest functionality within iThenticate v2, I think editors will save a lot of time. For example, with the new red flag feature, the software can detect characters hidden in white text to alert editors to authors trying to cheat the system. Previously, it would have missed that, because if something was disrupting a block of text, the system would not understand there was continuity and similarity. With the new red flag feature, iThenticate v2 can also detect when authors have replaced Latin characters with Greek or Cyrillic characters, for instance. Sometimes there is a clear intention to deceive, but sometimes it is simply human error. So, again, the editors need to decide because the situation might not be immediately clear-cut.
It’s really about making use of all of the feature options. People can cut as much of what they consider “noise” as they want to see only what they need. They can exclude bibliographies and things that are within quotation marks. They can exclude citations, abstracts, methods and materials sections, and even small matches where there’s a similarity of eight words or fewer. They can also increase small match parameters if needed. And now, with v2, they can exclude preprint sources. So if they make use of these exclusions, or at least factor the types of text overlap they deem to be more or less concerning into their similarity report assessment processes, I think they’re sure to be efficient.
Q. Should journals ever share similarity report findings with authors?
FM: Of everyone I have spoken to, nobody shares the full similarity report with the authors. And I don’t think that’s advisable because you don’t want to give ammunition to fraudsters and paper mills. I think publishers are careful about what information they share about the tools they’re using. They know the more information they give authors, the more resources and tools paper mills will have to manipulate the publication process.
Instead, publishers tend to provide authors with general comments. So if some sections of a paper are heavily paraphrased or exhibit text recycling (i.e., an author reusing their own words), then the publisher will note where there are concerns and ask the author to make updates. Publishers may copy and paste parts of the similarity report to highlight specific sections as needed. So, for instance, if somebody has forgotten to include quotation marks somewhere, that’s easy: you can select that set of sentences and include it in an email to the author. Another difference between the similarity reports in iThenticate v2 and v1 is that the v2 report can be copied and pasted, which wasn’t possible before. So that’s an advantage in those situations.
Q. What guidelines should publishers and journals look to when developing plagiarism detection policies and processes?
FM: They should look to the latest guidance from the Committee on Publication Ethics (COPE), which offers highly actionable and relevant case studies and tools. For example, COPE has a great flowchart on how to handle possible plagiarism in a submitted manuscript.
Another place to go is the Text Recycling Research Project out of Duke University, with funding from the National Science Foundation. That has a lot of helpful information on defining and recognizing different forms of text recycling or self-plagiarism, which can help publishers and journals decide on policies for when recycling is and isn’t a concern for them. The recommendations are evolving, and these are great resources to help publishers keep up.
Thanks again to Fabienne for taking the time for this interview! Have a question we didn’t cover here? Feel free to post it in the comments section below or tweet us at @scholasticahq.