Hunting for plagiarism: a case study


By Fifelfoo

I asked Fifelfoo to review one of the articles I authored, South American dreadnought race, for a Milhist A-class review. He found two minor instances of similar wording, which he listed at the article's review page, but I did not realize how in-depth he had gone until he posted a detailed summary on his talk page. He gave me permission to post it in this op-ed [1], and I hope it will give reviewers ideas on what to look for, and how, when checking for plagiarized text as they go through articles at GAN, A-class, or FAC. Ed




Ed. note: all text below was written by Fifelfoo in these two edits

I reviewed South American dreadnought race for MILHIST A Class. To check for copyright violations, I first determined how I should approach the issue:

  1. Automated testing and individual source manual verification
  2. Full text reading for plagiarism through analysis of style changes, exceptional turns of phrase, etc. and individual source manual verification

revision reviewed

First, whichever technique you use, read the article history and look at the change logs. This article is mostly one editor's work (so ad hoc plagiarism is unlikely), so next I look at the first version of the page.

The revision history shows the article grew at a steady pace in byte count for 14 days, then levelled off (obviously undergoing copyediting!). This is a good sign. Steady growth indicates normal authorial work. So do sudden spurts of growth, provided the spurts are all of similar size. Sudden increases in size which are out of pattern can indicate things being lifted whole.
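The growth-pattern check above can be sketched in a few lines of Python. This is only an illustration, not a tool used in the review: the function name `flag_unusual_jumps`, the `factor` threshold, and the byte counts in the usage note are all hypothetical, and "out of pattern" is approximated here as "much larger than the median increase".

```python
from statistics import median


def flag_unusual_jumps(sizes, factor=3.0):
    """Return indices of revisions whose size increase looks out of pattern.

    sizes:  article byte counts, oldest revision first.
    factor: hypothetical threshold; an increase is flagged when it is more
            than `factor` times the median positive increase.
    """
    # Change in byte count between each pair of consecutive revisions.
    deltas = [later - earlier for earlier, later in zip(sizes, sizes[1:])]
    # Only growth matters here; shrinking edits are ignored.
    growth = [d for d in deltas if d > 0]
    if not growth:
        return []
    typical = median(growth)
    # deltas[i] is the change introduced by revision i + 1.
    return [i + 1 for i, d in enumerate(deltas) if d > factor * typical]
```

For made-up byte counts such as `[1000, 1500, 2100, 2600, 9000, 9500, 9400]`, the function flags only the revision that added 6,400 bytes at once. A flagged revision is a prompt to read that diff by hand, nothing more.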

Finally, glancing over the final revision for review:

To conduct automated testing I ran http://toolserver.org/~earwig/cgi-bin/copyvio.py and http://en.wikipedia.org/wiki/User:CorenSearchBot/manual on the article. Both Earwig and CorenBot showed clear. This is only the start, however, as close paraphrase and "do the sources support their conclusions" need to be checked using this method. Because full text reading leads into individual source checking, I'll deal with full text reading next.

Full text reading is the process of closely reading the style and expression of a work, looking for jarring changes in style, very unusual verbs, verbal clauses, or adjectival constructions, material worded far more poorly than average, material worded much better than average, and passages which appear to have an academic or journalistic (etc.) rather than encyclopaedic style. So I started at the top of South American dreadnought race (SAdr). For example:

By this stage I've determined the editor's own prose style, and have read the rest of the article, not noticing any sudden stylistic changes. Thus I need to move to spot checking.

Spot checking relies on picking sources, footnotes, or sentences which are likely to be close paraphrase:

Saying this doesn't mean that editors acting in such ways are plagiarising, but these are the signs which I have found when dealing with Humanities encyclopaedia articles. When I see these signs, I concentrate spot checking on sources and sentences which display this behaviour. If a source is particularly relied upon in this way, I check every usage of that source.

Consider, "The United States' Fore River Shipbuilding Company tendered the lowest bid—in part due to the high availability of cheap steel—and was awarded the contract.[32]" Endnote 32: "Livermore, "Battleship Diplomacy," 39." Bibliography: "Livermore, Seward W. "Battleship Diplomacy in South America: 1905–1925." The Journal of Modern History 16: no. 1 (1944), 31–44. JSTOR 1870986. ISSN 0022-2801. OCLC 62219150."

I then repeat this for every citation of Livermore.

So I reported it in my review:

Now that I've found close paraphrase I need to overcome my sadness and run a detailed check of Livermore, thorough and suspicious.

Livermore was cited 20 times. On two occasions close paraphrase occurred. In both cases it was where a single sentence in the text displayed the same information that a single sentence in the source displayed. In both cases the verb clause remained identical. In both cases the order of presentation was sufficiently similar. This appears to be accidental close paraphrase, and not a matter of style or habit for the editor. The editor's work is fantastic, but they need to heed WP:Close paraphrase on internalisation and re-expression.
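The two signals described above, an identical run of words and a similar order of presentation, can be roughly approximated in Python. This is a sketch under stated assumptions, not how the review was done (that was manual reading): shared word trigrams stand in for "the verb clause remained identical", and `difflib.SequenceMatcher` over word lists stands in for "order of presentation". The helper names `words`, `shared_ngrams`, and `order_similarity` are hypothetical.

```python
import re
from difflib import SequenceMatcher


def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_ngrams(article_sentence, source_sentence, n=3):
    """Return the set of word n-grams that appear in both sentences."""
    wa, wb = words(article_sentence), words(source_sentence)
    grams_a = {tuple(wa[i:i + n]) for i in range(len(wa) - n + 1)}
    grams_b = {tuple(wb[i:i + n]) for i in range(len(wb) - n + 1)}
    return grams_a & grams_b


def order_similarity(article_sentence, source_sentence):
    """Score from 0.0 to 1.0 for how closely the word sequences line up."""
    return SequenceMatcher(
        None, words(article_sentence), words(source_sentence)
    ).ratio()
```

With two invented sentences (not taken from the article or from Livermore), `shared_ngrams("The committee rejected the first proposal and approved the second.", "After long debate the committee rejected the first proposal and then approved the revised one.")` returns the overlapping three-word runs, and `order_similarity` on the same pair gives a high score. Any threshold for "too close" would be a judgment call; a high score only marks a sentence pair for manual comparison against the source.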

The first close paraphrase appeared in the initial article version; the second must have appeared later. This clearly indicates that these were natural slip-ups and not a matter of fundamentally bad habits.

I then repeat this for two or three other sources. I chose to check the "weakest" sources, because looking through Livermore had exhausted me and the style review had passed clearly.

I saved my report, and let the author know that they need to watch when they write single article sentences from single source sentences.

It is also obvious that we need automated tools which identify verb clauses and extensively search Google Scholar.

It took me (while writing this up) 90 minutes to run the automated tests, style read, search Google and Google Scholar for distinctive turns of phrase, and close read Livermore and two web citations. I estimate that documenting the process accounted for about 25-50% of that time. I estimate I spent 60 minutes reading Livermore closely. Thus, I'd estimate the cost of spot checking a FACable article to be approximately 60 minutes.

Fifelfoo, who wrote the bulk of this article, is an Australian Wikipedian with a keen interest in the history of Hungary. Ed, who only wrote an introduction because Fifelfoo is on a wikibreak, is an American university student who has a borderline obsession with early 20th-century battleships.