The Signpost

Essay

Machine-written articles: a new challenge for Wikipedia


Interview with ChatGPT

Artificial Intelligence.

This interaction with ChatGPT was recorded on January 29, 2023

Author: Are any Wikipedia articles written by ChatGPT?

ChatGPT: No, ChatGPT is not capable of writing articles on Wikipedia. It can only generate text based on the input it receives and its training data.

Author: What should I do with a Wikipedia article someone wrote using ChatGPT?

ChatGPT: Articles generated by language models like ChatGPT are not suitable for use on Wikipedia as they are not written by humans and lack context and accuracy. If you have come across such an article on Wikipedia, you should report it to the Wikipedia community for review and potential removal, as Wikipedia has strict policies regarding the sources and reliability of the information on the platform.

My Experience

Near the end of January, I ran across some odd-sounding articles in draftspace. All the articles had the same structure and layout, even though they were about different things. The articles also had impeccable English, though the author's comments did not. Suspecting something was up, I found an artificial intelligence (AI) article detector that then pronounced all these articles machine-generated with greater than 99.9% confidence. Not knowing what to do, I went to the calm environ of the Administrators' noticeboard for incidents. That discussion is worth reading, as it shows a bunch of knowledgeable Wikipedians struggling and debating how to deal with these articles. Were they hoaxes? Copyright violations? Good for publication? How does generated text fit within Wikipedia's requirements for articles?

Policies?

As it turns out, an effort was already underway to develop a policy regarding articles written by ChatGPT and its relatives: Wikipedia:Large language models. There, and in its associated talk page, you can see the reasoning related to these articles. In short, AI-generated text is not reliably correct, may not have a neutral point of view, needs verification, can occasionally violate copyright, and can downright lie. This is all in its inherent nature. It is fed information from a large corpus of text, much of which would not meet Wikipedia's sourcing and neutrality criteria, and it synthesizes its output without regard as to whether the text maps to a real source. To quote the ChatGPT general FAQ: "These models were trained on vast amounts of data from the internet written by humans, including conversations, so the responses it provides may sound human-like. It is important to keep in mind that this is a direct result of the system's design (i.e. maximizing the similarity between outputs and the dataset the models were trained on) and that such outputs may be inaccurate, untruthful, and otherwise misleading at times."

Finding More

I then started going through recent drafts and new articles looking for text reminiscent of the text I had seen in the first articles I identified. It didn't take long to find more. The current limiting factor is that I don't have the tools for rapidly reviewing Drafts as I do for new articles as a new page patroller, and I don't have the resources (including time and patience) needed to do this consistently and daily.

I've put the {{AI-generated}} template on those articles, and not one author has disagreed with the finding. You can search for the template with hastemplate:AI-generated in the Wikipedia search box. Expand the search to Drafts to see the drafts so marked. About sixty articles have been tagged. Several of the previously tagged articles have either been deleted or de-tagged once the generated text was replaced with real text, but many remain. You can then read those remaining examples and get a feel for AI-generated articles.
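The same search can also be run outside the search box. The following is only a minimal sketch, assuming the standard MediaWiki search API on English Wikipedia and its usual namespace numbers (0 for articles, 118 for Draft); the function name build_tagged_search_url is hypothetical. It only builds the request URL and does not contact the site.

```python
from urllib.parse import urlencode

# Assumed English Wikipedia namespace numbers: 0 = articles, 118 = Draft.
ARTICLE_NS = 0
DRAFT_NS = 118


def build_tagged_search_url(include_drafts=True, limit=50):
    """Build a search API URL for pages carrying the AI-generated template.

    Hypothetical helper for illustration; it returns a URL string only.
    """
    namespaces = [ARTICLE_NS, DRAFT_NS] if include_drafts else [ARTICLE_NS]
    params = {
        "action": "query",
        "list": "search",
        # Same keyword as typed into the Wikipedia search box.
        "srsearch": "hastemplate:AI-generated",
        "srnamespace": "|".join(str(ns) for ns in namespaces),
        "srlimit": limit,
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)


if __name__ == "__main__":
    print(build_tagged_search_url())
```

Opening the printed URL in a browser should return the matching pages as JSON, assuming the API parameters behave as described above.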

Editor's note: If you are reading this in the future and none of these are available, some representative drafts have been retained more permanently at User:JPxG/LLM dungeon.

I have been conservative in identifying articles: I only tested articles that had a similar appearance, and I only tagged those that the detector scored as machine-generated with greater than 99% confidence. I am sure I've missed many more articles. I was looking for typical phrases like "in conclusion" starting the last paragraph, use of the article's title repetitively without abbreviation or variation, and consistent sentence and paragraph length. A more sophisticated AI user would use better prompts to the AI software and produce harder-to-detect output; the ones I've found typically were produced by asking ChatGPT something like "Write a Wikipedia article about XXXX."
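The three surface cues mentioned above can be roughly approximated in code. This is an illustrative sketch only, not the procedure I actually followed (which was reading the drafts by eye); the function names and the choice of a coefficient of variation for "consistent sentence length" are assumptions, and none of these signals proves anything on its own.

```python
import re
from statistics import mean, pstdev


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stylistic_signals(title, text):
    """Return rough pre-screening cues for one draft (hypothetical helper)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    lengths = [len(s.split()) for s in split_sentences(text)]
    avg = mean(lengths) if lengths else 0.0
    return {
        # Does the last paragraph open with the stock phrase?
        "ends_with_in_conclusion": bool(paragraphs)
        and paragraphs[-1].lower().startswith("in conclusion"),
        # How often is the full title repeated verbatim?
        "title_mentions": text.lower().count(title.lower()),
        # Low values mean sentence lengths are very uniform.
        "sentence_length_cv": (pstdev(lengths) / avg) if avg else 0.0,
    }
```

A draft that scores high on several cues is only a candidate for closer reading and testing, not a confirmed case.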

I test for articles typically using https://openai-openai-detector.hf.space/, though other sites exist, including https://detector.dng.ai/, https://gptzero.me/, https://platform.openai.com/ai-text-classifier, and https://contentatscale.ai/ai-content-detector/. Before testing, I remove headings, inline references, and other text and markup that appears to have been added after text generation, as those can confuse the analyzer.
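The clean-up step before testing can be sketched as follows. This is a simplified illustration under stated assumptions: it handles only common wikitext patterns (section headings, ref tags, wikilinks, bold and italic quote marks), ignores templates and tables, and the function name strip_for_detection is hypothetical. In practice I did this by hand.

```python
import re


def strip_for_detection(wikitext):
    """Remove headings, inline references and simple markup (rough sketch)."""
    # Self-closing reference tags such as <ref name="a" />.
    text = re.sub(r"<ref\b[^>]*/>", "", wikitext, flags=re.IGNORECASE)
    # Paired reference tags together with their contents.
    text = re.sub(r"<ref\b[^>]*>.*?</ref>", "", text,
                  flags=re.IGNORECASE | re.DOTALL)
    # Section headings such as "== History ==".
    text = re.sub(r"^=+.*=+[ \t]*$", "", text, flags=re.MULTILINE)
    # Wikilinks: keep the label of [[target|label]] or the target of [[target]].
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Bold and italic quote marks.
    text = re.sub(r"'{2,}", "", text)
    # Collapse the blank lines left behind.
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```

The remaining plain prose is what would be pasted into a detector; anything this sketch misses would still need to be removed manually.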

Problem?

Is this a real problem? I believe it is. Many of the articles sound reasonable but may have serious errors. The conversation at the administrators' noticeboard includes an analysis of a generated article on geckos. The article contains a lot of specific plausible-sounding information (e.g., size range), much of which is wrong.

ChatGPT will even provide references if requested, but those references are synthesized from its input text and, while sounding correct, usually do not point to real articles. For example, when I asked it for references on an article ChatGPT wrote for me on Sabethes cyaneus (a mosquito), one of the references was "Sabethes cyaneus" (Encyclopedia of Life): https://eol.org/pages/133674. That page does exist, but is for Clavaria flavopurpurea, a fungus. Another reference it provided also had a link, but the link pointed to an article about a mink, and the reference itself was fictional.

There are efforts to improve these programs, and I am sure that eventually they will be successful. Currently, however, those efforts fall well short. One such effort, "Elicit" (elicit.org), only searches research papers and summarizes them. I asked Elicit "What are the characteristics of Sabethes cyaneus?" It summarized one reference as "Sabethes cyaneus is a species of frog."

Producing these false but plausible answers is an inherent property of the current models and is called "hallucination".

Even once these programs improve, there will still be significant concerns limiting the direct use of generated text, such as:

  1. Did the program only use sources acceptable to Wikipedia?
  2. Is the information up to date?
  3. Can the program identify correct references? (One of the biggest hurdles, as the inherent nature of current models is that the output is not linked to a specific source.)
  4. Will the program avoid hallucinating?

However, programs like ChatGPT are great for generating ideas for articles and helping to "mock up" a good article, if the user takes a sophisticated view of the output, using it more for inspiration than for a source of truth. The proposed policy Wikipedia:Large language models has additional information on how these programs can be used to improve Wikipedia.


Discuss this story

  • I've found that ChatGPT can be good at writing articles. I asked it to write an article on The Crew Motorfest (out of curiosity, not to actually write the article as it had already been written and published) and the article came out pretty decent with only 1 inaccuracy I found. I asked it to source it and it came up with BS sources however so it isn't perfect. ― Blaze WolfTalkBlaze Wolf#6545 18:15, 20 February 2023 (UTC)
    Also, mind posting a link to the AI detector you used? ― Blaze WolfTalkBlaze Wolf#6545 18:16, 20 February 2023 (UTC)
The links are in the "Finding More" section; the primary one is https://openai-openai-detector.hf.space/ Bri (talk) 18:21, 20 February 2023 (UTC)
Thanks! I didn't see a finding more section but I could just be dumb. ― Blaze WolfTalkBlaze Wolf#6545 18:23, 20 February 2023 (UTC)
ChatGPT creates plausible-sounding bullshit. In cases where it has a lot of very similar sources to draw from, such as mostly-empty space-filler articles about an upcoming racing video game (for which it would have about a thousand examples) it can generate something low on nonsense. For something more unique, the bullshit quota is higher. In all cases, though, you can't tell what's bullshit without checking it line by line, because it's all plausible-sounding. Similarly, the sources will always be nonsense, because it isn't generating text based on specific sources, it's generating plausible-sounding reference text bullshit, with no connection to anything. --PresN 19:29, 20 February 2023 (UTC)
Yes I'm not trying to argue that we should be using ChatGPT (because frankly no one should), simply that it isn't 100% bad all of the time. ― Blaze WolfTalkBlaze Wolf#6545 19:31, 20 February 2023 (UTC)
In fact I have encountered situations where it likes to hallucinate (I asked it a few things regarding Splatoon and it kept thinking the special gauge was the amount of ink the weapon had which is not true whatsoever) no matter what I tell it. ― Blaze WolfTalkBlaze Wolf#6545 19:33, 20 February 2023 (UTC)
One of the data sources for ChatGPT is Wikipedia, so if you ask it to write about something already in Wikipedia, there’s a likelihood that it will select correct information for its output. -- rsjaffe 🗣️ 22:24, 20 February 2023 (UTC)
WP:Randy in Boise can also make good contributions most of the time, but the few times he's wrong still make him a net negative. AI seems to be a long way from getting past this level of ability. Daß Wölf 20:24, 24 February 2023 (UTC)

"I test for articles typically using https://openai-openai-detector.hf.space/" - this and various other currently available "ChatGPT detectors" (including OpenAI's own) are highly unreliable. https://openai-openai-detector.hf.space/ actually already says on the tin that it is a detector for GPT-2 (released in 2019 and very different from ChatGPT). Given the article's focus on the dangers of misinformation, it's a bit sad and ironic that the Signpost is itself providing such dubious recommendations here without any caveats.

Regards, HaeB (talk) 11:12, 21 February 2023 (UTC)

The article glosses over a lot of the issues regarding detection. It was just a brief intro. I emphasized in the article that I was using a very insensitive method of finding LLM-generated text. There were a couple of reasons I went about things as described there (and to note: I no longer rely solely on the GPT-2 detector). 1) at the time I started, other detectors available were very opaque as to how they were constructed; 2) the nature of the output, even though the models are different, has many similar characteristics, so a GPT-2 detector would have some sensitivity and specificity; 3) I intentionally minimized false positives as those irritate article contributors, by doing a vigorous pre-screen of the text. As to point two, note that at least one of the recommended detectors (https://gptzero.me/) is not based on the GPT model, but rather on the text output characteristics. As to point three, I used the authors' feedback as an indicator of the false positive rate: getting no complaints after a lot of tags is a decent indicator that the false positive rate is low. -- rsjaffe 🗣️ 18:53, 21 February 2023 (UTC)
Good to hear that you are proceeding diligently when patrolling new articles (and to be clear, this is very important work and it's good to call attention to this issue). But the part with the tool recommendations was not including any caveats about false positives, and should not have been published in this form.
"the nature of the output, even though the models are different, has many similar characteristics, so a GPT-2 detector would have some sensitivity and specificity" - what research is this claim based on? (I mean, of course any detection method has "some sensitivity and specificity", the question is whether they are good enough.)
"is not based on the GPT model, but rather on the text output characteristics" - it seems that there is some fundamental confusion here between the model that is doing the detection and the model whose output is being detected (and/or the features of its output). https://openai-openai-detector.hf.space/ is also not using "the GPT model" (there are many actually) to detect GPT-2 output, but RoBERTa instead.
Regards, HaeB (talk) 06:57, 24 February 2023 (UTC)
  • I'm not convinced that this is a problem that isn't already addressed by the vast number of policies and guidelines on this site. If artificial intelligence ever becomes capable of generating Wikipedia articles that are verifiable, written in a neutral point of view and devoid of original research, then I'm all for it. Until then, the usual system of separating cruft from quality will continue. It's possible that garbage will be generated faster than ever, but that seems like a technical issue rather than a policy one. ~TPW 19:46, 21 February 2023 (UTC)
    Read the proposed policy. I'd look at the LLM policy more as an explanation as to how text generation fits into current policies rather than setting new precedent. The problem is that most people do not understand the policy issues raised by LLMs. The proposed policy explains them. -- rsjaffe 🗣️ 20:29, 21 February 2023 (UTC)
  • If AI is integrated into Wikipedia to write articles and whatnot, we should have some sort of Pending Changes Protection when AI is used so we could double-check the accuracy of the article(s). Helloheart 00:49, 22 February 2023 (UTC)
    If robots can reliably write articles of ordinary WP quality, then there's no need for WP. People who want to know something can just ask the robot and get an answer tailored to the asker's known preferences and knowledge. Jim.henderson (talk) 20:46, 22 February 2023 (UTC)
    Wouldn't that make for a bubble, though? Here's an article to prove whatever you already believe! Adam Cuerden (talk) Has about 8.2% of all FPs. Currently celebrating his 600th FP! 03:14, 23 February 2023 (UTC)
    How do I know I'm smart? My computer friend always tells me I'm right, that's how. Tyrants have suffered bad advice from ego-stroking yes-men forever; now everyone can be Ethylred the Unready. Jim.henderson (talk) 02:49, 26 February 2023 (UTC)
  • Hallucination is a major problem with these models, anyone not verifying each and everything they are using AI tools for should be immediately sanctioned. There was an interesting discussion to that effect on Villagepump I believe, even before these tools came into vogue. Gotitbro (talk) 13:10, 1 March 2023 (UTC)
  • Good and timely article. I also tried out ChatGPT to see what it can do. It's documented on this page. My overall conclusion was that "it appears that us Wikipedia volunteers aren't out of a job just yet". Schwede66 22:23, 7 March 2023 (UTC)
  • After reading another article about how ChatGPT is coming for all our jobs, I signed up and asked it (3.5 I presume) "how many neutrons are in a liter of water" (yes, it insisted on that spelling of litre). It wrote out a four or five paragraph reply explaining exactly how it arrived at the figure of 556 neutrons. Typing the same into the integrated engine in Bing resulted in the claim that a litre of water has no neutrons. 4.0 did somewhat better, off by only two orders of magnitude. So I think the bit about "AI-generated text is not reliably correct" needs to be bolded. Maury Markowitz (talk) 20:34, 18 March 2023 (UTC)