I fact-checked ChatGPT with Bard, Claude, and Copilot – and this AI was the most confidently incorrect

Abstract AI room with colorful lights on the walls

marian/Getty Images

Generative artificial intelligence (AI) is notoriously prone to factual errors. So, what do you do when you’ve asked ChatGPT to generate 150 presumed facts and you don’t want to spend an entire weekend confirming each by hand?

Also: AI in 2023: A year of breakthroughs that left no human thing unchanged

Well, in my case, I turned to other AIs. In this article, I’ll explain the project, consider how each AI performed in a fact-checking showdown, and provide some final thoughts and cautions if you also want to venture down this maze of twisty little passages, all alike.

The project

Last week, we published a very fun project where we had DALL-E 3, running inside ChatGPT, generate 50 picturesque images that it thought represented each US state. I also had ChatGPT list “the three most interesting facts you know about the state”. The results were, as my editor put it in the article’s title, “gloriously strange”.

ChatGPT put the Golden Gate Bridge somewhere in Canada. The tool put Lady Liberty both in the US Midwest and somewhere on Manhattan Island. And it generated two Empire State Buildings. In short, ChatGPT got its abstract expressionism funk on, but the results were pretty cool.

Also: I asked DALL-E 3 to create a portrait of every US state, and the results were gloriously strange

As for the individual facts, they were mostly on target. I’m pretty good with US geography and history, and few of ChatGPT’s generated facts stood out to me as wildly wrong. But I didn’t do any independent fact-checking. I just read the results over and pronounced them good enough.

But what if we really want to know the accuracy of those 150 fact bullets? That kind of question seems like an ideal project for an AI.

Methodology

So here’s the thing. If GPT-4, the OpenAI large language model (LLM) used by ChatGPT Plus, generated the fact statements, I wasn’t entirely convinced it should be checking them. That’s like asking high school students to write a history paper without using any references, and then self-correct their work. They’re already starting with suspect information — and then you’re letting them correct themselves? No, that doesn’t sound right to me.

Also: Two breakthroughs made 2023 tech’s most innovative year in over a decade

But what if we fed those facts to other LLMs inside other AIs? Both Google’s Bard and Anthropic’s Claude have their own LLMs. Bing uses GPT-4, but I figured I’d test its responses just to be a completionist.

As you’ll see, I got the best feedback from Bard, so I fed its responses back into ChatGPT in a round-robin perversion of the natural order of the universe. It was a cool project.
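For the record, I ran all of these checks by hand in each chatbot’s web interface. But if you wanted to script this kind of cross-check, the pattern is straightforward: send the same fact list and the same verification prompt to more than one model, then compare what comes back. Here’s a minimal Python sketch of that idea, assuming you have OpenAI and Anthropic API keys set in your environment. The model names and the facts file are illustrative placeholders, not what I actually used.

```python
# Minimal sketch: ask two different LLMs to fact-check the same list of state facts.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set in the environment.
from openai import OpenAI
import anthropic

FACT_CHECK_PROMPT = (
    "The following text contains state names followed by three facts for each state. "
    "Please examine the facts and identify any that are in error for that state.\n\n"
)

def check_with_gpt4(facts: str) -> str:
    """Send the fact list to OpenAI's GPT-4 and return its critique."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; use whatever model you have access to
        messages=[{"role": "user", "content": FACT_CHECK_PROMPT + facts}],
    )
    return response.choices[0].message.content

def check_with_claude(facts: str) -> str:
    """Send the same fact list to Anthropic's Claude and return its critique."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-2.1",  # illustrative model name
        max_tokens=1024,
        messages=[{"role": "user", "content": FACT_CHECK_PROMPT + facts}],
    )
    return response.content[0].text

if __name__ == "__main__":
    # Hypothetical text file holding the 150 state facts.
    with open("state_facts.txt") as f:
        facts = f.read()
    print("GPT-4 says:\n", check_with_gpt4(facts))
    print("\nClaude says:\n", check_with_claude(facts))
```

The comparison step is still on you, of course. And, as you’ll see below, the models disagree with each other almost as readily as they disagree with reality.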

Anthropic Claude

Claude uses the Claude 2 LLM, which is also used inside of Notion’s AI implementation. Claude allowed me to feed it a PDF containing the full set of facts (without the pictures). Here’s what I got back:

anthropic-claude

Screenshot by David Gewirtz/ZDNET

On the whole, Claude found the fact list to be mostly accurate, but it offered clarifications on three items. I had limited how long the ChatGPT facts could be, and that limit squeezed much of the nuance out of the fact descriptions. Claude’s fact check took issue with some of that missing nuance.

Overall, it was an encouraging response.

Copilot… or nopilot?

Then we get to Microsoft’s Copilot, the renamed Bing Chat AI. Copilot doesn’t allow PDFs to be uploaded, so I tried pasting in the text from all 50 state facts. This approach failed immediately, because Copilot only accepts prompts up to 2,000 characters:

ms-copilot-limits

Screenshot by David Gewirtz/ZDNET
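If I’d been scripting this rather than pasting by hand, one workaround would be to split the fact list into per-state chunks and keep each prompt under the 2,000-character cap. Here’s a rough Python sketch of that approach; the assumption that each state’s facts sit in their own paragraph, separated by blank lines, is mine.

```python
# Sketch: split a long fact list into prompt-sized chunks for a 2,000-character limit.
# Assumes each state's name and facts occupy one paragraph, separated by blank lines.
MAX_PROMPT_CHARS = 2000
INSTRUCTION = (
    "The following text contains state names followed by three facts for each state. "
    "Please examine the facts and identify any that are in error for that state.\n\n"
)

def chunk_facts(text: str, limit: int = MAX_PROMPT_CHARS) -> list[str]:
    """Group whole-state paragraphs into prompts that stay under the character limit."""
    budget = limit - len(INSTRUCTION)
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = (current + "\n\n" + paragraph).strip()
        if len(candidate) > budget and current:
            # Close out the current chunk and start a new one with this state.
            chunks.append(INSTRUCTION + current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(INSTRUCTION + current)
    return chunks
```

Each chunk would then go in as its own prompt, with the responses stitched together afterward. For this project, though, I stayed in the chat window.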

I asked Copilot the following:

The following text contains state names followed by three facts for each state. Please examine the facts and identify any that are in error for that state

Here’s what I got back:

copilot3

Screenshot by David Gewirtz/ZDNET

It pretty much repeated the fact data I asked it to check. So, I tried to guide it with a more forceful prompt:

copilot4

Screenshot by David Gewirtz/ZDNET

Once again, it gave me back the data I asked it to verify. I found this output very odd because Copilot uses the same LLM as ChatGPT. Clearly, Microsoft has tuned it differently than ChatGPT.

I gave up and moved on to Bard.

Bard

Google has just announced its new Gemini LLM. I don’t yet have access to Gemini, so I ran these tests on Google’s PaLM 2 model.

Also: What is Gemini? Everything you should know about Google’s new AI model

Compared to Claude and Copilot, Bard knocked it out of the park, or, to get a bit more Shakespearean about it, it “doth bestride the narrow world like a Colossus.”

Check out the results below:

bard

Screenshot by David Gewirtz/ZDNET

It’s important to note that many state facts aren’t universally agreed upon, even by the states themselves, and others come down to nuance. As I’ll show you in the next section, I fed this list back to ChatGPT, and it found two discrepancies, in the Alaska and Ohio answers.

But there are other misses here. In some ways, Bard overcompensated for the assignment. For example, Bard correctly stated that other states besides Maine produce lobsters. But Maine goes all-in on its lobster production. I’ve never been to another state that has miniature lobster traps as one of the most popular tourist trap trinkets.

Also: I spent a weekend with Amazon’s free AI courses, and highly recommend you do too

Or let’s pick Nevada and Area 51. ChatGPT said, “Top-secret military base, rumored UFO sightings.” Bard tried to correct that, saying, “Area 51 isn’t just rumored to have UFO sightings. It’s a real top-secret military facility, and its purpose is unknown.” They’re saying pretty much the same thing. Bard just missed the nuance that comes from having a tight word limit.

Another place Bard picked on ChatGPT without understanding context was Minnesota. Yes, Wisconsin has a lot of lakes, too. But ChatGPT never claimed Minnesota had the most lakes. It just described Minnesota as the “Land of 10,000 Lakes,” which is one of Minnesota’s most common slogans.

Bard got hung up on Kansas as well. ChatGPT said Kansas is “Home to the geographic center of the contiguous US.” Bard claimed it was South Dakota. And that would be true if you factor in Alaska and Hawaii. But ChatGPT said “contiguous,” and that honor goes to a point near Lebanon, Kansas.

Also: These are the jobs most likely to be taken over by AI

I could go on, and I will in the next section, but you get the point. Bard’s fact-checking seems impressive, but it often misses the point and gets things just as wrong as any other AI.

Before we move on to ChatGPT’s limited fact check of Bard’s fact check, let me point out that most of Bard’s entries were either wrong or wrong-headed. And yet, Google puts its AI answers in front of most search results. Does that concern you? It sure worries me.

Such a wonder, my lords and ladies, is not to be spoken of.

ChatGPT

Right off the top, I could tell Bard got one of its facts wrong: Alaska is far bigger than Texas. So, I thought, let’s see if ChatGPT can fact-check Bard’s fact check. For a moment, I thought this bit of AI tail-chasing might knock the moon out of Earth’s orbit, but then I decided I would risk the entire structure of our universe because I knew you’d want to know what happened.

Here’s what I fed ChatGPT:

chatgpt-query

Screenshot by David Gewirtz/ZDNET

And here’s what ChatGPT said (and, for clarity, the moon did remain in orbit):

chatgpt-cross-check

Screenshot by David Gewirtz/ZDNET

As you can see, ChatGPT took issue with Bard’s erroneous claim that Texas is the biggest state. It also had a bit of a tizzy over Ohio’s claim to be the birthplace of aviation, which is more controversial than most schools teach.

Also: 7 ways to make sure your data is ready for generative AI

It’s commonly accepted that Wilbur and Orville Wright flew the first powered, controlled airplane (at Kitty Hawk, North Carolina), although they built their Wright Flyer in Dayton, Ohio. That said, Sir George Cayley (1804), Henri Giffard (1852), Félix du Temple (1874), Clément Ader (1890), Otto Lilienthal (1891), Samuel Langley (1896), Gustave Whitehead (1901), and Richard Pearse (1902), hailing from New Zealand, the UK, France, Germany, and other parts of the US, all have somewhat legitimate claims to being first in flight.

But we’ll give the point to ChatGPT, because it only has 10 words to make a claim, and Ohio was where the Wright Brothers had their bike shop.

Conclusions and caveats

Let’s get something out of the way upfront: if you’re turning in a paper or a document where you need your facts to be right, do your own fact-checking. Otherwise, your Texas-sized ambitions might get buried under an Alaska-sized problem.

As we saw in our tests, the results (as with Bard) can look quite impressive but still be partially or completely wrong. Overall, it was interesting to ask the various AIs to cross-check each other, and it’s a process I’ll probably explore further, but the results were only conclusive in how inconclusive they were.

Copilot gave up completely, and simply asked to go back to its nap. Claude took issue with the nuance of a few answers. Bard hit hard on a whole slew of answers — but, apparently, to err is not only human, it’s AI as well.

Also: These 5 major tech advances of 2023 were the biggest game-changers

In conclusion, I must quote the real Bard and say, “Confusion now hath made his masterpiece!”

What do you think? What sort of egregious errors have you seen from your favorite AI? Are you content to trust the AIs for facts, or will you now do your own fact-checking? Let us know in the comments below.


You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter on Substack, and follow me on Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
