AI can do a lot of things extremely well. One thing that it can do just okay — which, frankly, is still quite extraordinary — is write college term papers.
That’s the finding from EduRef, a resource for students and educators, which ran an experiment to determine if a deep learning language prediction model known as GPT-3 could get passing marks in an anonymized trial.
“We hired a panel of professors to create a writing prompt, gave it to a group of recent grads and undergraduate-level writers, and fed it to GPT-3 and had the panel grade the anonymous submissions and complete a follow up survey for thoughts about the writers,” according to an EduRef post. The results were a surprising demonstration of the natural-language prowess of AI.
The specific AI — GPT-3, for Generative Pre-trained Transformer 3 — was released in June 2020 by OpenAI, a research company co-founded by Elon Musk. It was designed to generate content with a more human-like language structure than any of its predecessors. Natural language processing has been developing swiftly in the past couple of years, enabling computers to generate text that feels, in many cases, contextually appropriate and passably organic. However, the hurdles for advanced natural language processing remain enormous. According to a 2019 paper from the Allen Institute for Artificial Intelligence, machines fundamentally lack commonsense reasoning — the ability to understand what they're writing. That finding is based on a critical reevaluation of standard tests of commonsense reasoning in machines, such as the Winograd Schema Challenge.
That makes the results of the EduRef experiment all the more striking. The writing prompts spanned a variety of subjects, including U.S. History, Research Methods (Covid-19 Vaccine Efficacy), Creative Writing, and Law. GPT-3 managed a "C" average across four subjects, failing only one assignment. It earned its highest grades, a "B-" on each, for the U.S. History and Law prompts. On the Covid-19 Vaccine Efficacy research paper, GPT-3 scored a "C," outperforming one of the human writers.
Overall, the instructor evaluations suggested that GPT-3's writing was able to mimic human writing in grammar, syntax, and word frequency, although the papers felt somewhat technical. As you might expect, the AI completed the assignments dramatically faster than the human participants: humans took an average of three days from assignment to completion, while GPT-3 took between 3 and 20 minutes.
“Even without being augmented by human interference, GPT-3’s assignments received more or less the same feedback as the human writers,” according to EduRef. “While 49.2% of comments on GPT-3’s work were related to grammar and syntax, 26.2% were about focus and details. Voice and organization were also mentioned, but only 12.3% and 10.8% of the time, respectively. Similarly, our human writers received comments in nearly identical proportions. Almost 50% of comments on the human papers were related to grammar and syntax, with 25.4% related to focus and details. Just over 13% of comments were about the humans’ use of voice, while 10.4% were related to organization.”
Aside from potentially troubling implications for educators, the experiment points to a dawning inflection point for natural language generation, a capability heretofore decidedly human.