Yes, AI Models Can Get Worse over Time

2023-08-09

When OpenAI released its latest text-generating artificial intelligence, the large language model GPT-4, in March, it was very good at identifying prime numbers. When the AI was given a series of 500 such numbers and asked whether they were primes, it correctly labeled them 97.6 percent of the time. But a few months later, in June, the same test yielded very different results. GPT-4 only correctly labeled 2.4 percent of the prime numbers AI researchers prompted it with—a complete reversal in apparent accuracy. The finding underscores the complexity of large artificial intelligence models: instead of AI uniformly improving at every task on a straight trajectory, the reality is much more like a winding road full of speed bumps and detours.
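To make concrete how such a test can be scored, here is a minimal sketch in Python. It assumes a hypothetical query_model function standing in for whatever API call returns the chatbot's yes-or-no answer; the ground truth comes from an ordinary trial-division primality check.

```python
# Minimal sketch of scoring a language model on prime identification.
# `query_model` is a hypothetical stand-in for whatever API call returns
# the model's answer; it is not part of any real SDK.

def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def score_model(numbers: list[int], query_model) -> float:
    """Return the fraction of numbers whose primality the model labels correctly."""
    correct = 0
    for n in numbers:
        answer = query_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = answer.strip().lower().startswith("yes")
        if model_says_prime == is_prime(n):
            correct += 1
    return correct / len(numbers)
```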

The drastic shift in GPT-4’s performance was highlighted in a buzzy preprint study released last month by three computer scientists: two at Stanford University and one at the University of California, Berkeley. The researchers ran tests on both GPT-4 and its predecessor, GPT-3.5, in March and June. They found lots of differences between the two AI models—and also across each one’s output over time. The changes in GPT-4’s behavior over just a few months were particularly striking.

Across two tests, including the prime number trials, the June GPT-4 answers were much less verbose than the March ones. Specifically, the June model became less inclined to explain itself. It also developed new quirks. For instance, it began to append accurate (but potentially disruptive) descriptions to snippets of computer code that the scientists asked it to write. On the other hand, the model seemed to get a little safer; it filtered out more questions and provided fewer potentially offensive responses. For instance, the June version of GPT-4 was less likely to provide a list of ideas for how to make money by breaking the law, offer instructions for how to make an explosive or justify sexism or racism. It was less easily manipulated by the “jailbreak” prompts meant to evade content moderation firewalls. It also seemed to improve slightly at solving a visual reasoning problem.

When the study (which has not yet been peer reviewed) went public, some AI enthusiasts saw it as proof of their own anecdotal observations that GPT-4 was less useful than its earlier version. A handful of headlines posed the question, “Is ChatGPT getting dumber?” Other news reports more definitively declared that, yes, ChatGPT is becoming stupider. Yet both the question and that supposed answer are likely an oversimplification of what’s really going on with generative AI models, says James Zou, an assistant professor of data science at Stanford University and one of the recent study’s co-authors.

“It’s very difficult to say, in general, whether GPT-4 or GPT-3.5 is getting better or worse over time,” Zou explains. After all, “better” is subjective. OpenAI claims that, by the company’s own internal metrics, GPT-4 performs to a higher standard than GPT-3.5 (and earlier versions) on a laundry list of tests. But the company hasn’t released benchmark data on every single update that it has made. An OpenAI spokesperson declined to comment on Zou’s preprint when contacted by Scientific American. The company’s unwillingness to discuss how it develops and trains its large language models, coupled with the inscrutable “black box” nature of AI algorithms, makes it difficult to determine just what might be causing the changes in GPT-4’s performance. All Zou and other researchers outside the company can do is speculate, draw on what their own tests show and extrapolate from their knowledge of other machine-learning tools.

What is already clear is that GPT-4’s behavior is different now than it was when it was first released. Even OpenAI has acknowledged that, when it comes to GPT-4, “while the majority of metrics have improved, there may be some tasks where the performance gets worse,” as employees of the company wrote in a July 20 update to a post on OpenAI’s blog. Past studies of other models have also shown this sort of behavioral shift, or “model drift,” over time. That alone could be a big problem for developers and researchers who’ve come to rely on this AI in their own work.

“People learn how to prompt a model to get the behavior they want out of it,” says Kathy McKeown, a professor of computer science at Columbia University. “When the model changes underneath them, then they [suddenly] have to write prompts in a different way.” Vishal Misra, also a computer science professor at Columbia, agrees. Misra has used GPT to create data interfaces in the past. “You’ll begin to trust a certain kind of behavior, and then the behavior changes without you knowing,” he says. From there, “your whole application that you built on top starts misbehaving.”

So what is causing the AI to change over time? Without human intervention, these models are static. Companies such as OpenAI are constantly seeking to make programs the best they can be (by certain metrics)—but attempted improvements can have unintended consequences.

There are two main factors that determine an AI’s capability and behavior: the many parameters that define a model and the training data that go into refining it. A large language model such as GPT-4 might contain hundreds of billions of parameters meant to guide it. Unlike in a traditional computer program, where each line of code serves a clear purpose, developers of generative AI models often cannot draw an exact one-to-one relationship between a single parameter and a single corresponding trait. This means that modifying the parameters can have unexpected impacts on the AI’s behavior.
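A toy illustration of that point, using a small randomly initialized PyTorch network rather than anything GPT-scale: nudging a single weight shifts the outputs for many unrelated inputs at once, so there is no clean one-parameter, one-behavior mapping.

```python
# Toy illustration (nothing like GPT-4's scale): perturbing one parameter
# of a small network changes its outputs for many unrelated inputs at once.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
print("parameter count:", sum(p.numel() for p in model.parameters()))

inputs = torch.randn(100, 16)            # 100 unrelated "prompts"
with torch.no_grad():
    before = model(inputs)
    model[0].weight[0, 0] += 0.5         # nudge a single weight
    after = model(inputs)

changed = (before - after).abs().sum(dim=1) > 1e-6
print("inputs whose outputs changed:", int(changed.sum()), "of", len(inputs))
```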

Instead of changing parameters directly, after the initial training, developers often put their models through a process they call fine-tuning: they introduce new information, such as feedback from users, to hone the system’s performance. Zou compares fine-tuning an AI to gene editing in biology—AI parameters are analogous to DNA base pairs, and fine-tuning is like introducing mutations. In both processes, making changes to the code or adding training data with one outcome in mind carries the potential for ripple effects elsewhere. Zou and others are researching how to make adjusting big AI models more precise. The goal is to be able to “surgically modify” an AI’s guidelines “without introducing undesirable effects,” Zou says. Yet for now, the best way to do that remains elusive.
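Mechanically, a fine-tuning pass looks something like the bare-bones sketch below, again on a toy PyTorch model rather than a real large language model: a batch of new examples produces a loss, and a single gradient step nudges parameters throughout the network, which is why a fix aimed at one behavior can ripple into others.

```python
# Bare-bones fine-tuning step on a toy model (not how any GPT model is
# actually tuned): a batch of new examples nudges parameters everywhere.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Pretend these are newly collected examples (say, user feedback).
new_inputs = torch.randn(32, 16)
new_labels = torch.randint(0, 2, (32,))

snapshot = [p.detach().clone() for p in model.parameters()]

optimizer.zero_grad()
loss = loss_fn(model(new_inputs), new_labels)
loss.backward()
optimizer.step()

# The update touches parameters throughout the network, not just ones
# "responsible" for the targeted behavior -- the ripple effect described above.
moved = sum((p - s).abs().sum().item()
            for p, s in zip(model.parameters(), snapshot))
print(f"loss {loss.item():.3f}, total parameter movement {moved:.3f}")
```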

In the case of GPT-4, it’s possible that the OpenAI developers were trying to make the tool less prone to offering answers that might be deemed offensive or dangerous. And through prioritizing safety, maybe other capabilities got caught up in the mix, McKeown says. For instance, OpenAI may have used fine-tuning to set new limits on what the model is allowed to say. Such a change might have been intended to prevent the model from sharing undesirable information but inadvertently ended up reducing the AI’s chattiness on the topic of prime numbers. Or perhaps the fine-tuning process introduced new, low-quality training data that reduced the level of detail in GPT-4’s answers on certain mathematical topics.

Regardless of what’s gone on behind the scenes, it seems likely that GPT-4’s actual capacity to identify prime numbers didn’t really change between March and June. It’s quite possible that the large language model—built to probabilistically generate human-sounding strings of text and not to do math—was never really all that good at prime recognition in the first place, says Sayash Kapoor, a computer science Ph.D. candidate at Princeton University.

Instead Kapoor speculates that the shift in prime detection could be an illusion. Through a quirk in the data used to fine-tune the model, developers might have exposed GPT-4 to fewer primes and more composite numbers after March, thus changing its default answer on questions of primeness over time from “yes” to “no.” In both March and June, GPT-4 may not really have been assessing primeness but simply offering the answer that seemed most likely based on incidental trends it absorbed from the data it was fed.
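Kapoor’s point can be seen with simple arithmetic: if the test set consists entirely of primes, accuracy mostly measures the model’s default answer rather than its grasp of primality. The sketch below uses illustrative answer counts chosen to match the study’s reported figures.

```python
# Sketch of Kapoor's point: on a test set made up entirely of primes,
# accuracy mostly reflects the model's default answer. The counts below
# are illustrative, chosen to reproduce the reported 97.6% and 2.4%.

def accuracy_on_all_primes(answers: list[str]) -> float:
    """Every number is prime, so an answer is correct exactly when it is 'yes'."""
    return sum(a == "yes" for a in answers) / len(answers)

march_like = ["yes"] * 488 + ["no"] * 12   # a default that leans "yes"
june_like = ["yes"] * 12 + ["no"] * 488    # a default that flipped to "no"

print(f"March-like run: {accuracy_on_all_primes(march_like):.1%}")  # 97.6%
print(f"June-like run:  {accuracy_on_all_primes(june_like):.1%}")   # 2.4%
```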

Asked if this would be akin to a human developing a bad mental habit, Kapoor rejects the analogy. Sure, neural networks can pick up maladaptive patterns, he says—but there’s no logic behind it. Where a person’s thoughts might fall into a rut because of how we understand and contextualize the world, an AI has no context and no independent understanding. “All that these models have are huge tons of data [meant to define] relationships between different words,” Kapoor says. “It’s just mimicking reasoning, rather than actually performing that reasoning.”
