Yes, AI Models Can Get Worse over Time

2023-08-09

When OpenAI released its latest text-generating artificial intelligence, the large language model GPT-4, in March, it was very good at identifying prime numbers. When the AI was given a series of 500 such numbers and asked whether they were primes, it correctly labeled them 97.6 percent of the time. But a few months later, in June, the same test yielded very different results. GPT-4 only correctly labeled 2.4 percent of the prime numbers AI researchers prompted it with—a complete reversal in apparent accuracy. The finding underscores the complexity of large artificial intelligence models: instead of AI uniformly improving at every task on a straight trajectory, the reality is much more like a winding road full of speed bumps and detours.
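To make concrete how such a test can be scored, here is a minimal sketch in Python. It assumes a hypothetical query_model function standing in for whatever API call returns the chatbot's yes-or-no answer; the ground truth comes from an ordinary trial-division primality check.

```python
# Minimal sketch of scoring a language model on prime identification.
# `query_model` is a hypothetical stand-in for whatever API call returns
# the model's answer; it is not part of any real SDK.

def is_prime(n: int) -> bool:
    """Ground-truth primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def score_model(numbers: list[int], query_model) -> float:
    """Return the fraction of numbers whose primality the model labels correctly."""
    correct = 0
    for n in numbers:
        answer = query_model(f"Is {n} a prime number? Answer yes or no.")
        model_says_prime = answer.strip().lower().startswith("yes")
        if model_says_prime == is_prime(n):
            correct += 1
    return correct / len(numbers)
```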

The drastic shift in GPT-4’s performance was highlighted in a buzzy preprint study released last month by three computer scientists: two at Stanford University and one at the University of California, Berkeley. The researchers ran tests on both GPT-4 and its predecessor, GPT-3.5, in March and June. They found lots of differences between the two AI models—and also across each one’s output over time. The changes in GPT-4’s behavior over just a few months were particularly striking.

Across two tests, including the prime number trials, the June GPT-4 answers were much less verbose than the March ones. Specifically, the June model became less inclined to explain itself. It also developed new quirks. For instance, it began to append accurate (but potentially disruptive) descriptions to snippets of computer code that the scientists asked it to write. On the other hand, the model seemed to get a little safer; it filtered out more questions and provided fewer potentially offensive responses. For instance, the June version of GPT-4 was less likely to provide a list of ideas for how to make money by breaking the law, offer instructions for how to make an explosive or justify sexism or racism. It was less easily manipulated by the “jailbreak” prompts meant to evade content moderation firewalls. It also seemed to improve slightly at solving a visual reasoning problem.

When the study (which has not yet been peer reviewed) went public, some AI enthusiasts saw it as proof of their own anecdotal observations that GPT-4 was less useful than its earlier version. A handful of headlines posed the question, “Is ChatGPT getting dumber?” Other news reports more definitively declared that, yes, ChatGPT is becoming stupider. Yet both the question and that supposed answer are likely an oversimplification of what’s really going on with generative AI models, says James Zou, an assistant professor of data science at Stanford University and one of the recent study’s co-authors.

“It’s very difficult to say, in general, whether GPT-4 or GPT-3.5 is getting better or worse over time,” Zou explains. After all, “better” is subjective. OpenAI claims that, by the company’s own internal metrics, GPT-4 performs to a higher standard than GPT-3.5 (and earlier versions) on a laundry list of tests. But the company hasn’t released benchmark data on every single update that it has made. An OpenAI spokesperson declined to comment on Zou’s preprint when contacted by Scientific American. The company’s unwillingness to discuss how it develops and trains its large language models, coupled with the inscrutable “black box” nature of AI algorithms, makes it difficult to determine just what might be causing the changes in GPT-4’s performance. All Zou and other researchers outside the company can do is speculate, draw on what their own tests show and extrapolate from their knowledge of other machine-learning tools.

What is already clear is that GPT-4’s behavior is different now than it was when it was first released. Even OpenAI has acknowledged that, when it comes to GPT-4, “while the majority of metrics have improved, there may be some tasks where the performance gets worse,” as employees of the company wrote in a July 20 update to a post on OpenAI’s blog. Past studies of other models have also shown this sort of behavioral shift, or “model drift,” over time. That alone could be a big problem for developers and researchers who’ve come to rely on this AI in their own work.

“People learn how to prompt a model to get the behavior they want out of it,” says Kathy McKeown, a professor of computer science at Columbia University. “When the model changes underneath them, then they [suddenly] have to write prompts in a different way.” Vishal Misra, also a computer science professor at Columbia, agrees. Misra has used GPT to create data interfaces in the past. “You’ll begin to trust a certain kind of behavior, and then the behavior changes without you knowing,” he says. From there, “your whole application that you built on top starts misbehaving.”

So what is causing the AI to change over time? Without human intervention, these models are static. Companies such as OpenAI are constantly seeking to make programs the best they can be (by certain metrics)—but attempted improvements can have unintended consequences.

There are two main factors that determine an AI’s capability and behavior: the many parameters that define a model and the training data that go into refining it. A large language model such as GPT-4 might contain hundreds of billions of parameters meant to guide it. Unlike in a traditional computer program, where each line of code serves a clear purpose, developers of generative AI models often cannot draw an exact one-to-one relationship between a single parameter and a single corresponding trait. This means that modifying the parameters can have unexpected impacts on the AI’s behavior.
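A toy illustration of that point, using a small randomly initialized PyTorch network rather than anything GPT-scale: nudging a single weight shifts the outputs for many unrelated inputs at once, so there is no clean one-parameter, one-behavior mapping.

```python
# Toy illustration (nothing like GPT-4's scale): perturbing one parameter
# of a small network changes its outputs for many unrelated inputs at once.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
print("parameter count:", sum(p.numel() for p in model.parameters()))

inputs = torch.randn(100, 16)            # 100 unrelated "prompts"
with torch.no_grad():
    before = model(inputs)
    model[0].weight[0, 0] += 0.5         # nudge a single weight
    after = model(inputs)

changed = (before - after).abs().sum(dim=1) > 1e-6
print("inputs whose outputs changed:", int(changed.sum()), "of", len(inputs))
```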

Instead of changing parameters directly, after the initial training, developers often put their models through a process they call fine-tuning: they introduce new information, such as feedback from users, to hone the system’s performance. Zou compares fine-tuning an AI to gene editing in biology—AI parameters are analogous to DNA base pairs, and fine-tuning is like introducing mutations. In both processes, making changes to the code or adding training data with one outcome in mind carries the potential for ripple effects elsewhere. Zou and others are researching how to make adjusting big AI models more precise. The goal is to be able to “surgically modify” an AI’s guidelines “without introducing undesirable effects,” Zou says. Yet for now, the best way to do that remains elusive.
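Mechanically, a fine-tuning pass looks something like the bare-bones sketch below, again on a toy PyTorch model rather than a real large language model: a batch of new examples produces a loss, and a single gradient step nudges parameters throughout the network, which is why a fix aimed at one behavior can ripple into others.

```python
# Bare-bones fine-tuning step on a toy model (not how any GPT model is
# actually tuned): a batch of new examples nudges parameters everywhere.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Pretend these are newly collected examples (say, user feedback).
new_inputs = torch.randn(32, 16)
new_labels = torch.randint(0, 2, (32,))

snapshot = [p.detach().clone() for p in model.parameters()]

optimizer.zero_grad()
loss = loss_fn(model(new_inputs), new_labels)
loss.backward()
optimizer.step()

# The update touches parameters throughout the network, not just ones
# "responsible" for the targeted behavior -- the ripple effect described above.
moved = sum((p - s).abs().sum().item()
            for p, s in zip(model.parameters(), snapshot))
print(f"loss {loss.item():.3f}, total parameter movement {moved:.3f}")
```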

In the case of GPT-4, it’s possible that the OpenAI developers were trying to make the tool less prone to offering answers that might be deemed offensive or dangerous. And through prioritizing safety, maybe other capabilities got caught up in the mix, McKeown says. For instance, OpenAI may have used fine-tuning to set new limits on what the model is allowed to say. Such a change might have been intended to prevent the model from sharing undesirable information but inadvertently ended up reducing the AI’s chattiness on the topic of prime numbers. Or perhaps the fine-tuning process introduced new, low-quality training data that reduced the level of detail in GPT-4’s answers on certain mathematical topics.

Regardless of what’s gone on behind the scenes, it seems likely that GPT-4’s actual capacity to identify prime numbers didn’t really change between March and June. It’s quite possible that the large language model—built to probabilistically generate human-sounding strings of text and not to do math—was never really all that good at prime recognition in the first place, says Sayash Kapoor, a computer science Ph.D. candidate at Princeton University.

Instead Kapoor speculates that the shift in prime detection could be an illusion. Through a quirk in the data used to fine-tune the model, developers might have exposed GPT-4 to fewer primes and more composite numbers after March, thus changing its default answer on questions of primeness over time from “yes” to “no.” In both March and June, GPT-4 may not really have been assessing primeness but simply offering the answer that seemed most likely based on incidental trends it absorbed from the data it was fed.
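Kapoor’s point can be seen with simple arithmetic: if the test set consists entirely of primes, accuracy mostly measures the model’s default answer rather than its grasp of primality. The sketch below uses illustrative answer counts chosen to match the study’s reported figures.

```python
# Sketch of Kapoor's point: on a test set made up entirely of primes,
# accuracy mostly reflects the model's default answer. The counts below
# are illustrative, chosen to reproduce the reported 97.6% and 2.4%.

def accuracy_on_all_primes(answers: list[str]) -> float:
    """Every number is prime, so an answer is correct exactly when it is 'yes'."""
    return sum(a == "yes" for a in answers) / len(answers)

march_like = ["yes"] * 488 + ["no"] * 12   # a default that leans "yes"
june_like = ["yes"] * 12 + ["no"] * 488    # a default that flipped to "no"

print(f"March-like run: {accuracy_on_all_primes(march_like):.1%}")  # 97.6%
print(f"June-like run:  {accuracy_on_all_primes(june_like):.1%}")   # 2.4%
```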

Asked if this would be akin to a human developing a bad mental habit, Kapoor rejects the analogy. Sure, neural networks can pick up maladaptive patterns, he says—but there’s no logic behind it. Where a person’s thoughts might fall into a rut because of how we understand and contextualize the world, an AI has no context and no independent understanding. “All that these models have are huge tons of data [meant to define] relationships between different words,” Kapoor says. “It’s just mimicking reasoning, rather than actually performing that reasoning.”
