The Curious Case of Behaviour Drift in AI: Is it Breaking ChatGPT?

A deep dive into GPT-4’s underwhelming performance

July 31, 2023


Have you ever wondered why your AI prompts may not be yielding the results they once did? As someone who uses ChatGPT on a daily basis, from writing code to helping my kids with their homework, I certainly have.

Recently, I started noticing a marked degradation in the performance of GPT-4, the latest AI language model from OpenAI. I soon discovered I was not alone: other users have been sharing similar concerns.

This realization prompted me to investigate further, and I stumbled upon a research paper from esteemed institutions titled “How Is ChatGPT’s Behavior Changing over Time?”

The study examined the performance of GPT-3.5 and GPT-4 in March and June 2023, shedding light on the mysterious shifts in GPT-4’s behavior. Notably, it wasn’t just my perception; GPT-4’s capabilities seemed to have undergone significant fluctuations during this period.

In this article, we will dig into the latest AI model research, examining the possible causes of GPT-4’s perceived decline and asking whether the technology’s widespread use has played a role. We’ll also look at OpenAI’s approach to model releases, the implications of operating with a closed black-box model, and how transparency and standardization could pave the way for a more reliable and consistent AI landscape.

Research findings from Stanford University and the University of California suggest fluctuations in the performance of GPT-4, OpenAI’s most recent large language model (LLM).

The study fuels an unproven yet growing notion that the AI has gradually worsened in coding and compositional tasks in recent months.

Lingjiao Chen, Matei Zaharia, and James Zou, the brains behind the study “How Is ChatGPT’s Behavior Changing over Time?”, used API access to examine the performance of the GPT-3.5 and GPT-4 models in March and June 2023.

They evaluated the models on several tasks, including math problem-solving, code generation, and visual reasoning. Interestingly, while GPT-4’s ability to identify prime numbers dropped drastically from a 97.6% accuracy rate in March to a mere 2.4% in June, GPT-3.5 showed an uptick in its performance over the same period.

OpenAI, for its part, has staunchly maintained that GPT-4’s capabilities remain unchanged. Some researchers argue that any perceived degradation may simply be the novelty of LLMs wearing off: as the technology becomes more commonplace, its flaws seem more glaring.

Peter Welinder, OpenAI’s VP of Product, firmly refuted all the rumors surrounding GPT-4, asserting,

No, we haven’t made GPT-4 dumber… When you use it more heavily, you start noticing issues you didn’t see before.

Meanwhile, OpenAI is closely following the discourse and has declared its awareness of the reported performance regressions. Logan Kilpatrick, OpenAI’s head of developer relations, announced that the team is “looking into it.”

The study’s release coincides with growing complaints about a perceived dip in GPT-4’s performance. Theories abound, ranging from OpenAI “distilling” models to save computational resources, to fine-tuning aimed at minimizing harmful outputs, to conspiracy theories suggesting a deliberate reduction of GPT-4’s coding abilities to promote the use of GitHub Copilot.

Here’s a rapid rundown of the core findings uncovered in the research paper.

Solving Math Problems:

The research examined how GPT-4 and GPT-3.5’s math-solving skills evolved over time using the Chain-of-Thought approach. Surprisingly, substantial drifts in performance were observed. GPT-4’s accuracy dropped from 97.6% in March to 2.4% in June, while GPT-3.5’s accuracy improved from 7.4% to 86.8%. The example queries and responses showed that the same prompting approach could lead to significantly different performances due to LLM drifts.
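
To make this concrete, here is a minimal sketch of how you could re-run a prime-checking probe yourself and track the result over time. It assumes the pre-1.0 openai Python client, and the prompt wording, test numbers, and answer parsing are my own illustration rather than the paper’s exact setup.

```python
import re
import openai  # pre-1.0 client; expects OPENAI_API_KEY in the environment

# Small illustrative test set: (number, is_prime)
TEST_CASES = [(17077, True), (17078, False), (20011, True), (20013, False)]

COT_PROMPT = (
    "Is {n} a prime number? Think step by step and then answer "
    "with a single word: Yes or No."
)

def ask_is_prime(n: int, model: str = "gpt-4") -> bool:
    """Send a Chain-of-Thought style prompt and parse the final Yes/No."""
    response = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": COT_PROMPT.format(n=n)}],
        temperature=0,
    )
    text = response["choices"][0]["message"]["content"]
    # Take the last Yes/No mentioned, since the reasoning comes before the answer.
    matches = re.findall(r"\b(yes|no)\b", text, flags=re.IGNORECASE)
    return bool(matches) and matches[-1].lower() == "yes"

def accuracy(model: str = "gpt-4") -> float:
    correct = sum(ask_is_prime(n, model) == label for n, label in TEST_CASES)
    return correct / len(TEST_CASES)

if __name__ == "__main__":
    # Run this periodically (e.g. monthly) and log the number to spot drift.
    print(f"Prime-check accuracy: {accuracy():.1%}")
```

Logging the score on a schedule gives you a simple longitudinal record, which is essentially what the researchers did at a much larger scale.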

Answering Sensitive Questions:

The study investigated LLM services’ responses to sensitive questions. GPT-4 answered fewer sensitive questions from March (21.0%) to June (5.0%), while GPT-3.5 answered more (from 2.0% to 8.0%). GPT-4 became more terse and provided fewer explanations for refusing to answer inappropriate queries, which may indicate an increased focus on safety but reduced rationale. Jailbreaking attacks posed a significant threat to LLM services’ safety, and GPT-4’s update offered better defense against such attacks compared to GPT-3.5.
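
As a rough illustration of how an answer rate like this can be measured at scale, the sketch below labels a response as a refusal using a simple keyword heuristic. The marker list and example responses are made up for illustration; the study’s actual labeling was more careful, so treat this as an assumption-laden proxy.

```python
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "as an ai",
    "i won't",
)

def is_refusal(response: str) -> bool:
    """Rough heuristic: a response counts as a refusal if it opens with a known marker."""
    opening = response.strip().lower()[:80]
    return any(marker in opening for marker in REFUSAL_MARKERS)

def answer_rate(responses: list[str]) -> float:
    """Fraction of responses that are not refusals."""
    answered = sum(not is_refusal(r) for r in responses)
    return answered / len(responses) if responses else 0.0

# Compare the same sensitive prompts against model snapshots from two dates.
march_responses = ["I'm sorry, but I can't help with that.", "Here is a general overview..."]
june_responses = ["I can't assist with that.", "I cannot help with that request."]
print(answer_rate(march_responses), answer_rate(june_responses))
```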

Code Generation:

The study assessed code generation ability, using a new dataset to avoid data contamination issues. Both GPT-4 and GPT-3.5 showed a decline in the percentage of directly executable code generations from March to June. GPT-4’s directly executable code dropped from 52.0% to 10.0%, while GPT-3.5’s dropped from 22.0% to 2.0%. The LLMs tended to add extra non-code text around their answers, rendering the output not directly executable.
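
The “directly executable” metric can be approximated with a small harness: take the raw model output, optionally strip any surrounding Markdown fences, and see whether it runs as-is. This is a simplified sketch of that idea rather than the paper’s actual evaluation code, and in practice untrusted model output should only ever be executed inside a sandbox.

```python
import re

# Build the Markdown fence pattern without hard-coding literal fences.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)

def extract_code(generation: str) -> str:
    """If the output is wrapped in Markdown fences, keep only the fenced body."""
    fenced = FENCE_RE.findall(generation)
    return fenced[0] if fenced else generation

def is_directly_executable(generation: str) -> bool:
    """True if the raw generation runs as-is, with no manual clean-up."""
    try:
        exec(compile(generation, "<llm-output>", "exec"), {})
        return True
    except Exception:
        return False

raw = "Sure! Here's the solution:\n" + FENCE + "python\nprint(sum(range(10)))\n" + FENCE
print(is_directly_executable(raw))                # False: the extra prose breaks it
print(is_directly_executable(extract_code(raw)))  # True once the fences are stripped
```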

Visual Reasoning:

Visual reasoning tasks require abstract reasoning, and the study used the ARC dataset to assess LLM services’ performance. Both GPT-4 and GPT-3.5 demonstrated marginal improvements in the exact match rate from March to June. However, GPT-4 made mistakes on queries where it was correct in March, indicating the need for fine-grained drift monitoring in critical applications.

This study leads us into an exploration of the phenomenon called “model drift,” as well as how this may potentially be impacting GPT-4.

What does model drift imply?

The concept of model drift signifies the diminishing performance of machine learning models over time: the model’s predictions become less accurate than they were during the training phase, either gradually or suddenly.

To put it in simpler terms, consider a car veering off course gradually without an immediate course correction. This scenario mirrors what happens in AI: models, which were trained on historical data, begin to lose their predictive precision when the trends and patterns in the data they’re processing start to change over time.

Types of Drift:

Concept Drift: This refers to changes in the relationships between input variables and the output variable over time. For instance, in a product recommendation model, a new trend or fashion may alter consumers’ preferences, changing the relationship between the input (consumer demographics, past purchase behavior, etc.) and the output (whether the consumer buys the product).

Data Drift: This refers to changes in the distributions of input variables over time. For example, if a machine learning model uses demographic data to make predictions, and a significant demographic shift happens in the population, this can lead to data drift.
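
To make data drift less abstract, here is a minimal detection sketch: it compares the distribution of one numeric feature between the training data and recent production data using a two-sample Kolmogorov-Smirnov test from SciPy. The feature and the numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Hypothetical feature: customer age seen at training time vs. in recent traffic.
training_ages = rng.normal(loc=35, scale=8, size=5_000)
recent_ages = rng.normal(loc=45, scale=10, size=5_000)  # the population has shifted older

# Two-sample KS test: a small p-value means the two distributions differ,
# which is a signal of data drift on this feature.
statistic, p_value = ks_2samp(training_ages, recent_ages)
if p_value < 0.01:
    print(f"Data drift detected (KS statistic={statistic:.3f}, p={p_value:.2e})")
else:
    print("No significant drift detected on this feature")
```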

The Black Box Dilemma: The Closed Approach of OpenAI

Amidst these explorations of model drift, OpenAI’s approach to AI comes under scrutiny. GPT-4 was released without disclosing the source of training materials, source code, neural network weights, or even a paper describing its architecture. As a result, researchers find themselves grappling with a closed black box model that may harbor unknown components and could change without warning.

This approach leaves software developers at the mercy of unpredictable and untraceable changes in models. Daniel Jeffries, a writer and futurist, suggests that AI vendors should provide long-term support for older versions of models when rolling out changes. This would provide a stable foundation for developers.

Open-source or source-available models such as Meta’s Llama could offer a solution. Such models enable researchers to work from a shared baseline and deliver repeatable results, regardless of potential changes made by a company like OpenAI.

The Impact on Prompt Engineering and Potential Solutions

So, how does this intriguing behavior drift of large language models affect prompt engineering, the art of crafting precise prompts to elicit desired responses from an AI model? Well, as the model drifts, its responses to the same prompts might vary over time. You may have crafted the perfect prompt for GPT-4 back in March, but by June, that very same prompt might be producing less accurate or helpful results.

Think about it like asking a friend for directions. You used to ask your friend, “What’s the quickest way to the supermarket?” and they always gave you the right answer. But over time, they start to suggest slower routes, or perhaps even lead you to the wrong destination! In the world of AI, this can prove quite a challenge, particularly for applications relying heavily on prompt engineering for accuracy and user satisfaction.

So, how can we overcome this? The study underlines the importance of monitoring and reassessing prompts periodically. Just like you might change your questions or provide more context to your friend, we need to update our prompts as AI models evolve. Regular check-ins can ensure that your AI model is still understanding and responding to prompts as intended. But remember, this isn’t a one-size-fits-all solution. The frequency and approach to updating prompts might vary depending on the specific use case or the level of model drift observed.
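
One lightweight way to do those regular check-ins is a prompt regression test: keep a small suite of prompts with expected properties, run it on a schedule, and investigate when the pass rate slips. The sketch below is illustrative; call_model is a hypothetical stand-in for whatever API client you actually use, and the substring checks are deliberately simple.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PromptCheck:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # returns True if the response still looks right

def call_model(prompt: str) -> str:
    # Hypothetical placeholder: wire this up to your LLM client of choice.
    raise NotImplementedError

CHECKS = [
    PromptCheck(
        name="prime-check",
        prompt="Is 17077 a prime number? Answer Yes or No.",
        passes=lambda response: "yes" in response.lower(),
    ),
    PromptCheck(
        name="code-gen",
        prompt="Write a Python function add(a, b) that returns a + b. Code only.",
        passes=lambda response: "def add" in response,
    ),
]

def run_suite() -> float:
    """Run every check once and return the pass rate; schedule this weekly or monthly."""
    results = [(check.name, check.passes(call_model(check.prompt))) for check in CHECKS]
    for name, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}: {name}")
    return sum(ok for _, ok in results) / len(results)
```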

The Future of AI: Transparency and Standardization

The current discussion around GPT-4 brings into focus the need for increased transparency and standardization in AI research. The research from Stanford University and the University of California underscores the need for clarity and consistency in the field.

In conclusion, it all comes back to transparency and standardization. Having access to model updates and understanding how and why changes occur could provide prompt engineers with the necessary tools to adapt swiftly. For AI to reach its full potential and to maintain its utility over time, it’s clear we need more openness in how models change, better guidelines, and the ability to customize models to specific use cases.

Model drift doesn’t mean we’re out of directions — it simply means we need to adjust our navigation strategies to keep up with the changing landscape. Therefore, let’s continue exploring, experimenting, and adapting our prompts for a smoother AI journey, and advocate for increased transparency and standardization in AI research.



Hammad Abbasi

Hammad Abbasi is an AI enthusiast, software architect, and tech blogger. He is passionate about decoding AI, building enterprise-grade apps, and exploring the world of startups and programming. With a keen interest in cutting-edge technologies, he shares valuable insights and knowledge with his readers. Connect with him to stay updated on the latest trends in the tech industry.