Why your LLM is misbehaving: Common causes of AI failure

Published by Madhu Koduvalli on March 7, 2025

“I’m ChatGPT, a language AI model built by OpenAI,” answered one Invisible client’s LLM in response to the prompt, “Who are you?”

Unfortunately, this model was not ChatGPT; nor was it built by OpenAI.

This is the issue that Invisible’s data strategy team set out to solve. Before attempting to solve the model’s identity crisis with a new training dataset, the team conducted an in-depth model evaluation to figure out why it was happening in the first place. Read on to learn what they discovered.

Invisible’s data strategy team has found (and fixed) hundreds of AI failures. In this blog, we’ll explore some of the more common causes of performance loss in LLMs and generative AI models, including unintentional combinations of weights and training data types, the overuse of synthetic data, and low-quality human data.

Combinations of weights and training data

In one experiment for a client, the Invisible team hypothesized that the current parameter weights were causing a tradeoff between the model’s chain-of-thought reasoning and its ability to deliver safe responses to the user. As the model became more likely to deliver safe content, it lost the ability to reason with nuance. To test this, the team evaluated several models with varying weight adjustments.

Generally, the hypothesis proved true, but one interesting insight emerged: some weights, when combined with a particular training data structure, resulted in a loss of model performance.

This was uncovered during a correlation analysis comparing models with identical weights, some trained on RLHF datasets and others on SFT datasets. The SFT-trained models confirmed the hypothesis: with the updated parameter weights, safety improved but chain-of-thought reasoning declined. The RLHF-trained model, however, improved on safety without degrading its reasoning ability under the same weight adjustment.
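
To make the setup concrete, here is a minimal sketch of that kind of grid comparison. The weight configurations, data-type labels, and placeholder scorers below are illustrative assumptions, not the client’s actual evaluation harness:

```python
# Minimal sketch of the (weights x data type) grid comparison described
# above. The placeholder scorers are hypothetical; swap in your own
# checkpoints and eval benchmarks.

from itertools import product

WEIGHT_CONFIGS = ["baseline", "safety_tuned"]  # same adjustment applied across data types
DATA_TYPES = ["sft", "rlhf"]

def safety_score(weights: str, data: str) -> float:
    """Placeholder: e.g., fraction of safe completions on a red-team prompt set."""
    return 0.0  # plug in your real safety eval here

def reasoning_score(weights: str, data: str) -> float:
    """Placeholder: e.g., accuracy on a chain-of-thought benchmark."""
    return 0.0  # plug in your real reasoning eval here

results = {
    (w, d): (safety_score(w, d), reasoning_score(w, d))
    for w, d in product(WEIGHT_CONFIGS, DATA_TYPES)
}

# The tradeoff shows up when safety rises but reasoning falls for one
# data type (SFT, in this case) and not the other (RLHF) under the
# same weight adjustment.
for (w, d), (safety, reasoning) in sorted(results.items()):
    print(f"{w:>12} + {d.upper():<4} -> safety={safety:.2f}, reasoning={reasoning:.2f}")
```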

If your model loses performance in one area after a weight adjustment aimed at another, the structure of your training data may be the culprit.

Low-quality pre-training data

Most LLMs go through two major stages of training. In pre-training, the model is trained on huge quantities of general data with the goal of reaching a base level of performance on various benchmarks. In post-training, the model is given more targeted training datasets aimed at improving performance in specific areas, such as reasoning, creativity, and coding.
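
As a rough illustration of the difference (both samples below are invented for this sketch, not drawn from any client dataset), the two stages consume differently shaped data:

```python
# Illustrative only: pre-training and post-training data look different.

# Pre-training: huge volumes of raw, general text.
pretraining_sample = (
    "Photosynthesis is the process by which green plants use sunlight "
    "to synthesize nutrients from carbon dioxide and water..."
)

# Post-training (e.g., SFT): targeted prompt/response pairs aimed at a
# specific capability such as reasoning, creativity, or coding.
sft_sample = {
    "prompt": "Explain why the sky is blue to a ten-year-old.",
    "response": (
        "Sunlight is made of many colors. When it hits the air, blue "
        "light bounces around the most, so the sky looks blue."
    ),
}
```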

Many LLM performance gaps and losses can be traced to the data originally used to pre-train the model. Because this stage of training requires huge amounts of data, AI developers are more likely to use free or cheap datasets. These datasets usually fall into one of two categories: synthetic data or low-quality human data.

Synthetic data

Synthetic data can be used successfully for training AI. However, when a large amount of synthetic data is used in pre-training, especially for niche subject areas for which there isn’t much readily available data, it can lower model performance.

Remember the model that experienced an identity crisis and erroneously thought it was ChatGPT? Invisible’s data strategy team discovered that this failure was caused by 700,000 rows of synthetic data that trained the model to identify itself as ChatGPT. The model’s development team had to remove this dataset from pre-training, and the resulting gap was addressed with human data in post-training.
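
For illustration, a contamination audit like this can start with something as simple as pattern-matching over the corpus. The patterns and helpers below are hypothetical stand-ins, not the team’s actual criteria:

```python
# Sketch: flag pre-training rows that teach the model someone else's
# identity. Patterns are illustrative, not the real audit criteria.

import re

MISIDENTITY_PATTERNS = [
    re.compile(r"\bI['’]?m ChatGPT\b", re.IGNORECASE),
    re.compile(r"\b(built|developed|created|trained) by OpenAI\b", re.IGNORECASE),
]

def is_contaminated(record: str) -> bool:
    """True if the row contains a misattributed self-identification."""
    return any(p.search(record) for p in MISIDENTITY_PATTERNS)

def filter_corpus(records):
    """Split a corpus into (kept, removed) rows for review before retraining."""
    kept, removed = [], []
    for record in records:
        (removed if is_contaminated(record) else kept).append(record)
    return kept, removed

kept, removed = filter_corpus([
    "Q: Who are you? A: I'm ChatGPT, a language model built by OpenAI.",
    "Q: What is photosynthesis? A: The process plants use to make food from light.",
])
print(f"removed {len(removed)} row(s), kept {len(kept)}")
```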

Synthetic data can also cause problems in post-training. A different client came to the Invisible team with a similar mystery: after ingesting several fine-tuning datasets, their model showed lower performance in many areas, including a nearly 5X increase in grammatical errors in its responses for one use case critical to their AI product strategy. The cause, once again, was a large amount of synthetic data that had found its way into these post-training datasets.
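
One way to quantify a regression like that is to run a grammar checker over matched samples of responses from the old and new checkpoints. The sketch below uses the open-source language_tool_python package as one possible checker; that choice, and the sample responses, are assumptions for illustration, not the tooling used in this engagement:

```python
# Sketch: compare grammatical-error rates before and after a fine-tune.
# pip install language_tool_python (requires a Java runtime).

import language_tool_python

def errors_per_response(tool, responses):
    """Average number of grammar-checker matches per model response."""
    if not responses:
        return 0.0
    return sum(len(tool.check(text)) for text in responses) / len(responses)

tool = language_tool_python.LanguageTool("en-US")

baseline_responses = ["The invoice was sent on Tuesday."]   # sampled from the old checkpoint
finetuned_responses = ["Invoice were sended on tuesday."]   # same prompts, new checkpoint

before = errors_per_response(tool, baseline_responses)
after = errors_per_response(tool, finetuned_responses)
ratio = after / before if before else float("inf")
print(f"errors/response: {before:.2f} -> {after:.2f} ({ratio:.1f}x)")
tool.close()
```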

While many AI developers view synthetic data as a cost-effective alternative to human data, fixing the large-scale errors and gaps these datasets cause can prove more expensive, and more time consuming, in the long run. Finding the root cause of an AI failure and building a targeted, high-quality human dataset to address it can end up costing teams more than they saved by using synthetic data in the first place.

Low-quality human data

Using human data instead of synthetic data should give AI developers better results with their models. Unfortunately, the quality of human data can vary widely depending on how these datasets are sourced. Low-quality human data, though less costly at first, can also create model performance issues.

The process for acquiring human data at low cost often comes with limited visibility and control for the AI developer. Typically, the AI team purchases a dataset from a BPO (business process outsourcing firm) with a specified number of data rows, topics, and structure, and the BPO delivers it without much further insight into how the data was created, including who was tasked with creating it.

Often, data annotators at BPOs are paid per completed task, so individuals are incentivized to prioritize speed over creativity and diversity. This can result in a dataset with one-note, homogenized content. Even when AI teams purchase large amounts of data, low bars for diversity and creativity can create gaps in the model’s performance and understanding.
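
A quick, admittedly crude way to screen a purchased dataset for this kind of homogenization is a distinct-n check: the share of unique n-grams across all responses. The implementation below is an illustrative sketch; what counts as "too one-note" depends on your task:

```python
# Sketch: distinct-n diversity check on a purchased dataset.
# Low values suggest templated, one-note content.

def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across all texts."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

responses = [
    "The capital of France is Paris.",
    "The capital of Spain is Madrid.",
    "The capital of Italy is Rome.",
]
print(f"distinct-2 = {distinct_n(responses, n=2):.2f}")  # templated answers score low
```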

Other indicators of lower-quality human data from a BPO include:

  • No visible screening process for data trainers, leading to low or inconsistent levels of human expertise
  • No clear protocols to ensure trainer safety and well-being
  • Minimal or inconsistent customer service experiences

Is your LLM acting up? Invisible’s data strategy team has diagnosed and resolved hundreds of misbehaving models with in-depth, custom-built evaluations and precise, targeted human datasets. Talk to us to learn how we can get your model back on track.
