
How leading AI teams use synthetic data generation for training their models


One of Invisible’s clients recently came to us with a mysterious case of mistaken identity: a large language model that claimed to be a completely different LLM when prompted. A thorough, custom model evaluation by Invisible’s data strategy team revealed the culprit: 700K rows of synthetic data that had found their way into the model’s pre-training dataset.

This revelation prompted a deeper look into the effects of synthetic data — both positive and negative — on AI algorithms, based on the Invisible team’s experience in training the majority of the world’s leading foundation models. Synthetic data is often used as a low-cost alternative to human data or real-world data. Training a complex foundation model can require millions of rows of data, so it’s no surprise that AI engineering and data science teams building machine learning models turn to less costly options. While inexpensive up front, however, the Invisible team has found that synthetic data can actually be more expensive in the long run when used without consideration for specific use cases and model performance goals.

When to use synthetic data — and when to avoid it

In our experience, training an AI model typically requires three categories of training data content:

  1. Objective content on general topics. These are subjects for which there’s already a lot of widely available content. Examples include basic math, science, and grammar.
  2. Objective content on niche topics. Information on these topics is harder to source, either because it’s proprietary or because the subject is less widely known or researched. Examples include law, new research in healthcare, or information pertaining to a specific organization.
  3. Subjective content. These topics involve mimicking human thought and include skills like creative writing and chain of thought reasoning.

Synthetic data can work well for the first category, as there’s plenty of readily available information on these topics to mimic, they tend to be definitive and factual, and anonymization is easy. This type of content can easily be covered with a well-curated, high-quality synthetic dataset to increase generative AI model performance.
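
To make this concrete, here is a minimal, hypothetical sketch of what rows in such a dataset might look like, using basic arithmetic as a stand-in for objective, widely documented content. Real pipelines typically generate and curate content with an LLM; the row structure, field names, and file format here are illustrative assumptions, not a specific team’s pipeline.

```python
# Toy illustration (not Invisible's pipeline): programmatically generating
# synthetic rows for an objective, widely documented topic -- basic arithmetic.
import json
import random

random.seed(42)  # keep the toy example reproducible

rows = []
for _ in range(1000):
    a, b = random.randint(2, 99), random.randint(2, 99)
    rows.append({
        "prompt": f"What is {a} + {b}?",
        "response": str(a + b),
        "category": "objective_general",  # first content category above
        "source": "synthetic",
    })

# JSONL is a common format for training rows; the file name is illustrative.
with open("synthetic_math.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```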

With the second category, synthetic data can be less effective due to the lack of diversity of content on the subject. If a large synthetic dataset is generated based on only a handful of research papers, for example, the model may end up with gaps in its understanding of the subject. This category can also include topics with sensitive data, making anonymization and data privacy processes necessary before use in training AI models.

Generating synthetic data to train a model on subjective content is riskier: it can introduce errors, hallucinations, and other issues that are far more expensive to fix than whatever using synthetic data saved in the first place.

How to build hybrid synthetic-human training datasets

Leveraging synthetic data isn’t necessarily a do-or-don’t situation — many leading AI developers have successfully combined synthetic and human data in efforts to save time and costs without compromising on quality or model performance.

Seeded prompts: data rows with synthetic and human content

In one use case, an Invisible client needed to generate training data to improve chain of thought reasoning. Because this data falls into the third category of subjective content, there was no easy or quick alternative to human input in building these datasets. Still, that didn’t mean that every aspect of each data row had to be generated by people. To speed up the process, the client had an AI model generate thousands of prompts. These prompts were then given to human data trainers, who added appropriate answers and scratchpad reasoning to each data row.

By synthetically generating the prompts for the dataset, the client saved a significant number of hours. By ensuring that humans provided the answers and reasoning, the dataset was successful in improving the model’s chain of thought reasoning even though parts of each data row were synthetically generated.
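
A minimal sketch of how such a seeded-prompt pipeline might be structured is below. The function and field names are placeholders, not the client’s actual tooling; the key idea is that the prompt field is filled synthetically while the answer and scratchpad fields are left blank for human trainers.

```python
# Hypothetical sketch of a seeded-prompt pipeline: prompts are drafted by a
# model, then routed to human trainers who add the answer and scratchpad
# reasoning. Names below are placeholders, not a real client API.
import json

def generate_prompts_with_model(topic: str, n: int) -> list:
    """Stand-in for a call to whatever LLM the team uses to draft prompts."""
    return [f"[{topic}] synthetic prompt #{i}" for i in range(n)]

def build_seeded_rows(topic: str, n: int) -> list:
    rows = []
    for prompt in generate_prompts_with_model(topic, n):
        rows.append({
            "prompt": prompt,              # synthetic
            "scratchpad": None,            # to be written by a human trainer
            "answer": None,                # to be written by a human trainer
            "prompt_source": "synthetic",
            "completion_source": "human",
        })
    return rows

if __name__ == "__main__":
    with open("seeded_rows_for_annotation.jsonl", "w") as f:
        for row in build_seeded_rows("multi-step planning", 1000):
            f.write(json.dumps(row) + "\n")
```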

Preference datasets: synthetic data, human choices

Preference datasets are often used for fine-tuning models. These datasets present human trainers with one prompt and two answers. The trainer then chooses the superior answer and provides additional context, such as why that answer was better and by how much.

Creating these datasets is a great opportunity for hybridizing synthetic and human data. The prompt and both answers in each data row can be model-generated (synthetic), accelerating the data generation process, while human trainers can quickly and easily select their preference and explain their reasoning.
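
As a rough sketch, a single row in such a hybrid preference dataset might look like the following. The field names and the 1–5 margin scale are illustrative assumptions, not a standard schema.

```python
# Hypothetical schema for one preference-dataset row: the prompt and both
# candidate answers are model-generated; only the preference judgment,
# rationale, and margin come from a human trainer. Field names are illustrative.
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PreferenceRow:
    prompt: str                  # synthetic
    answer_a: str                # synthetic
    answer_b: str                # synthetic
    preferred: Optional[str]     # "a" or "b", chosen by a human trainer
    rationale: Optional[str]     # human explanation of why it was better
    margin: Optional[int]        # e.g. 1-5 rating of how much better

row = PreferenceRow(
    prompt="Explain photosynthesis to a ten-year-old.",
    answer_a="Plants use sunlight to turn water and air into food...",
    answer_b="Photosynthesis is the process by which chlorophyll...",
    preferred=None,   # left blank until a trainer reviews the pair
    rationale=None,
    margin=None,
)

print(json.dumps(asdict(row), indent=2))
```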

How to target your synthetic datasets

Synthetic datasets are most effective when they’re highly targeted to address a specific gap or error in your model’s performance, especially if the content needed fits into that first category of objective, widely available topics.

For example, one AI lab had a foundation model that wasn’t meeting some of their benchmarks, delaying their deployment timeline. They estimated that the issue would require 100K rows of human data to address. 

When Invisible’s data strategy team conducted its model evaluations, however, it found that the model generally performed well; it just had two very specific gaps in its understanding of human professions.

Armed with a more comprehensive understanding of the model’s failure states, Invisible’s team created a custom, targeted approach that included:

  • One small human dataset of 4K rows to address more complex behavior changes
  • A larger synthetic dataset to address simple scenarios

With only 4% of the human data they had originally anticipated, plus a targeted synthetic dataset, the AI lab achieved a 97% improvement in model performance on the benchmarks in question, at far less time and cost than expected.
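
For illustration only, assembling that kind of hybrid fine-tuning set could be as simple as concatenating and shuffling the two sources. The file names and proportions below are assumptions, not the lab’s actual pipeline.

```python
# Hypothetical sketch of assembling the hybrid fine-tuning set described above:
# a small human dataset for complex behavior changes plus a larger synthetic
# dataset for simple scenarios. File names are illustrative.
import json
import random

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

human_rows = load_jsonl("human_complex_behaviors.jsonl")        # e.g. ~4K rows
synthetic_rows = load_jsonl("synthetic_simple_scenarios.jsonl") # larger set

combined = human_rows + synthetic_rows
random.seed(0)
random.shuffle(combined)  # interleave sources so training batches mix both

with open("targeted_finetune_set.jsonl", "w") as f:
    for row in combined:
        f.write(json.dumps(row) + "\n")
```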

What’s your synthetic data strategy?

If you’re building or fine-tuning an AI model, it’s difficult to ignore the allure of synthetic data: after all, it’s cheaper, faster, and easier to get than human data. But it can end up costing your team much more to fix the problems synthetic data can create unless you’re strategic about how it’s used in the first place.

Synthetic data can be highly beneficial and effective when achieving base performance levels in pre-training on objective and common subjects. It can also be effective for fine-tuning for the same type of content, if it’s built to address specific errors or gaps in understanding based on thorough model evaluations. AI teams can also save time and costs by using hybrid synthetic-human training data for niche topics or subjective content.

How does your AI team use synthetic data?
