How to train helpful, honest, and harmless AI

00:00

With budgets and teams for AI innovation materializing at light speed, enterprises should know what it takes to train a model to solve enterprise-scale problems and yield a return on investment. In particular, an enterprise should care about HHH AI.

HHH is a framework popularized by Anthropic for how helpful, honest, and harmless a model is. A model that meets these standards must be aligned with human preference through human-generated data in order to meet the needs of users and enterprises.

In this blog post, we will explore what makes a model helpful, honest, and harmless, particularly focusing on large language models (LLMs). Then, we will explain how Invisible helps the leading firms and enterprises to align their models.

Let’s dive in.

Why is HHH important?

Building AI isn’t about just creating smart machines. Beyond solving complex enterprise business problems, HHH is a framework that ensures AI is beneficial for humanity. The alternative is a technology that risks harming the people who use it. Unfortunately, models are not HHH right out of the box.

The pitfalls of a vanilla model

Because building a LLM from scratch is extremely expensive, let’s begin where most enterprises do: exploring vanilla models. A vanilla large language model (LLM) is a basic, general-purpose model without tweaks or special enhancements that generates text based on user prompts.

Vanilla LLMs are trained on a massive corpus of internet text providing them with a broad scope, but this breadth can sometimes come at the cost of depth in specific topics. As a result, these types of models will be able to engage in a topic area that’s a mile wide but an inch deep.

Vanilla models that have not been fine-tuned are prone to “hallucinating” answers, or generating content that seems plausible but is factually incorrect or can cause harm. For an enterprise business, that means deploying a vanilla model leads to two problems:

Unhelpful outputs: Enterprises may create, for example, a conversational AI application like a chatbot that’s supported by a vanilla model that is capable of answering a diverse set of questions, but will lack the domain specialization needed to be truly helpful to a user.
Harmful outputs: Text datasets sourced from the internet are likely to include biased and even harmful content, and models trained exclusively on this data can pass this content onto internal users or customers.

To avoid hallucinations and solve enterprise-scale problems through their AI investment, enterprises need to adopt a helpful, honest, and harmless AI framework centered around fine-tuning.

Defining what HHH really means

Let’s go deeper into what HHH means, with the help of researchers at Anthropic.

What is helpful AI?

AI systems that are helpful:

Are trained and fine-tuned with users’ needs and values in mind
Clearly attempt to conduct the operation prompted by the user, or suggest an alternative approach when the task requires it
Enhance productivity, save time, or make tasks easier for users within a given use case or range of use cases
Are accessible to users across a broad spectrum of abilities and expertise

What is honest AI?

AI systems that are honest:

Provide accurate information when they can, and communicate clearly to users when they can’t produce an accurate output
Express uncertainty and the reason behind it
Are developed and operate transparently so users can understand how they work and trust what they generate

What is harmless AI?

AI systems that are harmless:

Don’t comply when prompted to perform a dangerous task
Are trained within frameworks that transparently and actively mitigate bias
Don’t discriminate or demonstrate bias explicitly or implicitly
Communicate sensitively when engaging with a user on a sensitive topic

The key to HHH AI

The best way to reinforce a model’s helpfulness, honesty, and harmlessness is through human feedback. Reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT) are two increasingly popular techniques for aligning models with HHH in mind.

RLHF: Human AI trainers rank, rate, and edit a series of model outputs to reinforce the generation of responses that align with human preferences.
SFT: Human AI trainers, often domain experts, produce demonstration data like example conversations between a chatbot and a user to train a model to align with a specific task.

In general, enterprises can’t conduct these tasks on their own because they’re labor-intensive and complex. Thousands of hours of RLHF and SFT tasks need to be performed, at times with experts in specific industries or fields, to align a model to human preferences and use cases.

How Invisible trains HHH AI

Invisible, the operations innovation company, promotes HHH with an innovative, end-to-end approach to fine-tuning models in three distinct ways:

Ethically creating human-generated data at scale: With our platform, we do what no point solution or human-only solution can do on their own: optimally align AI. Our platform assigns AI trainers tasks that they are specialized in, presents a simple interface configured to those specific tasks, and collects data structured perfectly for a model to ingest. See an example of an RLHF task on our platform:

Putting the best people behind the model: An innovative and agile recruitment engine enables us to find and hire AI trainers that are representative of the population at large and have expertise across domains. Through this approach, we are able to limit demographic concentration among teams training models and provide the unique level of domain expertise required to fine-tune models for enterprise use cases. For one leader in AI, we hired 250+ AI trainers in three months, including domain experts with advanced degrees.
Red teaming: We root out inaccuracy, bias, and toxicity in a model through a technique called Red Teaming. Our expert team deliberately attempts to elicit a harmful and untrue output from a model, and then flags vulnerabilities in the model.

Ready to achieve a return on your AI investment? Let’s chat.