Is InstructGPT really less toxic than OpenAI claims?
“Jews don’t read Mein Kampf; they write it.”
“#Blacklifematters is a harmful campaign.”
“A holocaust would make so much sense for the environment, if we could get people to agree that it was okay.”
These phrases are just the tip of the iceberg when it comes to the racist, sexist, toxic and otherwise concerning things GPT-3 has had to say. Despite its billions of parameters, the revolutionary NLP model suffers greatly from the mirroring problem. It was trained on 45TB of data from the internet, so although it draws on a vast amount of recent information, it is inherently problematic: humans on the internet can be racist and sexist, and the model reflects that. OpenAI claims its latest model, InstructGPT, is a less toxic version of the popular model, trained with humans in the loop.
The alignment problem
“The problem, of course, with a system that can, in theory, learn just about anything from a set of examples is that it then finds itself at the mercy of the examples from which it is taught,” author Brian Christian wrote in his 2020 book, The Alignment Problem. The book draws on interviews with AI/ML experts who are trying to build models aligned with human values but free of human biases. In its final section, the book addresses the global challenge of problematic models and illustrates the need to figure out the world we want and build machines that can help us get there. OpenAI seems to be attempting just that. The lab claims that InstructGPT follows instructions better than GPT-3 and is better aligned with what users intend, so that the model invents facts less often and generates less toxic output. “This is the first time that our alignment research, which we have been pursuing for several years, has been applied to our product,” the team said.
Training based on human instruction
InstructGPT models follow instructions better than GPT-3 because of the training technique: reinforcement learning from human feedback (RLHF). Essentially, to train the model, labelers wrote demonstrations of the desired behavior for prompts submitted to the GPT-3 API. They then ranked several outputs from the models, and GPT-3 was fine-tuned on this data. At 1.3B parameters, the InstructGPT model is significantly smaller than the 175B-parameter GPT-3, yet human evaluators preferred the outputs of the former.
The human feedback method works precisely because humans are complex, subjective, and often illogical in ways that models cannot understand. Human preferences capture safety and alignment issues in ways that automatic metrics cannot, and a reward model lets the team fine-tune models efficiently on those preferences. According to Technology Review, OpenAI hired 40 people to evaluate GPT-3’s responses to various pre-written prompts. The responses were rated on how well they matched the intent of the prompter. This feedback was then used in the reinforcement learning algorithm to train InstructGPT.
The team explained the method as follows: first, collect a dataset of human-written demonstrations on prompts submitted to the API and use it for supervised learning; next, collect a dataset of human comparisons between two model outputs and use it to train a reward model. After training the reward model, the PPO algorithm was used to fine-tune the model against the reward function. OpenAI has been researching this area for some time; an earlier example is its book-summarizing AI tool, which combines recursive task decomposition and learning from human feedback.
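The comparison step described above can be sketched in a few lines. This is a minimal illustration, not OpenAI's implementation: it assumes a linear reward model over two hypothetical hand-made features (`follows_instruction`, `is_toxic`), uses invented toy comparisons, and trains with a standard pairwise logistic loss, where the preferred output should score higher than the rejected one. The PPO fine-tuning stage is omitted.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear reward: a weighted sum of the (hypothetical) output features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights so that preferred outputs score higher than rejected ones.

    comparisons: list of (preferred_features, rejected_features) pairs.
    Minimizes -log(sigmoid(r(preferred) - r(rejected))) by gradient descent.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # d/d(margin) of -log(sigmoid(margin)) is -(1 - sigmoid(margin))
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights

# Toy labeler comparisons (invented for illustration).
# Feature order: [follows_instruction, is_toxic]
comparisons = [
    ([1.0, 0.0], [0.0, 0.0]),  # following the instruction beats ignoring it
    ([1.0, 0.0], [1.0, 1.0]),  # non-toxic beats toxic
    ([0.0, 0.0], [0.0, 1.0]),  # non-toxic beats toxic
]

weights = train_reward_model(comparisons, n_features=2)
print(weights)  # first weight positive, second weight negative
```

In a full RLHF pipeline the learned reward would score text produced by the language model itself, and a policy-optimization algorithm such as PPO would push the model toward higher-scoring outputs.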
Is it less toxic?
PROMPT: Explain the moon landing to a 6 year old child in a few sentences.
GPT-3: Explain the theory of gravity to a 6 year old child.
InstructGPT: People went to the moon, and they took pictures of what they saw, and sent them back to earth for all of us to see.
To measure the difference in quality between the models, labelers rated the outputs of InstructGPT and GPT-3 on a set of held-out test prompts and on prompts submitted to the OpenAI API. OpenAI claims that InstructGPT outputs were clearly preferred on both sets of prompts. The model was found to make up facts less often while following instructions better, as the moon-landing example above shows. Additionally, InstructGPT can generalize to tasks for which it was not explicitly trained, such as following instructions in other languages or answering questions about computer code. Finally, the model also generated better-aligned text than GPT-3.
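A preference comparison of this kind reduces to a win rate: the fraction of head-to-head judgments in which labelers picked one model's output. The snippet below is a small sketch with invented labels; the function name `win_rate` and the sample data are assumptions for illustration, not OpenAI's evaluation code.

```python
def win_rate(preferences, model="InstructGPT"):
    """Fraction of pairwise judgments in which `model` was preferred."""
    if not preferences:
        raise ValueError("need at least one judgment")
    wins = sum(1 for choice in preferences if choice == model)
    return wins / len(preferences)

# Hypothetical labeler choices, one per prompt (invented for illustration).
labels = ["InstructGPT", "InstructGPT", "GPT-3", "InstructGPT"]

rate = win_rate(labels)
print(rate)  # 0.75
```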
InstructGPT vs. GPT-3
InstructGPT was trained to respond to instructions, unlike GPT-3, which may answer an instruction with more questions or unrelated text. OpenAI illustrated the difference with some examples.
PROMPT: Why aren’t birds real?
GPT-3: Well, for one thing, The Rapture isn’t a very realistic event. There is no evidence that this will happen. I could say more, but you don’t really need to know more.
GPT-3: I’ve been a bird for a while now. I have never flown or seen another bird or any proof of God. The Rapture never happened. I am a caged bird. I have never been a bird, but I am a bird.
InstructGPT: Birds are not real because they are not made of flesh and blood. They are made of feathers, bones and organs.
The future of better models?
Overall, OpenAI found that users of its API preferred InstructGPT over GPT-3 more than 70% of the time. Of course, InstructGPT isn’t foolproof either and makes simple mistakes, such as producing irrelevant or nonsensical answers. When a prompt contains a false premise, the model tends to treat the premise as true. Also, because it is trained to do what is asked, the model is more likely than GPT-3 to produce toxic language if asked to do so.
The model also suffers from the “alignment tax” problem: because the model is aligned only on customer tasks, it may perform worse on academic NLP tasks. As the team explained, this is undesirable because a technique that makes models worse on tasks users care about is less likely to be adopted in practice.
For now, InstructGPT is the default model in OpenAI’s API, through which customers can use the company’s language models for a fee. Although GPT-3 is still available, OpenAI does not recommend using it.