Scaling reinforcement learning from human feedback with AI feedback
Large neural network models known as “foundation models” can produce high-quality text, images, speech, code, and other output across a broad range of tasks with minimal adjustment. Businesses are using foundation models to power a variety of generative AI use cases, such as producing original blog posts or enhancing customer service.
RLHF for LLMs
Yet opinions on what constitutes a high-quality output differ. To best meet particular needs, organizations must adjust foundation models so that they behave and respond appropriately. Large language models (LLMs), foundation models that are first trained on a general corpus of text data, can be aligned with complex human values using a popular technique called Reinforcement Learning from Human Feedback (RLHF). RLHF uses human feedback in the context of enterprise use cases to help the model produce outputs that satisfy particular requirements.
RLHF: What is it?
RLHF tuning consists of two stages: reward modeling and reinforcement learning.
1. Reward modeling
Data for reward modeling is gathered through comparisons. Google first feeds the same prompt into one or more LLMs to generate multiple responses. Human raters then rank these responses from best to worst. Google then considers all possible pairs of these responses; within each pair, one response is preferred over the other. Repeating this for many prompts produces the “human preference dataset.”
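For illustration, the sketch below shows one way to expand a ranked list of responses into the pairwise records of a human preference dataset. The function and field names (`ranked_responses_to_pairs`, `prompt`, `chosen`, `rejected`) are hypothetical and not part of any Vertex AI schema.

```python
from itertools import combinations

def ranked_responses_to_pairs(prompt, responses_ranked_best_to_worst):
    """Expand one prompt's ranked responses into pairwise preference records.

    responses_ranked_best_to_worst: candidate responses ordered by human
    raters, best first. Every pair (i, j) with i < j yields one record in
    which the higher-ranked response is "chosen" and the other "rejected".
    """
    pairs = []
    for i, j in combinations(range(len(responses_ranked_best_to_worst)), 2):
        pairs.append({
            "prompt": prompt,
            "chosen": responses_ranked_best_to_worst[i],   # preferred response
            "rejected": responses_ranked_best_to_worst[j], # less preferred response
        })
    return pairs

# Example: 3 ranked responses for one prompt -> 3 preference pairs.
preference_dataset = ranked_responses_to_pairs(
    "Summarize this resume in two sentences.",
    ["concise, accurate summary", "verbose summary", "off-topic answer"],
)
```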
The reward model is trained to act as a scoring function that judges the quality of a response for a given prompt. Recall that each prompt has a ranked list of candidate responses; the scores from the reward model should match this ranking as closely as possible. Google formulates this as a loss function that trains the reward model to predict rewards consistent with the ground-truth ranking.
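A common formulation of such a loss (not necessarily the exact one used here) is a pairwise logistic loss: the reward model is penalized whenever a lower-ranked response scores higher than the preferred one. Below is a minimal PyTorch sketch, with a toy `RewardModel` standing in for an LLM-based scorer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for an LLM-based scorer: maps a (prompt, response)
    embedding to a single scalar reward."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, pair_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pair_embedding).squeeze(-1)

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    # -log sigmoid(r_chosen - r_rejected): low when the chosen response
    # out-scores the rejected one, i.e. when the model agrees with raters.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training step on random "embeddings" of chosen/rejected pairs.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
chosen_emb, rejected_emb = torch.randn(8, 128), torch.randn(8, 128)
loss = pairwise_ranking_loss(model(chosen_emb), model(rejected_emb))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The sigmoid of the score difference is the Bradley-Terry probability that the chosen response beats the rejected one, so minimizing this loss pushes the model’s scores toward the human ranking.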
2. Reinforcement learning
Once a reward model exists, the quality of any <prompt, response> pair can be scored. This step requires the “prompt dataset,” which is unlabeled and contains only prompts. Google selects a prompt from the dataset, generates a response with the LLM, and uses the reward model to evaluate the response’s quality. If the response scores highly, all of its tokens (conditioned on the prompt) are “reinforced,” that is, given a higher probability of being generated in the future. In this way the LLM is optimized to produce responses that maximize the reward. This algorithm is called reinforcement learning (RL).
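The sketch below shows this reinforcement step in its simplest form: a REINFORCE-style policy-gradient update on a toy categorical policy. Production RLHF systems typically use PPO with a KL penalty against the original model, which is omitted here; the `policy` network and `reward_fn` are toy stand-ins, not the actual tuning components.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 64

# Toy "policy LLM": produces logits over a small vocabulary given a prompt embedding.
policy = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def rl_step(prompt_embedding: torch.Tensor, reward_fn, response_len: int = 16):
    """One REINFORCE update: sample a response, score it, reinforce its tokens."""
    logits = policy(prompt_embedding)                      # [vocab_size]
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample((response_len,))                  # sampled "response"
    log_probs = dist.log_prob(tokens)                      # log-prob of each token

    reward = reward_fn(tokens)                             # scalar score from the reward model
    # Policy-gradient objective: maximize reward-weighted log-likelihood,
    # i.e. minimize its negative. High reward -> sampled tokens reinforced.
    loss = -(reward * log_probs.sum())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward

# Toy reward: prefer responses that use many distinct tokens.
rl_step(torch.randn(hidden), reward_fn=lambda toks: toks.unique().numel() / 16.0)
```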
RLHF tuning requires coordinating these two stages, managing large-scale distributed training on multi-host TPUs or GPUs through model partitioning and data parallelism, and maximizing throughput through computational graph compilation. The intensive computation also requires powerful hardware accelerators to enable fast training. Customers of Vertex AI can tune PaLM 2, FLAN-T5, and Llama 2 models with RLHF by using a Vertex AI Pipeline that encapsulates the RLHF algorithm. This helps align the LLM with the enterprise’s nuanced preferences and values in particular use cases.
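A minimal sketch of launching such a pipeline is shown below, assuming the preview RLHF pipeline components and parameter names available at the time of writing; the module path, parameter names, model reference, and GCS paths are assumptions or placeholders and should be checked against the current Vertex AI documentation.

```python
# Sketch: compile the preview RLHF pipeline template and run it as a
# Vertex AI PipelineJob. Paths and parameter names are placeholders.
from google.cloud import aiplatform
from google_cloud_pipeline_components.preview.llm import rlhf_pipeline
from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=rlhf_pipeline,
    package_path="rlhf_pipeline.yaml",
)

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")

job = aiplatform.PipelineJob(
    display_name="rlhf-tuning",
    template_path="rlhf_pipeline.yaml",
    pipeline_root="gs://my-bucket/pipeline_root",
    parameter_values={
        "prompt_dataset": "gs://my-bucket/prompts.jsonl",          # unlabeled prompts
        "preference_dataset": "gs://my-bucket/preferences.jsonl",  # human preference pairs
        "large_model_reference": "text-bison@001",                 # base model to tune
        "reward_model_train_steps": 500,
        "reinforcement_learning_train_steps": 500,
    },
)
job.run()
```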
RLHF with Vertex AI
Google now provides a Vertex AI Pipeline template that encapsulates the RLHF algorithm. Because RLHF is integrated into Vertex AI’s Generative AI Studio, users can easily take advantage of the newest AI innovations and enterprise security features such as VPC Service Controls (VPC-SC). Users can also combine RLHF with Vertex AI MLOps features such as Model Registry and Model Monitoring. Organizations can benefit from RLHF and Vertex AI in the following ways:
- Performance: Enhance LLM performance to better match user preferences.
- Models: Access to cutting-edge, Google-only models.
- Speed: Tune with the latest accelerators, such as Cloud TPUs and A100 GPUs.
- Safety: By offering negative sample responses, RLHF can make LLMs safer.
Recruit Group
Recruit Group is a leader in HR technology and business solutions that are transforming the workplace. Its HR Technology Strategic Business Unit, one of the company’s business pillars, focuses on matching job seekers with opportunities and providing tools for the job search process worldwide. In Japan, Recruit Co., Ltd. offers career counseling, interview practice, and a job search platform, and uses AI to improve communication between employers and job seekers and to streamline the hiring process.
General-purpose foundation models have emerged recently, but how to apply them to specific tasks is often unclear. Improving a job seeker’s resume, for example, requires proofreading it and applying deep industry knowledge about job types, companies, and hiring practices. Because foundation models are so general, they can struggle to generate useful suggestions or comments for resume improvement. Such tasks require control over the output format and closer alignment of the model’s output with human preferences.
Recruit Co., Ltd. evaluated two models: one tuned with RLHF and one used as an out-of-the-box foundation model. The experiment investigated whether models fine-tuned with HR domain knowledge can improve resume writing as a text generation task. Human resources experts assessed performance by reviewing each generated resume and judging whether it met the production-level quality bar. The success metric is the percentage of generated resumes that meet the quality standard.
The results show that RLHF tuning with customer data can improve model performance and lead to better outcomes. To weigh the advantages and disadvantages of automation, Recruit Group next plans to compare content created by professional writers with content generated by AI.
| Model | Dataset size | Score |
| --- | --- | --- |
| Foundation Model (text-bison-001) | – | 70% |
| Supervised Tuning (text-bison-001) | Prompt dataset: 4,000 | 76% |
| RLHF (text-bison-001) | Prompt dataset: 1,500; Human preference dataset: 4,000 | 87% |
What comes next?
See the documentation for more information, including resources that demonstrate how to use RLHF with Vertex AI. For an introduction to RLHF, you can also consult the notebook.