AMD OLMo 1B Language Models' Benchmark Performance

AMD OLMo is the first AMD 1B language model to be released.

Overview

The rapid advancement of artificial intelligence technologies, particularly large language models (LLMs), has been the subject of recent discussions. These language models, which range from ChatGPT to GPT-4 and Llama, have demonstrated exceptional proficiency in natural language generation, processing, interpretation, and reasoning. We are excited to introduce AMD OLMo, the first collection of fully open 1 billion parameter language models, in keeping with AMD's tradition of sharing code and models to promote community advancement.

The Reasons for Creating Your Own Language Models

Pre-training and fine-tuning your own LLM on domain-specific knowledge can help align it more closely with specific use cases. With this approach, companies can tailor the model's architecture and training process to their unique requirements, striking a balance between scalability and specialization that off-the-shelf models may not provide. As demand for customized AI solutions continues to grow, the ability to pre-train LLMs opens up unprecedented opportunities for innovation and product differentiation across industries.

The AMD OLMo series of in-house trained language models (LMs) consists of 1 billion parameter LMs trained from scratch on 1.3 trillion tokens using a cluster of AMD Instinct MI250 GPUs. In line with its goal of advancing accessible AI research, AMD has open-sourced all of its training data and released the checkpoints for the first series of AMD OLMo models.

This project enables a large community of researchers, developers, and users to study, use, and train state-of-the-art large language models. By showcasing the capabilities of AMD Instinct GPUs in demanding AI workloads, AMD aims to demonstrate that it can run large-scale multi-node LM training jobs with trillions of tokens and achieve better reasoning and instruction-following performance than other fully open LMs of comparable size.

Additionally, the community can run these models on AMD Ryzen AI PCs with Neural Processing Units (NPUs) using the AMD Ryzen AI Software, which enables easier local access without privacy concerns, efficient AI inference, and lower power consumption.

The AMD OLMo Language Models Are Unveiled

AMD OLMo is a series of 1 billion parameter language models pre-trained on 1.3 trillion tokens using 16 nodes, each with four (4) AMD Instinct MI250 GPUs. Three (3) checkpoints corresponding to the different training stages are being released, along with detailed reproduction instructions:

AMD OLMo 1B: Pre-trained on a subset of Dolma v1.7 consisting of 1.3 trillion tokens.
AMD OLMo 1B SFT: Supervised fine-tuned (SFT) on the Tulu V2 dataset in the first phase, then on the OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets in the second phase.
AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on the UltraFeedback dataset.
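
As a rough illustration of the scale described above, the following sketch works out the GPU count and the average number of pre-training tokens per GPU. It assumes that each of the 16 nodes hosts four MI250 GPUs; the variable names are illustrative and not part of any AMD tooling.

```python
# Back-of-the-envelope cluster arithmetic (illustrative names, not AMD tooling).
NODES = 16                 # number of nodes, from the text
GPUS_PER_NODE = 4          # assumption: four MI250 GPUs in every node
PRETRAIN_TOKENS = 1.3e12   # 1.3 trillion pre-training tokens, from the text

total_gpus = NODES * GPUS_PER_NODE
tokens_per_gpu = PRETRAIN_TOKENS / total_gpus

print(f"GPUs in the cluster: {total_gpus}")
print(f"Average tokens processed per GPU: {tokens_per_gpu:.3e}")
```

The per-GPU figure is only an average over the whole run; it says nothing about throughput or training time, which the text does not report.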

The model architecture and training setup of the fully open source 1 billion version of OLMo serve as the foundation for AMD OLMo 1B, with a few notable exceptions. We pre-train with less than half of the tokens used for OLMo-1B (effectively halving the compute budget while maintaining comparable performance) and post-train with a two-phase SFT and DPO alignment (OLMo-1B does not carry out any post-training steps) to enhance performance in general reasoning, instruction-following, and chat capabilities.

For the two-phase SFT, we create a data mix of diverse, high-quality, publicly available instruction datasets. Overall, this training recipe produces a series of models that outperform other similar fully open-source models trained on publicly available data across a variety of benchmarks.

AMD OLMo

The AMD OLMo models are decoder-only transformer language models trained with next-token prediction. The model card provides the core model architecture and training hyperparameter details.

Recipe for Data and Training

We trained the AMD OLMo series of models in three stages, as shown in Figure 1.

Stage 1: Pre-training

The pre-training stage trained the model on a large corpus of general-purpose text so that it learns language structure and acquires broad world knowledge through next-token prediction. We used a subset of 1.3 trillion tokens from the publicly available Dolma v1.7 dataset.
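
To make the next-token prediction objective concrete, here is a minimal pure-Python sketch of the loss it minimizes: the average cross-entropy between the model's scores at each position and the token that actually comes next. The function name and the toy inputs are illustrative assumptions; real training code computes the same quantity on batched tensors.

```python
import math

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits[t] holds unnormalized scores over the vocabulary at position t;
    token_ids is the observed sequence. Position t is scored against
    token_ids[t + 1], so a sequence of n tokens yields n - 1 predictions.
    """
    steps = len(token_ids) - 1
    if steps < 1:
        raise ValueError("need at least two tokens to form one prediction")
    total = 0.0
    for t in range(steps):
        scores = logits[t]
        peak = max(scores)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[token_ids[t + 1]]  # -log softmax(target)
    return total / steps

# Toy example: vocabulary of 3 tokens, sequence of 3 tokens (2 predictions).
toy_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
toy_tokens = [1, 2, 0]
print(next_token_loss(toy_logits, toy_tokens))
```

A model that assigns uniform scores pays log(vocabulary size) per token, and the loss falls as more probability mass moves onto the token that actually follows.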

Stage 2: Supervised Fine-tuning (SFT)

The pre-trained model was then fine-tuned on instruction datasets so that it can follow instructions. This stage is divided into two phases:

Phase 1: The model is first fine-tuned on the Tulu V2 dataset, a publicly available, high-quality instruction dataset of 0.66 billion tokens.

Phase 2: To substantially improve its instruction-following capabilities, the model is then fine-tuned on OpenHermes-2.5, a relatively larger instruction dataset. The Code-Feedback and WebInstructSub datasets are also used in this phase to improve the model's performance in coding, science, and mathematical problem solving. Together, these datasets contain about 7 billion tokens.

We ran several fine-tuning experiments with different dataset orderings across the two phases and found the sequence above to be the most effective. Phase 1 uses a relatively small but high-quality dataset to establish a strong foundation; Phase 2 then uses a larger and more diverse dataset mix to further improve the model's capabilities.
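
The relative sizes of the data used so far can be checked with a short calculation. This is a sketch that uses only the token counts quoted in the text and treats the second-phase figure as a round 7 billion; it counts a single pass over each dataset, since the number of epochs is not stated here.

```python
# Token counts quoted in the text (single pass; epoch counts are not given).
PRETRAIN_TOKENS = 1.3e12     # Dolma v1.7 subset
SFT_PHASE1_TOKENS = 0.66e9   # Tulu V2
SFT_PHASE2_TOKENS = 7e9      # OpenHermes-2.5 + Code-Feedback + WebInstructSub (approximate)

sft_total = SFT_PHASE1_TOKENS + SFT_PHASE2_TOKENS
phase_ratio = SFT_PHASE2_TOKENS / SFT_PHASE1_TOKENS
sft_share = sft_total / PRETRAIN_TOKENS

print(f"Total SFT tokens: {sft_total:.2e}")
print(f"Second phase is about {phase_ratio:.1f}x the first phase")
print(f"SFT data is about {sft_share:.2%} of the pre-training tokens")
```

The second-phase mix is roughly ten times larger than the first-phase dataset, while all of the SFT data together is well under 1% of the pre-training corpus.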

Stage 3: Alignment

Finally, we further tune the SFT model with Direct Preference Optimization (DPO) on the UltraFeedback dataset, a large-scale, fine-grained, and diverse preference dataset. This improves model alignment and yields outputs that are consistent with human values and preferences.
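
For readers unfamiliar with DPO, the sketch below shows the standard per-example DPO loss for one preference pair: the policy is rewarded for raising the log-probability of the preferred response, relative to a frozen reference model, by more than it raises that of the rejected response. The function name, argument names, and the value beta=0.1 are illustrative assumptions; the text does not give the hyperparameters used for AMD OLMo.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference (here, the SFT model).
    beta=0.1 is an assumed value, not taken from the article.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Favoring the chosen response more than the reference does lowers the loss.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```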

Findings

We compare AMD OLMo models with other fully open-source models of similar size that have publicly released their data, model weights, and training code. The pre-trained baseline models used for comparison are TinyLLaMA-v1.1 (1.1B), MobiLLaMA-1B (1.2B), OLMo-1B-hf (1.2B), OLMo-1B-0724-hf (1.2B), and OpenELM-1_1B (1.1B).

We compare the pre-trained models on a range of established benchmarks for general reasoning ability, using the Language Model Evaluation Harness to evaluate common sense reasoning, multi-task understanding, and responsible AI benchmarks. Of the 11 benchmarks, GSM8k is evaluated in an 8-shot setting, BBH in a 3-shot setting, and the rest in a zero-shot setting.
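
The few-shot settings above can be captured in a small lookup, sketched here together with a helper that assembles a Language Model Evaluation Harness command line. The task identifiers (gsm8k, bbh, arc_easy, arc_challenge, sciq), the checkpoint path, and the exact flags are assumptions based on the harness's public command-line interface; the text does not give the commands the authors actually ran, and only the benchmarks it names are listed.

```python
# Few-shot settings from the text: GSM8k 8-shot, BBH 3-shot, everything else zero-shot.
FEW_SHOT_OVERRIDES = {"gsm8k": 8, "bbh": 3}

def shots_for(task):
    """Return the number of in-context examples used for a benchmark."""
    return FEW_SHOT_OVERRIDES.get(task, 0)

def harness_args(model_path, task):
    """Build an lm_eval argument list (assumed flags; model_path is a placeholder)."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", task,
        "--num_fewshot", str(shots_for(task)),
    ]

# Hypothetical checkpoint path; substitute a real local path or model identifier.
for task in ["arc_easy", "arc_challenge", "sciq", "gsm8k", "bbh"]:
    print(" ".join(harness_args("path/to/checkpoint", task)))
```

Keeping the shot counts in one table makes it harder to compare models under mismatched settings by accident.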

For AMD OLMo 1B:

The average score across all general reasoning tasks (48.77%) is comparable to that of the latest OLMo-1B-0724-hf model (49.3%), achieved with less than half of its pre-training compute budget, and is better than that of all other baseline models.
Accuracy improves over the next-best models on the ARC-Easy (+6.36%), ARC-Challenge (+1.02%), and SciQ (+0.50%) benchmarks.

To evaluate chat capabilities, we used the instruction-tuned chat counterparts of the pre-trained baselines: TinyLlama-1.1B-Chat-v1.0, MobiLlama-1B-Chat, and OpenELM-1_1B-Instruct. We used the Language Model Evaluation Harness to evaluate common sense reasoning, multi-task understanding, and responsible AI benchmarks, Alpaca Eval to evaluate instruction-following capabilities, and MT-Bench to evaluate multi-turn chat capabilities.

Comparing the fine-tuned and aligned models with the other instruction-tuned baselines:

Two-phase SFT raised model accuracy over the pre-trained checkpoint on almost all benchmarks on average, including MMLU by +5.09% and GSM8k by +15.32%.
AMD OLMo 1B SFT performance on GSM8k (18.2%) is significantly better (+15.39%) than that of the next-best baseline model (TinyLlama-1.1B-Chat-v1.0 at 2.81%).
The SFT model's average accuracy across standard benchmarks is at least +2.65% better than that of the baseline chat models. Alignment (DPO) improves it by a further +0.46%.
The SFT model also exceeds the next-best model on the chat benchmarks AlpacaEval 2 (+2.29%) and MT-Bench (+0.97%).
Alignment training enables the AMD OLMo 1B SFT DPO model to perform on par with other chat baselines on responsible AI evaluation benchmarks.
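
The gains written as "+x%" in this section appear to be differences in percentage points rather than relative improvements. A quick check using only the scores quoted in the text supports that reading; the helper name is illustrative.

```python
def point_gain(score, baseline):
    """Difference between two percentage scores, in percentage points."""
    return round(score - baseline, 2)

# GSM8k: AMD OLMo 1B SFT (18.2%) vs. TinyLlama-1.1B-Chat-v1.0 (2.81%)
gsm8k_gain = point_gain(18.2, 2.81)

# Average general reasoning: OLMo-1B-0724-hf (49.3%) vs. AMD OLMo 1B (48.77%)
avg_gap = point_gain(49.3, 48.77)

print(f"GSM8k gain over next-best baseline: +{gsm8k_gain} points")
print(f"Gap to OLMo-1B-0724-hf on average score: {avg_gap} points")
```

The GSM8k difference reproduces the +15.39 reported above, and the pre-trained model trails OLMo-1B-0724-hf by about half a point on the average general reasoning score.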

AMD OLMo models can also be used for inference on AMD Ryzen AI PCs with Neural Processing Units (NPUs). AMD Ryzen AI Software makes it simple for developers to run Generative AI models locally. By improving energy efficiency, ensuring data privacy, and allowing a variety of AI applications, local deployment of such models on edge devices offers a secure and sustainable option.

Conclusion

Using an end-to-end training pipeline that runs on AMD Instinct GPUs and consists of a pre-training stage with 1.3 trillion tokens (half the pre-training compute budget of OLMo-1B), a two-phase supervised fine-tuning stage, and a DPO-based human preference alignment stage, AMD OLMo models are comparable to or better than other fully open models of similar size in general reasoning and chat capabilities, while performing on par on responsible AI benchmarks.

Additionally, the language models were deployed on AMD Ryzen AI PCs with NPUs, which can help enable a wide range of edge use cases. The main goal of making the data, weights, training recipes, and code publicly available is to help developers reproduce the results and innovate further. AMD remains committed to providing the open-source community with a steady stream of new AI models and looks forward to the innovations that will emerge from this collaborative work.