AMD Introduces First SLM, AMD-135M

AMD has released its first small language model, AMD-135M. In the rapidly evolving field of artificial intelligence, large language models (LLMs) such as GPT-4 and Llama have drawn significant attention for their abilities in natural language processing and generation. At the same time, small language models (SLMs) are gaining prominence and offer distinct advantages in certain use cases.

With AMD-135M, its first small language model with speculative decoding, AMD demonstrates its commitment to an open approach to AI, one that encourages more creative, ethical, and inclusive technology development and helps ensure that AI's benefits are shared more widely and its challenges addressed more constructively.

The AMD-135M Models

AMD-135M is the first AMD small language model.

AMD-135M belongs to the Llama model family and comes in two versions: AMD-Llama-135M and AMD-Llama-135M-code. The base model was trained from scratch on AMD Instinct MI250 accelerators using 670B tokens.

Pretraining: AMD-Llama-135M was trained from scratch on 670 billion tokens of general data. Pretraining took six full days on four MI250 nodes, each equipped with four MI250 accelerators (eight virtual GPU cards per node, with 64 GB of memory each).
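For readers who want to try the base model, here is a minimal sketch that loads the released checkpoint with the Hugging Face transformers library and generates a short completion. The hub ID amd/AMD-Llama-135M reflects AMD's open-source release but is an assumption here; adjust it if the hosting location differs.

```python
# A minimal sketch of loading the released checkpoint with Hugging Face
# transformers. The hub ID "amd/AMD-Llama-135M" is assumed from AMD's
# open-source release; adjust it if your mirror differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amd/AMD-Llama-135M")
model = AutoModelForCausalLM.from_pretrained("amd/AMD-Llama-135M")

inputs = tokenizer("Small language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```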

Code Finetuning: AMD-Llama-135M-code was finetuned on an additional 20 billion tokens of code data to improve accuracy and enable a code-focused mode. The finetuning took four full days on four MI250 accelerators.

Code Dataset: AMD used the Python portion of the StarCoder dataset to finetune the 135M pretrained model. The StarCoder dataset comprises 783 GB of code across 86 programming languages, drawn from GitHub code, GitHub Issues, Jupyter notebooks, and GitHub commits, totaling over 250B tokens. AMD focused specifically on the Python subset.
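To illustrate how such a Python-only subset can be pulled, the sketch below uses the Hugging Face datasets library. The dataset ID bigcode/starcoderdata and the "python" data_dir are assumptions based on the public StarCoder release; the dataset is gated, so its terms must be accepted on the Hub first.

```python
# A sketch of streaming the Python subset of the StarCoder training data.
# The dataset ID "bigcode/starcoderdata" and the "python" data_dir are
# assumptions based on the public StarCoder release.
from datasets import load_dataset

# Stream to avoid downloading the full multi-hundred-GB corpus at once.
python_subset = load_dataset(
    "bigcode/starcoderdata",
    data_dir="python",
    split="train",
    streaming=True,
)

for example in python_subset.take(3):
    print(example["content"][:200])  # "content" holds the raw source text
```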

The training code, dataset, and weights for this model are released as open source so that developers can reproduce the model and build on it to train further SLMs and LLMs.

Optimization via Speculative Decoding

Large language models typically use an autoregressive approach for inference. The main drawback of this approach is that each forward pass generates only one token, which limits memory-access efficiency and overall inference speed. The loop sketched below illustrates this pattern.
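The following is a minimal sketch of autoregressive greedy decoding, reusing the model and tokenizer from the earlier example: every new token costs one full forward pass through the model (past key/value caching is omitted for clarity).

```python
# A minimal sketch of standard autoregressive decoding: each forward pass
# yields exactly one new token, so a 32-token completion costs 32 passes
# through the full model. Reuses model/tokenizer from the example above.
import torch

prompt = tokenizer("def fibonacci(n):", return_tensors="pt")
ids = prompt["input_ids"]

with torch.no_grad():
    for _ in range(32):                      # one model call per token
        logits = model(ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_token], dim=-1)

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```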

Speculative decoding addresses this problem. The core idea is to use a small draft model to propose a set of candidate tokens, which a larger target model then verifies. Because many tokens are produced per forward pass of the target model, this approach substantially reduces memory-access requirements and delivers significant speedups without sacrificing output quality.
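One way to experiment with this technique is through the transformers assisted-generation API: passing a small assistant_model to generate() lets the draft model propose tokens that the target verifies. The hub IDs below are assumptions; the two models must share a tokenizer vocabulary, which the Llama family satisfies.

```python
# A sketch of speculative decoding via transformers' assisted generation,
# pairing AMD-Llama-135M-code as the draft with CodeLlama-7b as the target.
# Both hub IDs are assumptions based on the public releases.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
target = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
draft = AutoModelForCausalLM.from_pretrained("amd/AMD-Llama-135M-code")

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
outputs = target.generate(
    **inputs,
    assistant_model=draft,   # draft proposes, target verifies per pass
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```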

Performance of Inference Acceleration

Using AMD-Llama-135M-code as a draft model for CodeLlama-7b, AMD assessed inference performance with and without speculative decoding on the Instinct MI250 accelerator for data centers and on the Ryzen AI CPU (with NPU) for AI PCs. Under the same settings, with AMD-Llama-135M-code as the draft model, AMD observed speedups on the Instinct MI250 accelerator, the Ryzen AI CPU, and the Ryzen AI NPU compared with inference without speculative decoding. The AMD-135M SLM thus offers an end-to-end workflow, covering both training and inference, on select AMD platforms.
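A rough way to reproduce this kind of comparison on your own hardware is to time generation with and without the draft model, as in the sketch below, which reuses the target/draft pair from the previous example. Wall-clock numbers from such a loop are only indicative; AMD's reported speedups come from controlled runs on MI250 and Ryzen AI hardware.

```python
# A rough timing sketch comparing plain decoding with speculative decoding,
# reusing target, draft, and inputs from the previous example. Wall-clock
# timing like this is only indicative of relative throughput.
import time

def timed_generate(assistant=None):
    start = time.perf_counter()
    out = target.generate(
        **inputs,
        assistant_model=assistant,
        max_new_tokens=128,
        do_sample=False,
    )
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
    return new_tokens, elapsed

baseline_tokens, baseline_s = timed_generate()
spec_tokens, spec_s = timed_generate(assistant=draft)
print(f"baseline:    {baseline_tokens / baseline_s:.1f} tok/s")
print(f"speculative: {spec_tokens / spec_s:.1f} tok/s")
```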

Conclusion

Using AMD GPU accelerators and Ryzen AI processors, the AMD-135M SLM provides a complete workflow that encompasses training and inference. The model serves as a reference implementation that follows best practices for model construction, pretraining, and deployment on AMD platforms, helping ensure developer usability and optimal performance both in the data center and on power-constrained edge devices such as AI PCs. AMD is committed to releasing new models to the open-source community and looks forward to the ideas that emerge from this collective effort.

 
