Amazon Web Services (AWS) has made a new compute instance featuring its machine learning chips generally available.

The company this week announced the general availability of Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances, powered by the company’s Trainium chips.

[Image: AWS Trainium instance – AWS]

First announced in December 2020, Trainium chips are purpose-built for ‘high-performance ML training applications in the cloud’.

A preview of Amazon EC2 Trn1 instances was announced at AWS re:Invent 2021. Amazon claims the new instance offers up to 50 percent cost savings over comparable GPU-based EC2 instances.

“You can use EC2 Trn1 instances to train natural language processing (NLP), computer vision, and recommender models across a broad set of applications, such as speech recognition, recommendation, fraud detection, image and video classification, and forecasting,” the company said.

“Over the years we have seen machine learning go from a niche technology used by the largest enterprises to a core part of many of our customers' businesses, and we expect machine learning training will rapidly make up a large portion of their compute needs,” said David Brown, vice president of Amazon EC2 at AWS. “Building on the success of AWS Inferentia, our high-performance machine learning chip, AWS Trainium is our second-generation machine learning chip purpose-built for high-performance training. Trn1 instances powered by AWS Trainium will help our customers reduce their training time from months to days while being more cost-efficient.”

Trn1 instances feature up to 16 AWS Trainium chips; Trainium is the company's second-generation machine learning chip, following the inference-focused AWS Inferentia.

Trn1 instances are the first EC2 instances to offer up to 800 Gbps of Elastic Fabric Adapter (EFA) network bandwidth. Each instance has 512 GB of high-bandwidth memory, delivers up to 3.4 petaflops of FP16/BF16 compute, and features a high-bandwidth, nonblocking NeuronLink interconnect between its Trainium chips.
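For a sense of how developers typically target that FP16/BF16 compute, here is a minimal sketch of a BF16 training step written against torch-xla, the framework integration the AWS Neuron SDK builds on. The single-device workflow, model, and batch below are illustrative assumptions; the Neuron documentation is authoritative.

```python
# Hedged sketch: one BF16 training step on an XLA device, as exposed by
# the Neuron SDK's torch-xla integration on Trainium hardware.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # maps to a Trainium NeuronCore under the Neuron SDK

# Tiny model and dummy batch in BF16, the datatype quoted in the specs above
model = torch.nn.Linear(512, 512).to(device).to(torch.bfloat16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 512, dtype=torch.bfloat16, device=device)
loss = model(x).pow(2).mean()  # placeholder objective
loss.backward()
xm.optimizer_step(optimizer)   # applies the step and syncs the lazy XLA graph
print(loss.item())             # forces graph execution and prints the loss
```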

Trn1 instances are deployed in EC2 UltraClusters that can scale to 30,000 Trainium accelerators, the equivalent of a supercomputer delivering 6.3 exaflops of compute.
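That headline number is consistent with the per-instance specification above; a quick back-of-envelope check, assuming the 3.4-petaflops figure applies to a full 16-chip instance:

```python
# Back-of-envelope check of the UltraCluster figure (a sketch; assumes the
# 3.4 PFLOPS spec covers a full 16-chip trn1.32xlarge instance).
PFLOPS_PER_INSTANCE = 3.4        # FP16/BF16, per 16-chip Trn1 instance
CHIPS_PER_INSTANCE = 16
ACCELERATORS = 30_000            # maximum Trainium chips per UltraCluster

pflops_per_chip = PFLOPS_PER_INSTANCE / CHIPS_PER_INSTANCE  # 0.2125 PFLOPS
total_exaflops = ACCELERATORS * pflops_per_chip / 1_000     # PFLOPS -> EFLOPS
print(f"{total_exaflops:.2f} EFLOPS")  # ~6.38, matching the ~6.3 EF claim
```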

Last year, YellowDog created a distributed supercomputer on AWS, pulling together 3.2 million vCPUs (virtual CPUs) for seven hours to analyze and screen 337 potential medical compounds for OMass Therapeutics. The effort earned the temporary machine the 136th spot on the Top500 list, with a performance of 1.93 petaflops.

Amazon EC2 Trn1 instances are available in two sizes: trn1.2xlarge, for experimenting with a single accelerator and training small models cost-effectively, and trn1.32xlarge for training large-scale models.
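As an illustration of how one of these sizes might be provisioned, here is a minimal boto3 sketch. The AMI ID and key pair name are placeholders; in practice you would choose a Neuron-compatible Deep Learning AMI.

```python
# Minimal sketch: launching a Trn1 instance with boto3.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # Trn1 regions: us-east-1, us-west-2

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder; use a Neuron Deep Learning AMI
    InstanceType="trn1.32xlarge",     # or trn1.2xlarge for single-chip experiments
    KeyName="my-key-pair",            # placeholder key pair name
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```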

However, the instances are currently available only in US East (N. Virginia) and US West (Oregon), with additional AWS Regions to follow.

“We are training large language models that are multi-modal, multilingual, multi-locale, pre-trained on multiple tasks, and span multiple entities (products, queries, brands, reviews, etc.) to improve the customer shopping experience,” said Trishul Chilimbi, senior principal scientist at Amazon Search. “Amazon EC2 Trn1 instances provide a more sustainable way to train large language models by delivering the best performance/watt compared to other accelerated machine learning solutions and offers us high performance at the lowest cost. We plan to explore the new configurable FP8 datatype and hardware accelerated stochastic rounding to further increase our training efficiency and development velocity.”
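Stochastic rounding, which Chilimbi mentions, is a general low-precision training technique: values are rounded up or down at random, weighted by proximity to each neighbor, so the rounding error is zero in expectation. A minimal NumPy sketch of the idea (not Trainium's hardware implementation):

```python
# Sketch of stochastic rounding: round up with probability equal to the
# fractional part, so the expected value of the rounded number equals the
# original. This illustrates the technique only, not Trainium's hardware.
import numpy as np

def stochastic_round(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    floor = np.floor(x)
    frac = x - floor                              # distance above the floor
    return floor + (rng.random(x.shape) < frac)   # round up with prob. frac

rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
print(stochastic_round(x, rng).mean())  # ~0.3, versus 0.0 for np.round(0.3)
```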

“We launched a large-scale AI chatbot service on the Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. As we keep fine-tuning tailored natural language processing models periodically, reducing model training times and costs is also important,” added Takuya Nakade, CTO at Money Forward. “Based on our experience from successful migration of inference workload on Inf1 instances and our initial work on AWS Trainium-based EC2 Trn1 instances, we expect Trn1 instances will provide additional value in improving end-to-end machine learning performance and cost.”
