Amazon today debuted AWS Trainium, a chip custom-designed to deliver what the company describes as cost-effective machine learning model training in the cloud. It arrives ahead of the availability of new Amazon Elastic Compute Cloud (EC2) instances built specifically for machine learning training and powered by Intel's new Habana Gaudi processors.
“We know that we want to keep pushing the price performance on machine learning training, so we’re going to have to invest in our own chips,” AWS CEO Andy Jassy said during a keynote address at Amazon’s re:Invent conference this morning. “You have an unmatched array of instances in AWS, coupled with innovation in chips.”
Amazon claims that Trainium will offer the most teraflops of any machine learning instance in the cloud, where a teraflop equates to one trillion floating-point calculations per second. When it becomes available to customers in the second half of 2021 as EC2 instances and in SageMaker, Amazon's fully managed machine learning development platform, it'll support popular frameworks including Google's TensorFlow, Facebook's PyTorch, and Apache MXNet. Moreover, Amazon says it'll use the same Neuron SDK as Inferentia, the company's cloud-hosted machine learning chip for inference.
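Amazon has not yet published Trainium code samples, but because the chip shares the Neuron SDK with Inferentia and supports standard frameworks, training would presumably look like an ordinary framework-level loop that the Neuron compiler maps onto the hardware. The sketch below is a plain, CPU-runnable PyTorch loop; the comment marking where Neuron device placement might occur is an assumption, not a documented Trainium API.

```python
# Minimal PyTorch training loop of the kind the Neuron SDK would compile
# for Trainium. Everything here is standard, CPU-runnable PyTorch; the
# device-placement step is where Trainium support would plausibly plug in
# (assumption -- Amazon has not published Trainium code samples).
import torch
import torch.nn as nn

# Hypothetical placeholder: with Trainium support, this would presumably
# name a Neuron-backed device rather than the CPU.
device = torch.device("cpu")

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Synthetic batch: 32 samples, 64 features, 10 classes.
inputs = torch.randn(32, 64, device=device)
labels = torch.randint(0, 10, (32,), device=device)

for step in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()   # backpropagation, the compute-heavy phase training chips target
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```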
Absent benchmark results, it's unclear how Trainium's performance might compare with Google's tensor processing units (TPUs), the search giant's own chips for AI training workloads, available through Google Cloud Platform. Google says its forthcoming fourth-generation TPU offers more than double the matrix multiplication teraflops of a third-generation TPU. (Matrices are often used to represent the data that feeds into AI models.) It also offers a "significant" boost in memory bandwidth while benefiting from unspecified advances in interconnect technology.
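To ground the teraflop comparison: multiplying an m×n matrix by an n×p matrix takes roughly 2·m·n·p floating-point operations (one multiply and one add per output term), which is why matrix multiplication throughput is the headline figure for training chips. The back-of-the-envelope calculation below is illustrative only; the 100-teraflop figure is an assumption, not a published Trainium or TPU specification.

```python
# Back-of-the-envelope: how long one large matrix multiplication takes at a
# given sustained throughput. The 100-teraflop figure is illustrative only,
# not a published Trainium or TPU number.
m, n, p = 8192, 8192, 8192          # operand dimensions
flops = 2 * m * n * p               # one multiply + one add per output term
sustained_tflops = 100              # assumed sustained throughput, in teraflops

seconds = flops / (sustained_tflops * 1e12)
print(f"{flops / 1e12:.1f} teraflops of work -> {seconds * 1e3:.2f} ms")
# About 1.1 teraflops of work, so roughly 11 ms at a sustained 100 TFLOP/s.
```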