On Wednesday, MLCommons, the industry consortium that oversees MLPerf, a popular test of machine learning performance, released its latest benchmark report, showing new adherents including computer makers ASUS and H3C, and ZhejiangLab, a research institute formed by the Zhejiang provincial government in China, Zhejiang University, and Chinese retail and AI giant Alibaba.
Those parties join frequent submitters Nvidia, Qualcomm, Dell, and Microsoft.
The MLCommons’s executive director, David Kanter, lauded the record number of submissions, over 3,900. Those results span a wide range of computing, from data centers down to what is known as “TinyML,” running on devices such as embedded microchips that sip fractions of a watt of power.
“This is a huge dynamic range,” said Kanter. The fastest system on the ResNet-50 benchmark is a million times faster than the slowest, he noted. “It’s hard to operate over a wide performance range, but that’s actually something we’ve done very well.”
For example, the test for inference in cloud data centers, which draws the bulk of submissions, this time reported 926 distinct test results across 84 systems from 14 submitters. That is up from 754 reported test results across 67 systems from 13 submitters in the September version of the benchmark.
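To put those figures in proportion, the growth from one report to the next can be worked out as a simple percent change. The short sketch below uses only the counts quoted above; the function and variable names are illustrative and are not part of any MLCommons tooling.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100.0


# Data center inference counts quoted in the article:
# (September report, latest report)
counts = {
    "results": (754, 926),
    "systems": (67, 84),
    "submitters": (13, 14),
}

# Percent growth per category, rounded to one decimal place.
growth = {
    name: round(percent_change(old, new), 1)
    for name, (old, new) in counts.items()
}

for name, pct in growth.items():
    print(f"{name}: +{pct}%")
```

By this arithmetic, reported results grew by roughly 23% and systems by roughly 25%, while the number of submitters rose by one.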
Various companies that participate in the four-year-old effort may not show up from one report to the next. For example, Intel and Hewlett Packard Enterprise, which both had multiple submissions to report in September, were absent from the latest report.
Across different benchmark scores, said the MLCommons, results show as much as a 3.3-times speed-up for computers running neural network tasks such as natural language processing, image recognition, and speech recognition.
A highlight of the report this time around is that more vendors submitted more results to measure the power consumption of their computer systems on AI tasks. As ZDNet reported in September, the number of submissions for power consumption had plummeted to just 350 submissions from 864 in the April report.
This time around, there were 576 reported results for inference in cloud data centers and in cloud “edge” devices, across 30 different systems. There were another 3,948 power measurements reported by Krai, the stealth-mode AI startup that always submits a large number of test results in the category “Open Edge,” where submitters are free to use non-standard neural network approaches.
Krai reported many more combinations of chips this time, whereas before it had reported only Nvidia’s Jetson AGX Xavier accelerator. This time, Krai reported results for dozens of Raspberry Pi embedded computing devices.
The expansion of the submissions has been helped, noted Kanter, by some new approaches adopted by MLCommons. For example, this time around, submitters were allowed to use what is called “early stopping,” where a submitter can stop a test run before a certain number of training “epochs” have passed, rather than train for as long as possible.
Doing so meant that slower systems that would be challenged to even complete a benchmark test, especially lower-power devices such as the Raspberry Pi, were no longer at an extreme disadvantage.
“Early stopping is super-helpful,” said Kanter. “If you can cut down your runtime by a factor of ten, you can do ten times as many benchmarks.”
“This time around, the percentage of closed submissions with power measurement went from 15.7% to 17.6%, so, an increase, but we still have work to do there,” said Kanter. “Closed” refers to submissions that adhere strictly to the MLCommons’s benchmark neural network configuration.
In the “open” division, where submitters can take liberties with neural network formation, and which is dominated by Krai, the share of submissions with power measurements surged to 86% from 32%, said Kanter.
“We had some submitters who were not able to get a power meter last time because of supply chain issues,” said Kanter.
In the MLPerf TinyML section, where benchmark tasks include such things as the latency in detecting a “wake word” (the thing that activates a smart speaker or another AI assistant), eight vendors competed with novel processors, including computer chip designer Andes Technology. Andes’s “AndesCore” chips make use of the open-source RISC-V computer instruction set, which is competing with ARM and Intel as an instruction set that can be freely modified for any kind of computing device.
On one common task, “visual wake words,” which makes use of the data set known as COCO 14, “common objects in context,” to test object recognition in images, the top score in terms of latency was taken by startup Plumerai, which creates its own software to train and deploy AI models on standard microprocessors.
Using an STMicroelectronics chip with an ARM Cortex A-7 processor core, Plumerai delivered COCO 14 results in 59.4 milliseconds of latency.
The only category that saw a decline in reported results was the category of mobile ML, consisting of results for mobile phones and laptops. Qualcomm and Samsung each submitted one system, a smartphone, but the category for laptops was completely empty, whereas in October, it had one submission from Intel.
Asked about the paucity of reports for mobile, MLCommons’s Kanter noted that mobile is hard as a category because mobile phones are a product line that no one wants to talk about before the phones are announced, unlike cloud and edge servers, which have long product life cycles.
“Many of the mobile members [of MLCommons] are system-on-chip makers, and they may not want to use their partners’ phones in a pre-release fashion,” said Kanter. On the flip side, there is less motivation to submit benchmark results on phones that are three or four months old.
Kanter said the MLCommons is working on ways to bridge the gap in the future by making it easier for smartphone makers to submit without giving away their product unveilings.
“Longer term, [what] we would like to do is get it so that there is a way, if you’re going to launch a smartphone on April 23rd, say, that you can show up on stage with an MLPerf number that day, so that we enable our partners and our members to launch with MLPerf,” said Kanter.
The Qualcomm and Samsung submissions for phones consisted of a Xiaomi MI12 phone, in Qualcomm’s case, and the Samsung Galaxy S22+ 5G, which showed off their respective processors, Snapdragon 8 Gen1 and Exynos 2200. In tasks such as image recognition and natural language processing, Qualcomm’s Xiaomi results had lower latency than the Samsung Galaxy performance across most of the tests.
More technical details on the MLPerf benchmark approach are available in the original research paper posted in 2019 by the group.