SANTA CLARA, CALIF. — “If you think about the product portfolio that AMD has, it’s arguably the broadest in the industry in terms of AI compute,” Vamsi Boppana, senior VP of the AI group at AMD, said in his keynote address at the recent AI Hardware Summit. AMD’s hardware portfolio includes data-center–class CPUs and GPUs, consumer GPUs, FPGAs and the Ryzen 7040, a client CPU with an integrated NPU (neural processing unit) for PCs. Software is key to unlocking the performance of these different hardware platforms. But how does AMD compete with its GPU rivals’ strong software offerings, given its more diverse hardware?
AMD’s software stacks for each class of product are separate: ROCm (short for Radeon Open Compute platform) targets its Instinct data center GPU lines (and, soon, its Radeon consumer GPUs), Vitis AI targets its FPGAs, and ZenDNN targets its client devices.
How far along is AMD with unifying these stacks?
“We have enormous customer pull coming, and that is dictating quite a bit of our near-term plans,” Boppana told EE Times in an interview after his talk here. “The plane is flying right now, so we cannot disassemble the engine. However, we are absolutely doing things at the foundational level to make more unification happen in our stack.”
Boppana said that there’s some common infrastructure and tooling underlying all three stacks, including an ongoing effort to make a common quantizer.
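AMD has not published details of that shared quantizer, but the core job of any such module is post-training affine quantization: mapping floating-point tensor values onto low-bit integers via a scale and zero-point. A minimal pure-Python sketch of the idea (hypothetical helper names, not AMD code):

```python
def affine_qparams(xmin, xmax, nbits=8):
    """Compute scale and zero-point for asymmetric affine quantization.

    The representable range must include 0.0 so that zero quantizes
    exactly (important for padding and ReLU outputs).
    """
    qmin, qmax = 0, 2**nbits - 1
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, nbits=8):
    """Map floats to clamped integers: q = round(v / scale) + zero_point."""
    qmin, qmax = 0, 2**nbits - 1
    return [min(max(round(v / scale) + zero_point, qmin), qmax) for v in values]

def dequantize(qvalues, scale, zero_point):
    """Recover approximate floats: v ≈ (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]
```

A production quantizer adds calibration (choosing `xmin`/`xmax` from observed activations), per-channel parameters and per-backend integer constraints, but sharing even this arithmetic across ROCm, Vitis AI and ZenDNN would keep accuracy behavior consistent across hardware targets.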
“Over time, we want to get to a place where users have one execution provider, and underneath that, you will be able to select [a hardware target],” he said. “In the near term, modules are shared across stacks, and over time, as things like heterogeneous platforms are going to become prevalent, the unified elements start coming through.”
A unified stack would be helpful for heterogeneous systems, Boppana said, especially where partitioning is required. Currently, the Vitis stack handles CPU-plus-XDNA targets, but he agreed that both automatic and user-driven partitioning will be necessary.
“In that scenario, we need to be able to take a problem statement and cut the graph, such that both parts of the graph get executed on [different parts of the hardware], and they need to inter-operate,” he said.
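As an illustration of the kind of cut Boppana describes, here is a deliberately simplified sketch (hypothetical, not AMD’s partitioner) that splits a linear operator graph into contiguous segments by device placement; each boundary between segments is exactly where the two parts of the graph would have to inter-operate via a tensor transfer:

```python
def partition(graph, placement):
    """Cut a linear op graph into contiguous runs per device.

    graph: ordered list of op names.
    placement: dict mapping each op name to its target device.
    Returns a list of (device, [ops]) segments; each device change
    between consecutive segments implies a tensor transfer.
    """
    segments = []
    for op in graph:
        dev = placement[op]
        if segments and segments[-1][0] == dev:
            segments[-1][1].append(op)  # extend the current run
        else:
            segments.append((dev, [op]))  # start a new run on a new device
    return segments
```

For example, placing the convolutions on an NPU and the rest on the CPU, `partition(["conv1", "relu", "matmul", "softmax"], {"conv1": "npu", "relu": "npu", "matmul": "cpu", "softmax": "cpu"})` yields two segments with one transfer between them. Real partitioners work on DAGs rather than chains and weigh transfer cost against per-device speedups, which is why both automatic and user-driven cuts come into play.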
‘ROCm has evolved’
ROCm is less mature than competitors’ GPU software offerings, with Nvidia’s mature CUDA stack often seen as a big part of the market leader’s competitive advantage.
“Software is a journey,” Boppana said. “Anybody who has written or managed complex pieces of software knows it takes time. The good news is, we have been on the journey…ROCm has evolved.”
AMD has made ROCm the company’s No. 1 priority over the last year, Boppana said, standing up a new organization that brings together software assets from across the company.
“We have much larger resources actually working on software, and [AMD CEO Lisa Su] has been very clear that she wants to see significant and continued investments on the software side,” Boppana said. “We have agreed to provide people internally, we have acquired Mipsology, and we are looking to grow talent both organically and inorganically.”
AMD also recently stood up an internal AI models group to increase its experience using its own software stack.
“We want a much tighter feedback loop,” Boppana said.
Using open source to challenge Nvidia
AMD has embraced OpenAI’s Triton, an open-source programming language and compiler for GPUs that offers an alternative to Nvidia’s CUDA for developers who want to write high-level code that still performs well on the hardware.
“There are different personas that are programming [our GPUs],” he said. “[Triton] is a level of abstraction that people are comfortable with. It’s productive. And it gets to hardware in a pretty efficient, cogent fashion. But for other customers, that doesn’t matter; they don’t need to develop new kernels. For them, we can ship libraries. So, it’s just a matter of who wants to use us.”
In contrast to Nvidia’s approach with CUDA, which is mostly proprietary, most of AMD’s ROCm stack is open source.
“We partner with the [AI frameworks] and the people writing the libraries and say, ‘If you have a kernel you want to put together, you can take something that exists from us, but if you find there’s the opportunity for you to optimize source code, [you can]’,” he said. “Then we have so many more people that are willing and able to contribute. So, that’s very important and very powerful for us: We think it’s the right strategic direction for us to take.”
MI300 samples are currently with customers, Boppana said, and both customers and AMD have AI training workloads up and running, with availability coming at the end of this year.
ROCm will be crucial to the success of both the MI300 and MI300X.
“Being candid, we have a few places to grow,” he said. “Allowing the community to contribute [to ROCm] alongside us helps us bridge the gap faster.”