
Intel, Ampere show running LLMs on CPUs isn't as crazy as it sounds


Popular generative AI chatbots and services like ChatGPT or Gemini mostly run on GPUs or other dedicated accelerators, but as smaller models are more widely deployed in the enterprise, CPU-makers Intel and Ampere are suggesting their wares can do the job too – and their arguments aren’t entirely without merit.

To be clear, running LLMs on CPU cores has always been possible – if users are willing to endure slower performance. However, the penalty that comes with CPU-only AI is shrinking as software optimizations are implemented and hardware bottlenecks are mitigated.

On stage at Intel’s Vision event in April, CEO Pat Gelsinger revealed the chipmaker’s progress getting larger LLMs to run on its Xeon platform. In a demo of its upcoming Granite Rapids Xeon 6 processor, Gelsinger showed Meta’s Llama2-70B model running at 4-bit precision with second token latencies of 82ms.

First token latency is the time a model spends analyzing a query and generating the first word of its response. Second token latency is the time taken to deliver each subsequent token to the end user. The lower the latency, the better the perceived performance.

Because of this, inference performance is often given in terms of milliseconds of latency or tokens per second. By our estimate, 82ms of second token latency works out to roughly 12 tokens per second.
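For the curious, the conversion is simply the reciprocal of the per-token latency – a quick back-of-the-envelope sketch:

```python
# Convert a per-token (second token) latency into tokens per second.
def tokens_per_second(second_token_latency_ms: float) -> float:
    return 1000.0 / second_token_latency_ms

print(tokens_per_second(82))   # ~12.2 tokens/sec - Granite Rapids Xeon 6 demo
print(tokens_per_second(151))  # ~6.6 tokens/sec  - 5th-gen Xeon
```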

While slow compared to modern GPUs, it’s still a sizeable improvement over Chipzilla’s 5th-gen Xeon processors launched in December, which only managed 151ms of second token latency.

Oracle has also shared results for running the smaller Llama2-7B model on Ampere’s Altra CPUs. Putting its 64-core OCI A1 instance against a 4-bit quantized version of the model, Oracle was able to achieve between 33 and 119 tokens per second of throughput for batch sizes 1 and 16, respectively.

In the context of a chatbot, a larger batch size translates into a larger number of queries that can be processed concurrently. Oracle’s testing showed the larger the batch size, the higher the throughput – but the slower the model was at generating text. For example, at a batch size of 16, Oracle managed its best throughput – but only about 7.5 tokens per second per query. And, from the end user’s perspective, that’s what they’re going to notice.
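The per-query figure is just aggregate throughput divided by batch size – a rough sketch using Oracle's published numbers:

```python
# Aggregate vs per-query throughput, using Oracle's published Llama2-7B figures
# on the 64-core OCI A1 (Ampere Altra) instance at 4-bit quantization.
aggregate_tokens_per_sec = {1: 33, 16: 119}  # batch size -> total tokens/sec

for batch, total in aggregate_tokens_per_sec.items():
    per_query = total / batch
    print(f"batch={batch:2d}: {total} tokens/sec aggregate, ~{per_query:.1f} tokens/sec per query")

# batch= 1: 33 tokens/sec aggregate, ~33.0 tokens/sec per query
# batch=16: 119 tokens/sec aggregate, ~7.4 tokens/sec per query
```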

While Oracle has shared results at multiple batch sizes, it should be noted that Intel has only shared performance at a batch size of one. We've asked for more detail on performance at higher batch sizes and we'll let you know if Intel responds.

According to Ampere chief product officer Jeff Wittich, much of this was possible thanks to custom software libraries and optimizations to Llama.cpp made in collaboration with Oracle. Both Oracle and Intel have since shared performance data for Meta’s newly launched Llama3 models showing similar performance characteristics.
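Neither vendor has published its full test harness, but llama.cpp – and its Python bindings, llama-cpp-python – will happily run 4-bit quantized GGUF models on CPU threads if you want to try something similar yourself. A minimal sketch, with the model file, thread count, and context size as placeholder assumptions rather than anyone's benchmark config:

```python
# Minimal CPU-only inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path, thread count, and context size are illustrative placeholders,
# not the configuration Intel or Oracle/Ampere used in their tests.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b.Q4_0.gguf",  # a 4-bit quantized GGUF model
    n_ctx=2048,                          # context window
    n_threads=64,                        # match the host's physical core count
)

output = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```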

Assuming these performance claims are accurate – given the test parameters and our experience running 4-bit quantized models on CPUs, there’s not an obvious reason to assume otherwise – it demonstrates that CPUs can be a viable option for running small models. Soon, they may also handle modestly sized models – at least at relatively small batch sizes.

While Intel and Ampere have demonstrated LLMs running on their respective CPU platforms, it’s worth noting that various compute and memory bottlenecks mean they won’t replace GPUs or dedicated accelerators for larger models.

For the kind of models pushing the envelope of generative AI capabilities, Ronak Shah, director of Xeon AI product management at Intel, told The Register forthcoming products like the Gaudi accelerator were designed to do the job.

Forget embarrassingly parallel, how about just parallel enough?

Talk of running LLMs on CPUs has been muted because, while conventional processors have increased core counts, they’re still nowhere near as parallel as modern GPUs and accelerators tailored for AI workloads.

But CPUs are improving. Modern units dedicate a fair bit of die space to features like vector extensions or even dedicated matrix math accelerators.

Intel implemented the latter beginning with its Sapphire Rapids Xeon Scalable processors, launched early last year. Each core is equipped with Advanced Matrix Extensions (AMX) too – although they’re not enabled on every SKU thanks to the magic of software-defined silicon.

As the name suggests, AMX is designed to accelerate the kinds of matrix math calculations common in deep learning workloads. Since then, Intel has beefed up its AMX engines to achieve higher performance on larger models – work that continues with Intel's Xeon 6 processors, due out later this year.

While Intel leans heavily on matrix acceleration, Ampere’s Wittich told us that the chip shop can achieve acceptable performance using the two 128-bit vector units baked into each of its AmpereOne and Altra cores. These vector units support FP16, BF16, INT8, and INT16 precisions.

The memory problem

While CPUs are nowhere near as fast as GPUs at pushing OPS or FLOPS, they do have one big advantage: they don’t rely on expensive capacity-constrained high-bandwidth memory (HBM) modules.

As we’ve discussed on numerous occasions, running a model at FP8/INT8 requires around 1GB of memory for every billion parameters. Running something like OpenAI’s 1.7 trillion parameter GPT-4 model at FP8 therefore requires over 1.7TB of memory – roughly half that when quantized to 4-bits. That’s more than any one GPU can provide, but well within the capabilities of modern CPUs.
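As a rough rule of thumb – ignoring the KV cache and activation memory, which add more on top – the weights alone work out to:

```python
# Rule-of-thumb weight footprint: parameters (in billions) x bits per weight / 8 = GB.
def weight_footprint_gb(params_billions: float, bits: int) -> float:
    return params_billions * bits / 8

print(weight_footprint_gb(70, 8))     # Llama2-70B at INT8  -> ~70 GB
print(weight_footprint_gb(70, 4))     # Llama2-70B at 4-bit -> ~35 GB
print(weight_footprint_gb(1700, 8))   # GPT-4-scale model at FP8 -> ~1,700 GB
```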

The catch, of course, is that capacious DRAM modules accessed by CPUs are glacially slow compared to HBM.

With just eight memory channels currently supported on Intel's 5th-gen Xeon and Ampere's AmpereOne processors, the chips are limited to roughly 350GB/sec of memory bandwidth when running 5600MT/sec DIMMs. And while Wittich told us there is a 12-channel version of Ampere's chip planned for later this year – apparently with 256 cores – it isn't out yet.
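Those figures fall out of a simple peak-bandwidth calculation – channels × transfer rate × eight bytes per 64-bit transfer. A quick sketch (theoretical peaks; sustained bandwidth lands lower):

```python
# Peak theoretical DRAM bandwidth: channels x transfer rate x 8 bytes per transfer.
def peak_bandwidth_gbps(channels: int, mt_per_sec: int, bytes_per_transfer: int = 8) -> float:
    return channels * mt_per_sec * bytes_per_transfer / 1000  # GB/sec

print(peak_bandwidth_gbps(8, 5600))  # ~358 GB/sec - 5th-gen Xeon / AmpereOne with DDR5-5600
print(peak_bandwidth_gbps(8, 3200))  # ~205 GB/sec - Ampere Altra with DDR4-3200
```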

That said, all of Oracle’s testing has been on Ampere’s Altra generation, which uses even slower DDR4 memory and maxes out at about 200GB/sec. This means there’s likely a sizable performance gain to be had just by jumping up to the newer AmpereOne cores.

Now that might sound fast – certainly way speedier than an SSD – but eight HBM modules found on AMD’s MI300X or Nvidia’s upcoming Blackwell GPUs are capable of speeds of 5.3 TB/sec and 8TB/sec respectively. The main drawback is a maximum of 192GB of capacity.

In this sense, you can think of memory capacity sort of like a fuel tank, memory bandwidth as akin to a fuel line, and compute as an internal combustion engine. It doesn't matter how big your fuel tank is or how powerful your engine is if the fuel line is too small to feed the engine enough gas to keep it running at peak performance.
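To bolt some rough numbers onto that analogy: during single-batch generation every weight has to be streamed from memory for each token, so memory bandwidth divided by model footprint gives a ceiling on token rate. A back-of-the-envelope sketch – it ignores compute, caches, and KV-cache traffic, so treat it as an upper bound rather than a prediction:

```python
# Bandwidth-bound ceiling on single-batch generation speed:
# every weight is streamed from memory for each generated token.
def max_tokens_per_sec(bandwidth_gbps: float, model_footprint_gb: float) -> float:
    return bandwidth_gbps / model_footprint_gb

print(max_tokens_per_sec(350, 35))    # 70B at 4-bit on ~350 GB/sec DRAM -> ~10 tokens/sec
print(max_tokens_per_sec(5300, 35))   # same model on MI300X-class HBM  -> ~150 tokens/sec
```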

Because of this, attempts to run LLMs on CPUs have largely been limited to smaller models.

Clearing the bottlenecks

Despite these limitations, Intel’s upcoming Granite Rapids Xeon 6 platform offers some clues as to how CPUs might be made to handle larger models in the near future.

As we mentioned earlier, Intel's latest demo showed a single Xeon 6 processor running Llama2-70B at a reasonable 82ms of second token latency. There's a lot we still don't know about the test rig – most notably how many cores it had and how fast they were clocked. We'll have to wait until later this year – we're thinking December – to find out.

“The big thing that’s happening going from 5th-gen Xeon to Xeon 6 is we’re introducing MCR DIMMs, and that’s really what’s unlocking a lot of the bottlenecks that would have existed with memory bound workloads,” Shah explained.

Multiplexer combined rank (MCR) DIMMs allow for much faster memory than standard DRAM. Intel has already demonstrated the tech running at 8,800MT/sec. And with 12 memory channels kitted out with MCR DIMMs, a single Granite Rapids socket would have access to roughly 825GB/sec of bandwidth – more than 2.3x that of last gen and nearly 3x that of Sapphire.
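Run that through the same peak-bandwidth arithmetic and the numbers line up:

```python
# 12 channels x 8,800 MT/sec x 8 bytes per transfer.
print(12 * 8800 * 8 / 1000, "GB/sec")  # ~845 GB/sec theoretical peak vs the roughly 825 GB/sec quoted
```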

Wittich noted Ampere is also looking at MCR DIMMs, but didn't say when we might see the tech employed in silicon.

However, faster memory tech isn’t Granite Rapids’ only trick. Intel’s AMX engine has gained support for 4-bit operations via the new MXFP4 data type, which in theory should double the effective performance.

This lower precision also has the benefit of shrinking the model footprint and reducing the memory capacity and bandwidth requirements of the system. Of course, many of the footprint and bandwidth advantages can also be achieved using quantization to compress models trained at higher precisions. So, practically speaking, the benefit of 4-bit mathematics support in hardware comes down to performance.

Following the hump and optimizing for enterprise

Getting the mix of AI capabilities right is a bit of a balancing act for CPU designers. Dedicate too much die area to something like AMX, and the chip becomes more of an AI accelerator than a general-purpose processor.

So, instead of trying to make CPUs capable of running the largest and most demanding LLMs, vendors are looking at the distribution of AI models to identify which will see the widest adoption and optimizing products so they can handle those workloads.

“The sweet spot right now from a customer perspective is that 7–13 billion parameter model. That’s where we put most of our focus today,” Wittich declared.

Intel's Shah is seeing a similar spread in his engagements with customers. As we saw with Intel's Llama3-8B performance claims, Xeon 6 handles smaller models quite well, achieving second token latencies as low as 20ms in a dual-socket config even at the more demanding BF16 data type.

As generative AI evolves, the expectation is the peak in model distribution will shift toward larger parameter counts. But, while frontier models have exploded in size over the past few years, Wittich expects mainstream models will grow at a much slower pace.

He added that enterprise applications of AI are likely to be far less demanding than the public-facing AI chatbots and services which handle millions of concurrent users.

Ampere's own testing found that its CPUs scaled quite competitively against competing Arm CPUs from AWS and Nvidia's A10 Tensor Core GPU – until batch sizes hit eight, at which point the GPU pulled ahead.

The key takeaway is that as user numbers and batch sizes grow, the GPU looks better. Wittich argues, however, that it’s entirely dependent on the use case.

“In order to actually get to a practical solution with an A10, or even an A100 or H100, you’re almost required to increase the batch size, otherwise, you end up with a ton of underutilized compute,” he explained.

In an enterprise environment, Wittich made the case that the number of scenarios where a chatbot would need to contend with large numbers of concurrent queries is relatively small.

“We believe that CPUs will run the vast majority of inferencing on the LLM side,” he concluded. ®
