Groq: Enabling Fast AI Inference with Custom-Designed LPUs
Language Processing Units deliver exceptional compute speed
I know you have a lot of questions…
How did Groq chips come about?
Why is it incredibly fast compared to other engines?
What is the deal with LPUs?
Groq builds a custom-designed chip that accelerates language models, aptly named the Language Processing Unit.
The shortest TL;DR:
The story of the Groq chip goes back to the Google TPU team.
One of the founders, Jonathan Ross, wanted to build a startup based on the TPU.
Constrained inside Google, Jonathan left, designed the Groq chip, and boom.
I believe it wasn’t an easy run for them, designing their own chip to push token speed past NVIDIA’s. That’s what I call innovation! By the way, I personally learned all about Groq during the LLM hype; its benchmark performance blew my mind.
How to use it in a project? Check out my demo-day project using AWS and Groq:
Let’s start with the LPU…
Applying first-principles thinking, Groq is:
faster
cheaper
Why?
Because the team worked out why GPUs are slow, and how to make inference faster and cheaper at the same time. Let’s go step by step.
How do GPUs work today? Even though GPUs do a great job at parallelization, they are I/O-bound: shuttling data to and from high-bandwidth memory (HBM) burns both time and energy. It’s like the old way of building cars.
Before Ford’s assembly line, cars were hauled between warehouses, back and forth. Today’s GPUs still struggle to move data efficiently, much like those inefficient production lines. (Jonathan’s goal for 2025, by the way, is 1 billion tokens/sec, with the Aramco partnership.)
And they came up with a proper plan: build a compiler from the ground up for this specific task, get rid of shuffling data around, and split the computation into stages.
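To get a feel for what “split the computation into stages” buys you, here is a toy sketch of a statically scheduled pipeline. This is my own illustration, not Groq’s compiler: the point is that every stage advances exactly once per cycle, so the total cycle count is known before execution starts — no dynamic scheduling, no waiting on data.

```python
def run_static_pipeline(stages, inputs):
    """Advance every stage exactly once per 'cycle', like a hardware pipeline."""
    n = len(stages)
    # Pipeline registers between stages; None marks an empty slot ("bubble").
    regs = [None] * n
    outputs = []
    # Total cycles = items + pipeline depth (fill + drain), known in advance.
    for cycle in range(len(inputs) + n):
        # Drain: the last register holds a finished result.
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Shift data down the pipeline, applying each stage's work.
        for i in reversed(range(1, n)):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        nxt = inputs[cycle] if cycle < len(inputs) else None
        regs[0] = stages[0](nxt) if nxt is not None else None
    return outputs

# Three toy stages computing ((x + 1) * 2) - 3, one item entering per cycle.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_static_pipeline(stages, [1, 2, 3]))  # prints [1, 3, 5]
```

Because the schedule is fixed at “compile time,” latency is fully predictable — which is exactly the property the next paragraph contrasts with GPUs.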
In the LPU, everything operates out of SRAM, which makes it much faster than reaching out to external memory. The hardware is also designed for predictable execution, unlike GPU systems, which are mostly non-deterministic. Even though Groq’s components are older and the chip size isn’t ideal, it still performs 10x faster than GPUs. I can’t wait to see what Groq’s future holds! Everything in the stack is custom-built: custom chip, custom Ubuntu golden image, and custom compiler. Here are the hardware specs:
GroqRack™: 12 PetaFlops, 14 GB global SRAM, 8x server with 64 + 1 cards.
GroqNode™: 1.5 PetaFlops, 1.76 GB of on-die memory, 8x GroqChip.
GroqChip™: 189 TeraFlops, 230 MB of on-die memory.
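A quick back-of-the-envelope check on those numbers (my arithmetic, not Groq’s docs): the compute figures scale cleanly by 8 at each level of the hierarchy, while the memory figures land slightly under a clean 8x, presumably a raw-vs-usable distinction.

```python
# Sanity-checking the spec sheet's scaling: 8 chips per node, 8 nodes per rack.
chip_tflops, chip_sram_mb = 189, 230

node_tflops = 8 * chip_tflops       # 8 chips per GroqNode
print(node_tflops / 1000)           # 1.512 PetaFlops, close to the listed 1.5

rack_pflops = 8 * 1.5               # 8 nodes per GroqRack
print(rack_pflops)                  # 12.0 PetaFlops, matching the listed 12

print(8 * chip_sram_mb / 1024)      # ~1.80 GB vs the listed 1.76 GB per node
print(8 * 1.76)                     # 14.08 GB vs the listed 14 GB global SRAM
```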
And don’t forget the benchmarks; here’s the one I came up with. Cerebras is a whole different story. :)
How to use Groq?
Go to console.groq.com
Create an API key
Enjoy.
Thanks for reading, follow me on LinkedIn!