After following all the big news in the AI infra space, I decided to tell you about the crazy stuff happening right now!
First, let's start with NVIDIA's Blackwell chip!
Serverless GPU companies are preparing for the next big leap in performance by investing in Blackwell chips. Together.ai, Nebius.com, and Lambda Labs have already pre-ordered them. But what does a newer chip even mean?
In the industry, a newer chip means:
more streaming multiprocessors!
higher-bandwidth memory
better-optimized Tensor Cores!
AI companies are constantly innovating, and that's why semiconductor companies work hard to come up with ever more powerful chips. Advanced AI systems demand immense GPU power for better performance, and all of those companies rely heavily on computational capacity.
And of course, here are the Blackwell stats:
10 TB/s of chip-to-chip data transfer
two dies interconnected into one chip
has 208 billion transistors
AKA the most powerful AI chip in the world!
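To put that 10 TB/s number in perspective, here's a quick back-of-envelope calculation. The model size is my own illustrative pick (a hypothetical 70B-parameter fp16 model), not an NVIDIA figure:

```python
# Back-of-envelope: how long to stream a model's weights at Blackwell's
# quoted 10 TB/s chip-to-chip bandwidth. Illustrative numbers only.

def transfer_time_s(params: int, bytes_per_param: int, bandwidth_tb_s: float) -> float:
    """Seconds to move a model's weights at a given bandwidth."""
    total_bytes = params * bytes_per_param
    return total_bytes / (bandwidth_tb_s * 1e12)

# A hypothetical 70B-parameter model in fp16 (2 bytes/param) is ~140 GB.
t = transfer_time_s(70_000_000_000, 2, 10.0)
print(f"{t * 1000:.0f} ms")  # -> 14 ms
```

Roughly 14 ms for a full pass over 140 GB of weights. Bandwidth like that is exactly what keeps the Tensor Cores fed instead of idle.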
And when they announced the chip, they also came up with a new InfiniBand platform, which (they say) can power trillion-parameter models with up to 800 Gb/s of throughput.
NVIDIA is crazy, dude!
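Just for fun, here's a rough feel for that 800 Gb/s figure. These are my own assumed numbers, and real all-reduce traffic is sharded across many links and overlapped with compute, so treat this as a naive single-link upper bound:

```python
# Rough estimate: time to move one full fp16 copy of a trillion-parameter
# model's gradients over a single 800 Gb/s link. Illustrative only; real
# collective ops shard and overlap this traffic across many links.

def link_time_s(params: int, bytes_per_param: int, link_gb_s: float) -> float:
    """Seconds to push params * bytes_per_param bytes over one link."""
    bits = params * bytes_per_param * 8
    return bits / (link_gb_s * 1e9)

t = link_time_s(1_000_000_000_000, 2, 800.0)
print(f"{t:.0f} s")  # -> 20 s
```

Twenty seconds for one naive pass over a single link shows why trillion-parameter training needs both fat pipes and clever communication scheduling.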
When it comes to ML infra, let's talk about Meta's GenAI infrastructure: the powerhouse behind both Llama training and most of Meta's RecSys algorithms.
They published a blog post about their infrastructure and how they built it from the ground up. It has all of their data center secrets: which switches they used, cluster insights, and rack configurations. 🥰
Meta's advanced use cases:
pretraining/fine-tuning foundation models
recommendation systems for Instagram/Facebook
image generation, mostly animations
internal developer productivity for coding
"With great power comes great responsibility" - Uncle Ben
But wait a minute, have you ever thought about how much money Meta is dropping on this huge cluster? 💰 Zuck is aiming for AGI, and he's definitely playing the long game. They don't want to stay dependent on NVIDIA chips either, even though they have a huge goal for the end of the year: compute equivalent to 600k H100 GPUs as part of their infrastructure. This ambition to control their own destiny extends to the very heart of their AI operations.

Deep down in the core of Meta's AI infrastructure, PyTorch and custom CUDA kernels are the dynamic duo driving their advancements. PyTorch, with its flexibility and simplicity (they designed it that way!), provides the framework for building and training complex AI models. But it's the custom CUDA kernels, engineered to maximize NVIDIA GPU efficiency, that unlock Meta's full performance potential. This combination optimizes every part of their system, boosting both the training of foundation models like Llama and the ranking and recommendation systems that keep users hooked on their platforms.
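To see why custom kernels matter so much, here's a toy model of kernel fusion (my own simplification, not Meta's actual numbers): elementwise ops are memory-bound, so each unfused op reads and writes the whole tensor, while a fused kernel touches DRAM only once.

```python
# Toy model of why fused/custom kernels win: for memory-bound elementwise
# ops, DRAM traffic (not math) is the bottleneck. Illustrative only.

def traffic_gb(n_elems: int, n_ops: int, bytes_per_elem: int, fused: bool) -> float:
    """Approximate DRAM gigabytes moved by a chain of elementwise ops."""
    # Unfused: every op does one full read + one full write of the tensor.
    # Fused: one read + one write total, regardless of how many ops.
    passes = 2 if fused else 2 * n_ops
    return n_elems * bytes_per_elem * passes / 1e9

n = 1_000_000_000  # a hypothetical 1B-element fp16 activation tensor
print(traffic_gb(n, 3, 2, fused=False))  # -> 12.0 (GB)
print(traffic_gb(n, 3, 2, fused=True))   # -> 4.0 (GB)
```

A 3x cut in memory traffic for a chain of three ops; that's the kind of win hand-written CUDA kernels (and compilers like torch.compile) chase at Meta's scale.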
They also aim to develop their own inference engine for ranking and recommendations, which will power Reels, the News Feed, Ads, and so on; of course, they have a long-term roadmap for this.
The AI infra space is evolving faster than ever, with the big tech giants pushing the boundaries to build more efficient and powerful systems. Whether it's NVIDIA coming up with a new GPU chip or Meta unveiling its very own AI infrastructure secrets, one thing's clear: we're witnessing real progress in the AI revolution!
Thanks for reading, follow me on LinkedIn!