How the RTX 3090 Actually Works: GPU Architecture Notes
I spent some time watching Branch Education's video on how GPUs work, specifically the RTX 3090, and took detailed notes. Figured I'd clean them up and share what I learned about the GA102 architecture.
The Hardware Breakdown
We're looking at GA102, the GPU chip at the heart of the 3090.
The Hierarchy
The architecture is organized in layers:
- 7 GPCs (Graphics Processing Clusters) at the top level

[Figure: GPU architecture showing the 7 GPCs (Graphics Processing Clusters)]
- Within each GPC, there are 12 SMs (Streaming Multiprocessors)

[Figure: Internal structure of a Streaming Multiprocessor (SM)]
- Inside each SM, there are 4 processing partitions (each with its own warp scheduler) and 1 Ray Tracing core
- Each partition contains 32 CUDA cores (shading cores) and 1 Tensor core
Total Core Count
Across the full GA102 die (7 GPCs × 12 SMs = 84 SMs, and each SM holds 128 CUDA cores, 4 Tensor cores, and 1 RT core):
- 10,752 CUDA cores
- 336 Tensor cores
- 84 Ray Tracing cores
(The RTX 3090 ships with two SMs disabled, so 10,496 CUDA cores, 328 Tensor cores, and 82 RT cores are actually active.)
Around the Edge
The chip's periphery includes:
- 12 graphics memory controllers
- NVLink controllers
- PCIe interface
- 6MB Level 2 SRAM cache at the bottom
- GigaThread Engine, which manages all 7 GPCs and the streaming multiprocessors inside them
Inside Each SM
Each streaming multiprocessor contains:
- 128KB of L1 cache/shared memory (configurable split)
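As a rough sketch (the kernel, its name, and the sizes are mine, not from the video), this is how a kernel opts into that pool: declaring `__shared__` storage carves part of the 128KB out as software-managed shared memory for the thread block.

```cuda
// Each block of 256 threads stages its slice of the input in shared memory,
// which lives in the SM's 128KB L1/shared pool, then writes it back reversed.
// Assumes a launch with 256 threads per block and a length that is a multiple of 256.
__global__ void reverseBlock(const float *in, float *out) {
    __shared__ float tile[256];                   // on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[i];
    __syncthreads();                              // whole tile is loaded

    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read back from fast storage
}
```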
What Each Core Does
CUDA Cores
Can be thought of as simple calculators: each one handles addition, multiplication, fused multiply-add, and a few other basic operations on individual values.
Tensor Cores
Matrix multiplication and addition calculators. They do the bulk of the work in geometric transformations and neural networks.
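For a flavour of how that's exposed to programmers, here's a hedged sketch using CUDA's warp-level `wmma` API (one supported configuration; the tile size, layouts, and names are just illustrative choices), where a single warp hands a 16×16 multiply-accumulate tile to the Tensor cores:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile: D = A * B + C on the Tensor cores.
// A and B are half precision, the accumulator is float.
__global__ void tensorTile(const half *A, const half *B,
                           const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::load_matrix_sync(aFrag, A, 16);                        // leading dimension 16
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(accFrag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);              // matrix multiply-add
    wmma::store_matrix_sync(D, accFrag, 16, wmma::mem_row_major);
}
```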
Ray Tracing Cores
The fewest in number and the largest in size. They're specially designed for ray tracing algorithms, accelerating work like bounding-box and ray-triangle intersection tests.
Key Terminologies
FMA (Fused Multiply-Add)
The operation A × B + C, performed as a single fused step. It's a fundamental calculation that GPUs run constantly.
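As a concrete example (kernel name and layout are mine), here's a minimal CUDA kernel where each thread performs one FMA; the `fmaf` intrinsic maps straight onto this hardware operation:

```cuda
// Each thread computes out[i] = a[i] * b[i] + c[i] as a single fused
// multiply-add, the A x B + C operation described above.
__global__ void fmaKernel(const float *a, const float *b,
                          const float *c, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = fmaf(a[i], b[i], c[i]);   // single fused multiply-add
    }
}
```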
SIMD (Single Instruction, Multiple Data)
GPUs solve embarrassingly parallel problems using SIMD - applying one instruction to multiple data points simultaneously.
SIMT (Single Instruction, Multiple Threads)
Basically SIMD, but each thread gets its own program counter (and its own registers), so threads can branch and follow data dependencies independently instead of being locked to one shared instruction stream.
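To see why that matters, here's a small illustrative kernel (names are mine) in which threads of the same warp take different branches; under SIMT each thread keeps its own place in the program, so the warp can execute both paths and reconverge rather than every thread being forced through identical work:

```cuda
// Threads in the same warp branch differently depending on their data.
// With SIMT, each thread tracks its own position in the program, so the
// warp runs the two paths and then reconverges.
__global__ void divergentKernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] >= 0.0f) {
        out[i] = in[i] * 2.0f;   // path taken by some threads of the warp
    } else {
        out[i] = -in[i];         // path taken by the others
    }
}
```

Divergence still has a cost: the warp effectively runs the two paths one after the other, so heavily branching code wastes lanes.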
Computational Architecture → Physical Hardware
Now that we understand how SIMD/SIMT works, here's how the computational architecture maps to the physical hardware:
- Each instruction is completed by a thread
- A thread is paired with a CUDA core
- Threads are bundled into groups of 32 called warps
- The same sequence of instructions is issued to all 32 threads in a warp
- Warps are grouped into thread blocks, which are handled by a Streaming Multiprocessor (SM)
- Thread blocks are grouped into grids, which are computed across the entire GPU
All these operations are managed and scheduled by the Gigathread Engine, which maps the available thread blocks to the streaming multiprocessors.
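Here's a small, self-contained example (array size, scale factor, and kernel name are mine) of how that hierarchy shows up in code: the launch configuration requests a grid of thread blocks, each block's warps get scheduled onto an SM, and every thread runs the same kernel on its own element via a CUDA core:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one array element; its position in the grid/block
// hierarchy gives it a unique index.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid -> block -> thread
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                 // about a million floats
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    int threadsPerBlock = 256;             // 8 warps of 32 threads per block
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    // The GigaThread Engine distributes these thread blocks across the SMs.
    scaleKernel<<<blocksPerGrid, threadsPerBlock>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}
```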