[{"data":1,"prerenderedAt":319},["ShallowReactive",2],{"blog-gpu-architecture-notes":3},{"id":4,"title":5,"author":6,"body":7,"categories":304,"date":308,"description":309,"extension":310,"hidden":311,"meta":312,"navigation":313,"path":314,"seo":315,"stem":316,"thumbnail":317,"__hash__":318},"blog\u002Fblog\u002Fgpu-architecture-notes.md","How the RTX 3090 Actually Works: GPU Architecture notes...","Anurag Kanade",{"type":8,"value":9,"toc":284},"minimark",[10,27,32,39,44,47,57,64,74,80,103,107,110,131,135,138,155,159,162,167,171,175,178,182,185,189,192,196,200,208,212,215,219,222,226,229,277],[11,12,13,14,21,22,26],"p",{},"I spent some time watching ",[15,16,20],"a",{"href":17,"rel":18},"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=example",[19],"nofollow","Branch Education's video"," on how GPUs work, specifically the RTX 3090, and took detailed notes. Figured I'd clean them up and share what I learned about the ",[23,24,25],"strong",{},"GA102 architecture",".",[28,29,31],"h2",{"id":30},"the-hardware-breakdown","The Hardware Breakdown",[11,33,34,35,38],{},"We're looking at ",[23,36,37],{},"GA102",", which is the 3090's GPU processor architecture.",[40,41,43],"h3",{"id":42},"the-hierarchy","The Hierarchy",[11,45,46],{},"The architecture is organized in layers:",[48,49,50],"ol",{},[51,52,53,56],"li",{},[23,54,55],{},"7 GPCs"," (Graphics Processing Clusters) at the top level",[11,58,59],{},[60,61],"img",{"alt":62,"src":63},"GPU architecture showing the 7 GPCs","https:\u002F\u002Flearnopencv.com\u002Fwp-content\u002Fuploads\u002F2025\u002F05\u002FGraphics-Processing-Clusters.png",[48,65,67],{"start":66},2,[51,68,69,70,73],{},"Within each GPC, there are ",[23,71,72],{},"12 SMs"," (Streaming Multiprocessors)",[11,75,76],{},[60,77],{"alt":78,"src":79},"Internal structure of a Streaming Multiprocessor (SM)","https:\u002F\u002Flearnopencv.com\u002Fwp-content\u002Fuploads\u002F2025\u002F05\u002FStreaming-Multiprocessors.png",[48,81,83,93],{"start":82},3,[51,84,85,86,89,90],{},"Inside each SM, there are ",[23,87,88],{},"4 warp schedulers"," and ",[23,91,92],{},"1 Ray Tracing core",[51,94,95,96,99,100],{},"Inside each warp, there are ",[23,97,98],{},"32 CUDA cores"," (shading cores) and ",[23,101,102],{},"1 Tensor core",[40,104,106],{"id":105},"total-core-count","Total Core Count",[11,108,109],{},"Across the entire GPU:",[111,112,113,119,125],"ul",{},[51,114,115,118],{},[23,116,117],{},"10,752"," CUDA cores",[51,120,121,124],{},[23,122,123],{},"336"," Tensor cores",[51,126,127,130],{},[23,128,129],{},"84"," Ray Tracing cores",[40,132,134],{"id":133},"around-the-edge","Around the Edge",[11,136,137],{},"The chip's periphery includes:",[111,139,140,143,146,149,152],{},[51,141,142],{},"12 graphics memory controllers",[51,144,145],{},"NVLink controllers",[51,147,148],{},"PCIe interface",[51,150,151],{},"6MB Level 2 SRAM cache at the bottom",[51,153,154],{},"Gigathread Engine that manages all 7 GPCs and the streaming multiprocessors inside",[40,156,158],{"id":157},"inside-each-sm","Inside Each SM",[11,160,161],{},"Each streaming multiprocessor contains:",[111,163,164],{},[51,165,166],{},"128KB of L1 cache\u002Fshared memory (configurable split)",[28,168,170],{"id":169},"what-each-core-does","What Each Core Does",[40,172,174],{"id":173},"cuda-cores","CUDA Cores",[11,176,177],{},"Can be thought of as simple binary calculators - they handle addition, multiplication, and a few other basic operations.",[40,179,181],{"id":180},"tensor-cores","Tensor Cores",[11,183,184],{},"Matrix multiplication and addition calculators. They're used the most when working with geometrical transformations and neural networks.",[40,186,188],{"id":187},"ray-tracing-cores","Ray Tracing Cores",[11,190,191],{},"The fewest and the largest cores. They're specially designed for ray tracing algorithms.",[28,193,195],{"id":194},"key-terminologies","Key Terminologies",[40,197,199],{"id":198},"fma-fused-multiply-add","FMA (Fused Multiply-Add)",[11,201,202,203,207],{},"The operation ",[204,205,206],"code",{},"A × B + C",". This is a fundamental calculation that gets used constantly in GPU operations.",[40,209,211],{"id":210},"simd-single-instruction-multiple-data","SIMD (Single Instruction, Multiple Data)",[11,213,214],{},"GPUs solve embarrassingly parallel problems using SIMD - applying one instruction to multiple data points simultaneously.",[40,216,218],{"id":217},"simt-single-instruction-multiple-threads","SIMT (Single Instruction, Multiple Threads)",[11,220,221],{},"Basically SIMD but adds a program counter, which avoids conflicts from dependency and branching of operations.",[28,223,225],{"id":224},"computational-architecture-physical-hardware","Computational Architecture → Physical Hardware",[11,227,228],{},"Now that we understand how SIMD\u002FSIMT works, here's how the computational architecture maps to the physical hardware:",[111,230,231,237,246,252,255,268],{},[51,232,233,234],{},"Each instruction is completed by a ",[23,235,236],{},"thread",[51,238,239,240,242,243],{},"A ",[23,241,236],{}," is paired with a ",[23,244,245],{},"CUDA core",[51,247,248,249],{},"Threads are bundled into groups of 32 called ",[23,250,251],{},"warps",[51,253,254],{},"The same sequence of instructions are issued to all threads in a warp",[51,256,257,260,261,264,265],{},[23,258,259],{},"Warps"," are grouped into ",[23,262,263],{},"thread blocks",", which are handled by a ",[23,266,267],{},"Streaming Multiprocessor (SM)",[51,269,270,260,273,276],{},[23,271,272],{},"Thread blocks",[23,274,275],{},"grids",", which are computed across the entire GPU",[11,278,279,280,283],{},"All these operations are managed and scheduled by the ",[23,281,282],{},"Gigathread Engine",", which maps the available thread blocks to the streaming multiprocessors.",{"title":285,"searchDepth":66,"depth":66,"links":286},"",[287,293,298,303],{"id":30,"depth":66,"text":31,"children":288},[289,290,291,292],{"id":42,"depth":82,"text":43},{"id":105,"depth":82,"text":106},{"id":133,"depth":82,"text":134},{"id":157,"depth":82,"text":158},{"id":169,"depth":66,"text":170,"children":294},[295,296,297],{"id":173,"depth":82,"text":174},{"id":180,"depth":82,"text":181},{"id":187,"depth":82,"text":188},{"id":194,"depth":66,"text":195,"children":299},[300,301,302],{"id":198,"depth":82,"text":199},{"id":210,"depth":82,"text":211},{"id":217,"depth":82,"text":218},{"id":224,"depth":66,"text":225},[305,306,307],"Hardware","GPU","Computer Architecture","2025-01-05","Breaking down the GA102 architecture - from GPCs and streaming multiprocessors to CUDA cores and ray tracing units.","md",false,{},true,"\u002Fblog\u002Fgpu-architecture-notes",{"title":5,"description":309},"blog\u002Fgpu-architecture-notes",null,"0uX3CWNMeYcu4mSbeptZAMnfujlBOKFEhMugJ-LB0sg",1775296369322]