Layers of Latency
A tech buddy asked me why it is so important for China to catch up in chip fabrication processes: can't they just put more servers into a data center? In short, it is not that easy.
By shrinking the fab process you can fit more transistors onto one chip, and/or run at a higher frequency, and/or lower the power consumption.
The fab process is measured in "nm", nanometers. Nowadays these numbers no longer reflect real physical dimensions, but rather the transistor density and the efficiency of the fab process.
Simplified: planar 2D MOSFET designs were used down to 22nm, FinFET 3D structures from 14nm to 7nm, and GAAFET 3D structures below 7nm.
Take a look at the 7nm and 3nm fab processes, for example:
https://en.wikipedia.org/wiki/7_nm_process#Process_nodes_and_process_offerings
https://en.wikipedia.org/wiki/3_nm_process#3_nm_process_nodes
Roughly speaking, the 7nm process packs ~100M transistors per mm2, the 3nm process packs ~200M transistors per mm2.
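To put these densities into perspective, a quick back-of-the-envelope sketch in Python; the 600 mm2 die size is just an assumed example, not a specific product:

# Back-of-the-envelope: how many transistors fit on an assumed 600 mm2 die?
# The densities are the rough figures quoted above; the die size is a made-up example.
DENSITY_7NM = 100e6   # ~100M transistors per mm2
DENSITY_3NM = 200e6   # ~200M transistors per mm2
DIE_AREA_MM2 = 600    # assumed large die, roughly GPU-class

print(f"7nm: {DENSITY_7NM * DIE_AREA_MM2 / 1e9:.0f}B transistors")
print(f"3nm: {DENSITY_3NM * DIE_AREA_MM2 / 1e9:.0f}B transistors")
# -> ~60B vs. ~120B transistors on the same die area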
And here latency steps in. As soon as you, as a programmer, leave the CPU, latency increases: it starts with the different levels of cache, then goes to RAM, to the PCIe bus, to the network...
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
L2 cache reference                             7 ns
Main memory reference                        100 ns
Send 1K bytes over 1 Gbps network         10,000 ns        10 us
Read 4K randomly from SSD*               150,000 ns       150 us
Read 1 MB sequentially from memory       250,000 ns       250 us
Round trip within same datacenter        500,000 ns       500 us
Read 1 MB sequentially from SSD*       1,000,000 ns     1,000 us    1 ms
Read 1 MB sequentially from disk      20,000,000 ns    20,000 us   20 ms
Send packet CA->Netherlands->CA      150,000,000 ns   150,000 us  150 ms
Source:
Latency Numbers Every Programmer Should Know
https://gist.github.com/jboner/2841832
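To get a feeling for these orders of magnitude, here is a small Python sketch that rescales the table so that one L1 cache reference takes one second; the numbers are copied straight from the table above:

# Rescale the latency numbers above so that an L1 cache hit takes "1 second".
# Values in nanoseconds, copied from the table.
latencies_ns = {
    "L1 cache reference":                 0.5,
    "L2 cache reference":                 7,
    "Main memory reference":              100,
    "Send 1K bytes over 1 Gbps network":  10_000,
    "Read 4K randomly from SSD":          150_000,
    "Round trip within same datacenter":  500_000,
    "Read 1 MB sequentially from disk":   20_000_000,
    "Send packet CA->Netherlands->CA":    150_000_000,
}

L1 = latencies_ns["L1 cache reference"]
for name, ns in latencies_ns.items():
    seconds = ns / L1            # human-scaled time
    days = seconds / 86_400
    print(f"{name:36s} {seconds:>12,.0f} s  (~{days:,.1f} days)")
# If an L1 cache hit is scaled to one second, the CA->Netherlands->CA round trip
# ends up at ~3,472 days, almost a decade.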
As a low-level programmer you want to stay on the CPU and preferably work out of the cache. As a GPU programmer you deal with several layers of parallelism, e.g.:
1. across shader-cores of a single GPU chip (with >10K shader-cores)
2. across multiple chiplets of a single GPU (with currently up to 2 chiplets)
3. across a server node (with up to 8 GPUs)
4. across a pod of nodes (with 256 to 2048 GPUs or TPUs)
5. across a cluster of server nodes/pods (with up to 100K GPUs in a single data center)
6. across a grid of clusters/nodes
Each layer adds an increasing amount of latency.
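A rough sketch of how the parallelism multiplies across these layers; the counts are the rough figures from the list above, not any specific product:

# How the available parallelism multiplies across the layers listed above.
# All counts are the rough figures from the list, not a specific product.
shader_cores_per_gpu = 10_000   # >10K shader-cores per GPU chip
gpus_per_node        = 8        # up to 8 GPUs per server node
nodes_per_pod        = 256      # 256 x 8 = 2048 GPUs per pod (upper figure from the list)
gpus_per_cluster     = 100_000  # up to 100K GPUs in a single data center

print("parallel units on one GPU:     ", shader_cores_per_gpu)
print("parallel units on one node:    ", shader_cores_per_gpu * gpus_per_node)
print("parallel units in one pod:     ", shader_cores_per_gpu * gpus_per_node * nodes_per_pod)
print("parallel units in the cluster: ", shader_cores_per_gpu * gpus_per_cluster)
# -> 10K on a chip, 80K on a node, ~20M in a pod, ~1B across a 100K-GPU cluster;
#    but every step outward also adds a layer of latency.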
So as a GPU programmer you ideally want to hold your problem space in the memory of, and run your algorithm on, a single but thick GPU.
Neural networks, for example, are a natural fit for GPUs, a case of so-called embarrassingly parallel workloads,
https://en.wikipedia.org/wiki/Embarrassingly_parallel
but you need to hold the neural network weights in memory, and therefore couple multiple GPUs together to be able to infer or train networks with billions or trillions of weights, or parameters. Meanwhile LLMs use techniques like MoE, mixture of experts, so the load can be distributed further; inference runs, for example, on a single node with 8 GPUs serving up to 16 MoE experts. The training of LLMs is yet another topic, with further parallelism techniques (a small sketch follows after this list) so the training can be distributed over thousands of GPUs in a cluster:
1. data parallelism
2. tensor parallelism
3. pipeline parallelism
4. sequence parallelism
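To make the first two schemes concrete, here is a minimal NumPy sketch; the shapes and the worker count are made up, and plain arrays stand in for the per-GPU shards:

# Toy sketch of data parallelism vs. tensor parallelism on one layer (x @ w).
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, workers = 16, 512, 1024, 4

x = rng.standard_normal((batch, d_in))   # activations
w = rng.standard_normal((d_in, d_out))   # one weight matrix of a layer

reference = x @ w                        # what a single "thick" GPU would compute

# 1. Data parallelism: every worker holds the full weight matrix,
#    but only sees a slice of the batch.
data_parallel = np.concatenate(
    [x_shard @ w for x_shard in np.split(x, workers, axis=0)], axis=0)

# 2. Tensor parallelism: every worker holds only a column slice of the weights,
#    but sees the full batch; the outputs are concatenated along the feature axis.
tensor_parallel = np.concatenate(
    [x @ w_shard for w_shard in np.split(w, workers, axis=1)], axis=1)

assert np.allclose(reference, data_parallel)
assert np.allclose(reference, tensor_parallel)
print("both schemes reproduce the single-GPU result")

Data parallelism duplicates the weights on every worker, tensor parallelism shards them, and real training runs combine several of these schemes at once.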
And then there is power consumption, of course. The Colossus supercomputer behind Grok AI, with its 100K GPUs, consumes an estimated 100MW of power, so it makes a real difference if the next fab process delivers the same performance at half the wattage.
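A quick sanity check on these numbers; the per-GPU figure is only the implied average, including cooling and networking overhead:

# Rough power math for an assumed 100K-GPU cluster at ~100MW total.
gpus = 100_000
total_mw = 100                      # estimated total power in megawatts
watts_per_gpu = total_mw * 1e6 / gpus
print(f"~{watts_per_gpu:.0f} W per GPU, averaged over the whole data center")

# If the next fab process delivered the same performance at half the wattage:
print(f"potential saving: ~{total_mw / 2:.0f} MW, or, the other way around,")
print("twice the compute inside the same power envelope.")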
Therefore it is important to invest in smaller chip fabrication processes: to increase the size of the neural networks we are able to infer and train, to lower power consumption, and to increase efficiency.