A tech buddy asked me why it is so important for China to catch up in chip fabrication processes: can't they just put more servers into a data center? In short, it is not that easy.

By shrinking the fab process you can put more transistors onto one chip, and/or run it at a higher frequency, and/or lower its power consumption.

The fab process is measured in "nm", nanometers. These days the numbers do not reflect real physical feature sizes anymore, but rather the transistor density and efficiency of the fab process.

Simplified: planar 2D MOSFET designs were used down to roughly the 22nm node, FinFET 3D structures from about 16/14nm down to the 5/3nm class, and GAAFET ("gate all around") 3D structures at the newest nodes from about 3/2nm on.

Take a look at the 7nm and 3nm fab processes for example:

https://en.wikipedia.org/wiki/7_nm_process#Process_nodes_and_process_offerings
https://en.wikipedia.org/wiki/3_nm_process#3_nm_process_nodes

Roughly speaking, the 7nm process packs ~100M transistors per mm2, the 3nm process ~200M transistors per mm2.
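
As a quick back-of-the-envelope calculation of what that density jump buys you, here is a small sketch in Python; the 600 mm2 die area is a made-up example, not a specific product:

    # Back-of-the-envelope: what the density jump means for one die.
    # The 600 mm2 die area is a hypothetical example, not a specific product.
    die_area_mm2 = 600
    density_7nm = 100e6   # ~transistors per mm2 at 7nm (rough figure from above)
    density_3nm = 200e6   # ~transistors per mm2 at 3nm (rough figure from above)

    print(f"7nm die: {die_area_mm2 * density_7nm / 1e9:.0f}B transistors")
    print(f"3nm die: {die_area_mm2 * density_3nm / 1e9:.0f}B transistors")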

And here latency steps in. As soon as you, as a programmer, leave the CPU, latency increases: it starts with the different levels of cache, then goes to RAM, then over the PCIe bus, then out to the network...

Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
L2 cache reference                           7   ns
Main memory reference                      100   ns
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Source:
Latency Numbers Every Programmer Should Know
https://gist.github.com/jboner/2841832
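
You can get a feeling for the cache vs. main memory gap even from Python. The sketch below (assuming numpy is installed) sums the same array once with sequential and once with random access; sequential access streams through the caches and hardware prefetchers, while random access pays the main memory latency again and again. The exact numbers depend on your machine.

    import time
    import numpy as np

    # Rough sketch: the same amount of work, sequential vs. random memory access.
    # 32M int64 values = 256 MB, much larger than any CPU cache.
    N = 32 * 1024 * 1024
    data = np.arange(N, dtype=np.int64)
    seq_idx = np.arange(N)
    rnd_idx = np.random.permutation(N)

    def gather_sum(idx):
        t0 = time.perf_counter()
        s = data[idx].sum()        # one load per element, in the order given by idx
        return time.perf_counter() - t0

    print(f"sequential access: {gather_sum(seq_idx)*1e3:8.1f} ms")
    print(f"random access:     {gather_sum(rnd_idx)*1e3:8.1f} ms")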

As a low-level programmer you want to stay on the CPU and preferably work out of the caches. As a GPU programmer you deal with several layers of parallelism, e.g. (a small sketch follows the list):

1. across shader-cores of a single GPU chip (with >10K shader-cores)
2. across multiple chiplets of a single GPU (with currently up to 2 chiplets)
3. across a server node (with up to 8 GPUs)
4. across a pod of nodes (with 256 to 2048 GPUs or TPUs)
5. across a cluster of server nodes/pods (with up to 100K GPUs in a single data center)
6. across a grid of clusters/nodes
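
As a minimal sketch of layers 1 and 3, assuming PyTorch and a node with one or more CUDA GPUs: a single matmul already fans out over all shader cores of one GPU, while splitting the work across several GPUs first has to move data over PCIe/NVLink, which is exactly where the extra latency comes from.

    import torch

    assert torch.cuda.is_available(), "this sketch assumes at least one CUDA GPU"

    # Layer 1: a single kernel launch fans out over all shader cores of one GPU.
    x = torch.randn(8192, 8192, device="cuda:0")
    y = x @ x                                    # thousands of cores work on this matmul
    torch.cuda.synchronize()

    # Layer 3: splitting the same work across all GPUs of one server node.
    # Each GPU gets a slice of the rows; the copies to the other devices go over
    # PCIe/NVLink and are where the additional latency comes from.
    n = torch.cuda.device_count()
    partials = []
    for i, rows in enumerate(x.chunk(n, dim=0)):
        dev = f"cuda:{i}"
        partials.append(rows.to(dev) @ x.to(dev))
    y_multi = torch.cat([p.to("cuda:0") for p in partials], dim=0)
    torch.cuda.synchronize()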

Each layer adds an increasing amount of latency.

So as a GPU programmer you ideally want to hold your problem space in the memory of, and run your algorithm on, a single but thick GPU.

Neural networks, for example, are a natural fit for a GPU; they are a case of so-called embarrassingly parallel workloads,

https://en.wikipedia.org/wiki/Embarrassingly_parallel

but you need to hold the neural network weights in RAM, and therefore you have to couple multiple GPUs together to be able to infer or train networks with billions or trillions of weights, a.k.a. parameters. These days LLMs use techniques like MoE, mixture of experts, so the load can be distributed further. Inference then runs, for example, on a single node with 8 GPUs, or on up to 16 such nodes for MoE models. The training of LLMs is yet another topic, with further parallelism techniques so that the work can be distributed over thousands of GPUs in a cluster (a sketch follows the list):

1. data parallelism
2. tensor parallelism
3. pipeline parallelism
4. sequence parallelism
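
As an illustration of tensor parallelism (point 2), here is a minimal sketch, assuming PyTorch: the weight matrix of one layer is split column-wise across two devices, each device computes its slice of the output, and the slices are gathered afterwards. Real training frameworks do this including the backward pass and the communication collectives.

    import torch

    # Minimal tensor-parallelism sketch: one weight matrix split column-wise over
    # two devices. Falls back to two CPU "devices" if fewer than two GPUs exist.
    if torch.cuda.device_count() >= 2:
        devices = ["cuda:0", "cuda:1"]
    else:
        devices = ["cpu", "cpu"]

    d_model, d_ff = 4096, 16384
    W = torch.randn(d_model, d_ff)                    # one (pretend too large) weight matrix
    W_shards = [w.to(d) for w, d in zip(W.chunk(2, dim=1), devices)]

    x = torch.randn(8, d_model)                       # a small batch of activations
    partials = [x.to(d) @ w for d, w in zip(devices, W_shards)]  # each device does its columns
    y = torch.cat([p.cpu() for p in partials], dim=1)            # gather: shape (8, d_ff)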

And then there is power consumption, of course. The Colossus supercomputer behind Grok AI, with its 100K GPUs, consumes an estimated 100 MW of power, so it does make a difference if the next fab process delivers the same performance at half the wattage.
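
The arithmetic behind that, as a small sketch; the ~1 kW per GPU (including cooling and other overhead) is my assumption, chosen to match the ~100 MW estimate above:

    # Rough power arithmetic; ~1 kW per GPU including cooling/overhead is an
    # assumption chosen to match the ~100 MW estimate above.
    gpus = 100_000
    watts_per_gpu = 1_000

    total_mw = gpus * watts_per_gpu / 1e6
    print(f"current process:                  {total_mw:.0f} MW")
    print(f"next process at half the wattage: {total_mw / 2:.0f} MW for the same performance")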

Therefore it is important to invest in smaller chip fabrication processes: to increase the size of the neural networks we are able to infer and train, to lower power consumption, and to increase efficiency.