NVIDIA's Next AI Bottleneck Is Distance, Not Compute

NVIDIA's next bottleneck is no longer compute. It is distance.

That sounds almost too simple for a company that has spent the last three years turning AI infrastructure into a geopolitical supply chain, but it is the cleanest way to read the optics story now coming into focus. NVIDIA can keep making bigger GPUs, denser racks and fatter switch fabrics, but at some point physics stops accepting executive optimism as a substitute for signal integrity. Copper has been the workhorse of the Blackwell era because it is cheap, reliable and miserly on power. It is also running out of runway.

The key detail from The Register's reporting is not the headline-grabbing promise of systems with more than 1,000 GPUs by 2028. It is the admission that current copper-based NVLink scale-up fabrics are already constrained by bandwidth and cable length. In the Grace Blackwell NVL72 generation, NVIDIA used a copper backplane with miles of cable to make 72 GPUs behave like one giant accelerator. At roughly 1.8 TB/s of NVLink bandwidth per GPU, those links stretch only a few feet before the signal degrades badly enough to dictate the physical layout. That is why so much of the rack looks like it was designed by someone trying to win a cable-management bet against Maxwell's equations.
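The physics intuition is worth making concrete. Copper attenuation grows roughly with the square root of frequency, so a fixed channel loss budget buys less distance at every speed bump. The toy model below sketches that scaling; the loss budget, attenuation constant and PAM4 signaling assumption are illustrative placeholders, not measured NVLink or cable parameters. The takeaway is only the trend: each doubling of lane rate cuts reach by roughly 30 percent.

```python
import math

# Toy model of copper reach vs. lane rate. Skin effect makes attenuation
# grow roughly with sqrt(frequency), so a fixed loss budget buys less
# distance as lanes get faster. All constants are assumptions.
LOSS_BUDGET_DB = 30.0            # assumed end-to-end channel loss budget
ATTEN_DB_PER_M_AT_1GHZ = 2.0     # assumed twinax attenuation at 1 GHz

def copper_reach_m(lane_gbps: float) -> float:
    baud_gbd = lane_gbps / 2                 # PAM4 carries 2 bits per symbol
    nyquist_ghz = baud_gbd / 2               # highest frequency that matters
    loss_per_m = ATTEN_DB_PER_M_AT_1GHZ * math.sqrt(nyquist_ghz)
    return LOSS_BUDGET_DB / loss_per_m

for rate in (112, 224, 448):
    print(f"{rate} Gb/s lane -> ~{copper_reach_m(rate):.1f} m of usable reach")
# 112 Gb/s -> ~2.8 m, 224 Gb/s -> ~2.0 m, 448 Gb/s -> ~1.4 m
```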

Jensen Huang's GTC comments made the company's position unusually explicit. NVIDIA needs more capacity for copper, more capacity for optics and more capacity for co-packaged optics because the scale target has moved again. Vera Rubin NVL576 and the later Feynman NVL1152 systems are the architectural tell here. NVIDIA is not talking about optics as a nice-to-have network upgrade for hyperscalers. It is making optical interconnects part of the core scale-up story for its own AI factories.

This is a packaging story dressed up as a networking story

Most coverage of photonics in AI infrastructure defaults to the same broad point: optical links can move more data farther than copper. True, but incomplete. The more interesting shift is where NVIDIA is choosing to spend engineering complexity. In the Blackwell generation, complexity lived in the rack. You could brute-force your way to an NVL72 domain with enough copper, enough switching and enough tolerance for a 120 kilowatt machine that feels like a thermal engineering stress test. In the Rubin and Feynman era, complexity migrates into packaging, module design and supply chain control.

That is why the company's recent spending matters. NVIDIA has put billions into optics-adjacent players including Marvell, Coherent and Lumentum. Coherent and Lumentum strengthen access to laser supply, which becomes more important as co-packaged optics moves from conference demo to manufacturing dependency. Marvell matters because the line between optical I/O, custom XPUs and coherent memory fabrics is getting blurry fast. If NVIDIA wants NVLink to span more racks, more accelerators and potentially more custom silicon built by partners, optics stops being a component choice and becomes a platform choice.

There is a second-order effect here that deserves more attention. Once optical scale-up is part of the product roadmap, the competitive moat shifts a little. CUDA still matters. Blackwell and Rubin still matter. But the hard problem becomes end-to-end system integration: which company can coordinate GPUs, switches, lasers, packaging, thermals and topology well enough that the cluster behaves like one machine instead of a rack cemetery full of expensive edge cases. That is a much harder market for a fast follower to enter.

Copper is not dead. It is being pushed down the stack

One of the smarter details in NVIDIA's messaging is that it is not pretending optics will erase copper overnight. Huang said the answer is both. Early Rubin systems are expected to keep copper inside the rack and use optical links higher up the topology. That is the pragmatic move. Copper still wins when the distances are short, the cost sensitivity is real and the power budget is under pressure. Optics wins when the ambition is to treat multiple racks as one scale-up domain without making cable length the primary systems architect.

That hybrid approach matters for practitioners because it means the transition will be uneven. Operators planning clusters over the next two years should not assume a clean generational cutover where copper equals old and optical equals new. What is more likely is a layered environment where in-rack links, switch design and rack topology continue to look familiar while spine layers and multi-rack fabrics start changing quickly. Software teams may not care what medium carries the bits, but platform teams absolutely should, because the failure domains, latency profiles, serviceability assumptions and power envelopes will change with it.
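Here is a minimal sketch of how a platform team might encode that layered reality. The tier names, reach figures and serviceability rules are invented for illustration, not vendor specifications; the point is that the medium carries operational consequences, not just bandwidth.

```python
from dataclasses import dataclass

@dataclass
class FabricTier:
    name: str
    medium: str
    reach_m: float        # rough usable link length
    swap_in_place: bool   # can a failed link be serviced without draining the domain?

# Placeholder values for a hybrid Rubin-era layout: copper stays in-rack,
# optics takes over across racks. None of these numbers are vendor specs.
TIERS = [
    FabricTier("in-rack scale-up", "copper", 2.0, True),
    FabricTier("cross-rack scale-up", "co-packaged optics", 30.0, False),
    FabricTier("spine / scale-out", "pluggable optics", 500.0, True),
]

def failure_domain_gpus(tier: FabricTier, gpus_per_rack: int = 72) -> int:
    # Illustrative rule of thumb: if a link cannot be swapped in place,
    # a failure can sideline a whole rack rather than one endpoint.
    return gpus_per_rack if not tier.swap_in_place else 1

for t in TIERS:
    print(f"{t.name:22s} {t.medium:20s} blast radius ~{failure_domain_gpus(t)} GPU(s)")
```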

This is also where the co-packaged optics conversation gets real. Traditional pluggable optics are workable, but power-hungry at the scale NVIDIA wants. Huang noted last year that using pluggable optics for systems like these would have added roughly 20 kilowatts. That is not a rounding error in an already brutal rack budget. Co-packaged optics reduces the number of pluggables and pushes the optical engine closer to the ASIC. The architectural payoff is obvious. The operational downside is that serviceability gets harder and the manufacturing dependency chain gets tighter. Put differently, AI infrastructure is starting to look even more like advanced semiconductor manufacturing and less like conventional datacenter procurement.
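The arithmetic behind that 20 kilowatt figure is easy to approximate. Take the NVL72 numbers cited earlier as given and assume 800G-class pluggable modules at a typical mid-teens wattage; both the module speed and the per-module draw are assumptions, but they land in the right neighborhood.

```python
GPUS = 72
NVLINK_TBPS_PER_GPU = 1.8        # per-GPU scale-up bandwidth, as cited above

# Assumptions for illustration: 800G-class modules at ~15 W each. Real
# module counts depend on topology; treat this as a sanity check, not a BOM.
MODULE_GBPS = 800
MODULE_WATTS = 15

aggregate_gbps = GPUS * NVLINK_TBPS_PER_GPU * 8_000   # TB/s -> Gb/s
modules = aggregate_gbps / MODULE_GBPS
power_kw = modules * MODULE_WATTS / 1_000

print(f"~{modules:.0f} modules, ~{power_kw:.0f} kW of optics overhead")
# ~1296 modules, ~19 kW -- roughly the figure Huang cited
```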

What builders should do now

If you run infrastructure, this is the moment to stop treating interconnect as a boring line item. The economics of large-model training and inference are increasingly dominated by the efficiency of the whole system, not just the FLOPS number on the accelerator data sheet. Teams buying or designing clusters should be asking vendors uncomfortable questions about topology, oversubscription, cable reach, optical roadmaps and what happens when one specialty supplier misses a quarter. If the answer is just "we have more GPUs," that answer is already incomplete.
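To make one of those uncomfortable questions concrete, here is the basic oversubscription check, with hypothetical port counts standing in for a real line card.

```python
def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Ratio of host-facing to spine-facing capacity; 1.0 is non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Hypothetical leaf: 48 x 800G down to hosts, 8 x 800G up to the spine -> 6:1.
# Tolerable for some inference traffic, painful for all-reduce-heavy training.
print(oversubscription(48, 800, 8, 800))   # 6.0
```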

If you build distributed training or inference software, assume bigger failure domains are coming. Systems that span many more accelerators inside a tighter scale-up domain will shift performance expectations upward, but they will also make scheduling, observability and blast radius more consequential. Software that was merely good enough on smaller fabrics will get exposed when clusters behave more like giant shared-memory machines with extremely expensive edge cases.
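A minimal sketch of what blast-radius awareness can mean in practice, assuming a hypothetical scheduler that knows which scale-up domain each free accelerator lives in. The greedy policy and domain names are invented for illustration, not any real scheduler's API.

```python
def place_job(free_gpus: dict[str, int], gpus_needed: int) -> list[tuple[str, int]]:
    """Greedy placement that spans as few scale-up domains as possible,
    minimizing cross-domain (likely optical) hops and shared failure modes."""
    placement = []
    # Fill from the domains with the most free GPUs first.
    for domain, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        if gpus_needed <= 0:
            break
        take = min(free, gpus_needed)
        placement.append((domain, take))
        gpus_needed -= take
    if gpus_needed > 0:
        raise RuntimeError("not enough free GPUs across all domains")
    return placement

print(place_job({"rack-a": 60, "rack-b": 72, "rack-c": 12}, 100))
# [('rack-b', 72), ('rack-a', 28)]
```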

And if you are a startup building around AI infrastructure, watch where NVIDIA is leaving surface area. Optical scale-up creates openings in validation, telemetry, fault isolation, thermal management and lifecycle operations. The headlines will go to lasers and photonics. The money may end up going to whoever makes these systems less miserable to deploy.

The larger editorial point is simple: NVIDIA is no longer just extending the GPU roadmap. It is rewriting the physical assumptions underneath the AI datacenter. Optical interconnects are not marketing garnish on top of Blackwell and Rubin. They are the mechanism that lets NVIDIA keep selling the bigger-box future it has already promised Wall Street and hyperscalers. The company learned the same lesson the rest of high-performance computing eventually learns: when you cannot shrink the workload, you have to shrink the distance between the parts that matter, or find a better medium for the distance that remains.

That medium is increasingly light. And once that becomes true, the AI race starts looking less like a chip war and more like a systems engineering war with lasers attached.

Sources: The Register, NVIDIA Newsroom, The Next Platform