
Megatron Bridge on GB10 Shows the Painful Gap Between ‘Supported’ and ‘Usable’

Anatoliy Kolodkin

09 May 2026 • 5 min read

The useful NVIDIA forum posts are not always the polished ones. Sometimes the useful one is a frustrated builder saying Megatron Bridge is eating “endless VRAM” on a two-node GB10 setup and falling over at batch sizes where the simpler Hugging Face path survives.

That is not a definitive indictment of Megatron Bridge. It is better than that: it is an operationally honest report from the uncomfortable middle of NVIDIA’s AI stack. At one end, Hugging Face Transformers, Unsloth-style fine-tuning paths, and vLLM inference recipes are familiar enough that builders can usually get something running. At the other end, Megatron and Megatron Bridge exist to unlock serious distributed training, mixed precision, model conversion, tensor parallelism, and Transformer Engine performance. GB10/DGX Spark-class hardware sits in between: powerful enough to tempt developers into enterprise training stacks, constrained enough that every framework overhead and memory accounting mistake becomes visible immediately.

The gap between “supported” and “usable” lives there.

Two GB10s are already a distributed systems problem

The May 9 thread reports a concrete setup, which is why it is worth covering. The user is running nvcr.io/nvidia/nemo:26.04.00 on two GB10 systems connected with 200GbE InfiniBand. The goal is Qwen 3.5 9B training with a rank-64 PEFT/LoRA-style configuration, sequence length 2048, tensor parallelism set to 2, and FP8 ambitions through Transformer Engine. The launch path uses torchrun, scripts/training/run_recipe.py, a preloaded VLM dataset, a Qwen3-VL step function, local tokenizer/data paths, and recipe overrides.

The reported result: VRAM balloons and the run OOMs even at very low batch sizes and context lengths, including when sharding with tensor parallelism. A follow-up says the run appears to allocate far more VRAM than it actually uses, with screenshots attached. The author also says equivalent Hugging Face Transformers paths behave better.
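One way to make the "allocates more than it uses" observation concrete is to compare what PyTorch's caching allocator has reserved against what live tensors actually occupy. The sketch below is a minimal illustration, not something taken from the thread: MemorySnapshot, describe, and snapshot_from_torch are hypothetical names, and the only real PyTorch calls are torch.cuda.memory_allocated and torch.cuda.memory_reserved.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class MemorySnapshot:
    """Hypothetical helper type; not part of PyTorch or Megatron Bridge."""
    allocated_bytes: int  # bytes occupied by live tensors
    reserved_bytes: int   # bytes the caching allocator holds from the device

    @property
    def slack_bytes(self) -> int:
        # Memory that is reserved but not backing any live tensor.
        return self.reserved_bytes - self.allocated_bytes

    @property
    def slack_fraction(self) -> float:
        if self.reserved_bytes == 0:
            return 0.0
        return self.slack_bytes / self.reserved_bytes


def describe(snapshot: MemorySnapshot) -> str:
    """Format a snapshot as one log-friendly line."""
    return (
        f"allocated={snapshot.allocated_bytes / GIB:.2f} GiB "
        f"reserved={snapshot.reserved_bytes / GIB:.2f} GiB "
        f"slack={snapshot.slack_fraction:.0%}"
    )


def snapshot_from_torch(device: int = 0) -> MemorySnapshot:
    """Read the two counters from a running PyTorch process."""
    import torch  # imported lazily so the helpers above work without a GPU

    return MemorySnapshot(
        allocated_bytes=torch.cuda.memory_allocated(device),
        reserved_bytes=torch.cuda.memory_reserved(device),
    )
```

Logging describe(snapshot_from_torch()) at the same step in both stacks gives a comparison that a screenshot of nvidia-smi cannot: a large slack fraction points toward allocator caching or fragmentation, while a large allocated figure points toward the tensors themselves.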

That comparison matters. If every stack OOMs, the lesson may simply be that the model and recipe do not fit. If the simpler stack fits and the high-performance stack does not, the question becomes sharper: is this a configuration issue, a recipe maturity issue, a container issue, a memory fragmentation problem, an activation accounting problem, a Transformer Engine path mismatch, or overhead that is not amortized at GB10 scale?

None of those answers is exotic. Training frameworks carry machinery. Tensor parallelism can reduce per-device parameter load, but it does not erase activation memory, optimizer state, communication buffers, temporary allocations, CUDA graph behavior, framework bookkeeping, or distributed launcher overhead. PEFT helps, but it is not a spell. FP8 can improve memory and throughput when the kernels, model path, recipe, and hardware support line up cleanly. When they do not, “FP8 capable” becomes a procurement phrase rather than a working training run.
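A back-of-envelope calculation shows why tensor parallelism and PEFT shrink only part of the bill. The sketch below counts just the static pieces: frozen base weights split evenly across tensor-parallel ranks, plus per-parameter state for the LoRA adapters. Everything in it is an assumption for illustration: the function names are hypothetical, the 16 bytes per trainable parameter is a rough stand-in for gradients plus Adam-style optimizer state, adapters are assumed not to be sharded, and the layer shapes are placeholders rather than the real Qwen 3.5 9B architecture.

```python
GIB = 1024 ** 3


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    # LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


def estimate_static_memory_gib(
    total_params: int,
    tp_size: int,
    weight_bytes: int,
    trainable_params: int,
    state_bytes_per_trainable: int = 16,  # assumption, see lead-in
) -> dict:
    """Per-device static memory in GiB, ignoring everything dynamic."""
    # Assumption: base weights are frozen and split evenly across TP ranks.
    frozen = total_params / tp_size * weight_bytes
    # Assumption: adapters are replicated, not sharded, on every rank.
    trainable = trainable_params * state_bytes_per_trainable
    return {
        "frozen_weights_gib": frozen / GIB,
        "trainable_state_gib": trainable / GIB,
        "static_total_gib": (frozen + trainable) / GIB,
    }


if __name__ == "__main__":
    # Placeholder shapes: 32 layers, 4 adapted 4096x4096 projections each.
    adapters = 32 * 4 * lora_params(4096, 4096, rank=64)
    estimate = estimate_static_memory_gib(
        total_params=9_000_000_000,
        tp_size=2,
        weight_bytes=2,  # BF16 weights
        trainable_params=adapters,
    )
    for name, gib in estimate.items():
        print(f"{name}: {gib:.2f}")
```

Whatever number this produces is a floor, not a forecast. Activations, communication buffers, temporary allocations, and framework bookkeeping all sit on top of it, and those are exactly the terms the paragraph above says tensor parallelism does not erase.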

Megatron Bridge is moving fast, which means edges are part of the product surface

NVIDIA positions Megatron Bridge as a PyTorch-native NeMo library for pretraining, supervised fine-tuning, LoRA, Hugging Face/Megatron checkpoint conversion, model-parallel training, and mixed precision paths including FP8, BF16, and FP4. The project is real and active. Megatron Bridge 0.4.0, released in April, added support for Kimi 2.5, Nemotron 3 Super, Qwen 3.5 VL, MiniMax M2, Sarvam, MiMo, diffusion models, sequence-packing improvements, FP8 export, pruning and quantization, Transformers 5.x compatibility, and Python 3.12 migration. Qwen3.6-35B-A3B support landed shortly afterward through the existing Qwen3.5-VL bridge, with Hugging Face to Megatron conversion and inference verified.

That pace is impressive. It is also exactly the kind of pace that creates sharp edges. A framework supporting a model family does not mean every downstream recipe is memory-stable across every workstation-class Blackwell configuration, container version, precision mode, and distributed topology. This is not unique to NVIDIA. It is what happens when high-performance ML infrastructure tries to compress enterprise-scale assumptions into smaller developer systems.

The related GB10 forum thread from earlier in the week points in the same direction: another user asked whether Megatron training with the NeMo/Megatron connector was unsupported on GB10 after seeing poor performance under Kubernetes with the same nvcr.io/nvidia/nemo:26.04 container, while raw eager PyTorch and more efficient fine-tuning paths performed materially better. Two threads are not a statistical study. They are enough to call the pattern worth watching.

The procurement lesson: validate the exact recipe, not the logo

For teams evaluating GB10, DGX Spark, or similar local NVIDIA systems, the lesson is not “avoid Megatron Bridge.” The lesson is to build a validation matrix before treating any stack as production-ready. Test a Transformers baseline. Test an efficient fine-tuning baseline. Test the Megatron Bridge recipe. Test single-node and two-node. Test tensor parallelism on and off. Test FP8, BF16, gradient checkpointing, sequence length scaling, and batch-size scaling. Capture memory telemetry, not just success or failure. Change one variable at a time.
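The "change one variable at a time" rule is easy to state and easy to break by hand, so it can help to generate the run list. The sketch below is one hypothetical way to do that: BASELINE, ALTERNATIVES, and the function names are made up for illustration, the stack values are plain labels rather than launcher arguments, and the validity check assumes one GPU per node.

```python
BASELINE = {
    "stack": "transformers",
    "nodes": 2,
    "tensor_parallel": 1,
    "precision": "bf16",
    "grad_checkpointing": False,
    "seq_len": 2048,
    "micro_batch": 1,
}

# Values to try for each variable, one variable at a time.
ALTERNATIVES = {
    "stack": ["efficient-finetune", "megatron-bridge"],
    "nodes": [1],
    "tensor_parallel": [2],
    "precision": ["fp8"],
    "grad_checkpointing": [True],
    "seq_len": [512, 1024],
    "micro_batch": [2, 4],
}


def fits_topology(run: dict) -> bool:
    # Assumption: one GPU per node, so TP cannot exceed the node count.
    return run["tensor_parallel"] <= run["nodes"]


def one_at_a_time(baseline: dict, alternatives: dict, is_valid=fits_topology) -> list:
    """Return the baseline plus every run that differs from it in one key."""
    runs = [dict(baseline)]
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to the baseline, nothing to learn
            run = dict(baseline)
            run[key] = value
            if is_valid(run):
                runs.append(run)
    return runs


if __name__ == "__main__":
    for index, run in enumerate(one_at_a_time(BASELINE, ALTERNATIVES)):
        changed = {k: v for k, v in run.items() if v != BASELINE[k]}
        print(index, changed or "baseline")
```

Each run should record peak memory alongside pass or fail. Because every non-baseline row differs in exactly one setting, a memory jump can be pinned to that setting instead of argued about.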

That sounds tedious because it is. It is also cheaper than buying hardware around a workflow that only fits in the abstract. “Supported” often means the model can be converted, launched, or exercised in a known-good path. “Usable” means your exact model, data shape, PEFT configuration, context length, precision mode, container, driver, networking, and failure budget work without a heroic engineer watching nvidia-smi like a heart monitor.

Builders should also be honest about when the enterprise stack is unnecessary. If the job is a small LoRA fine-tune on Qwen 9B, a simpler path that fits reliably may beat a sophisticated path that theoretically unlocks better performance but burns days of debugging. High-performance infrastructure earns its keep when the workload is large enough and stable enough to amortize its complexity. On constrained local systems, complexity has to prove itself.

NVIDIA can make this easier. GB10-specific recipes would help. Known-good container matrices would help. Expected VRAM budgets per model and recipe would help even more. So would troubleshooting docs that distinguish “your batch size is too high” from “this path is allocating unexpected buffers under TP=2.” The audience for local AI hardware is increasingly made of builders who are capable but not full-time Megatron specialists. Good defaults are part of the product.
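Until such docs exist, builders can approximate that distinction themselves: record peak memory at a few batch sizes and fit a straight line. The slope is the per-sample cost and the intercept is whatever the run costs before any batch is counted. The sketch below is a minimal version under stated assumptions: fit_line and diagnose are hypothetical names, the 0.8 threshold is arbitrary, and the example measurements are invented for illustration rather than taken from either forum thread.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def diagnose(batch_sizes, peak_gib, budget_gib, overhead_share=0.8):
    """Classify a run by whether the fixed cost alone nearly fills memory."""
    _, intercept = fit_line(batch_sizes, peak_gib)
    # overhead_share is an arbitrary cutoff chosen for this sketch.
    if intercept >= overhead_share * budget_gib:
        return "fixed overhead dominates: a smaller batch will not help much"
    return "memory scales with batch size: a smaller batch should help"


if __name__ == "__main__":
    # Invented measurements: peak GiB observed at micro-batch 1, 2, and 4.
    batch_sizes = [1, 2, 4]
    peak_gib = [52.0, 54.0, 58.0]
    slope, intercept = fit_line(batch_sizes, peak_gib)
    print(f"per-sample cost ~{slope:.1f} GiB, fixed cost ~{intercept:.1f} GiB")
    print(diagnose(batch_sizes, peak_gib, budget_gib=60.0))
```

A straight line is a simplification, since activation memory also depends on sequence length and checkpointing, but even a crude fit turns "it OOMs" into a claim about which term is responsible.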

The bigger trend is that local AI builders are inheriting cluster problems earlier than expected. Two boxes and a 200GbE link already introduce distributed launchers, tensor parallelism, container compatibility, memory instrumentation, and network behavior. That is a lot of infrastructure for something often marketed as a workstation story. The honest forum post is valuable because it punctures the brochure version. Serious local AI is still infrastructure work.

Megatron Bridge may be the right long-term path for serious NVIDIA training workflows. The May 9 GB10 report does not disprove that. It simply reminds everyone that the path from capability to reliability runs through logs, memory traces, and small reproducible failures. “Looks good on the architecture diagram” is not the same as “LGTM” in production.

Sources: NVIDIA Developer Forums, NVIDIA-NeMo/Megatron-Bridge, Megatron Bridge documentation, related GB10 Megatron thread, vLLM Qwen guide
