NVENC Is Becoming the Weirdly Practical Fabric for Local Multi-GPU AI

Anatoliy Kolodkin

16 May 2026

Consumer multi-GPU AI usually dies in the gap between what the hardware theoretically has and what the runtime can actually use. Two cards in two boxes might add up to enough VRAM on paper. In practice, the missing fabric turns that second GPU into an expensive space heater unless the workload can tolerate slow coordination. ComfyUI Mesh is interesting because it attacks that boring gap with a wonderfully sideways idea: use NVIDIA’s video encoder blocks as an activation transport.

The project, created by shootthesound, splits FLUX.2 across two NVIDIA GPUs using an Icarus ComfyUI client node and a Daedalus back-half server. The two cards can live in the same machine, across a LAN, or even across a VPN. Instead of shipping raw activation tensors over the wire, ComfyUI Mesh packs them into video-like frames, quantizes them per channel to uint8, compresses them through NVENC as HEVC, sends the bitstream over TCP, and decodes them with NVDEC on the receiving GPU.
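To make the "quantize per channel to uint8, then pack into frames" step concrete, here is a minimal NumPy sketch. It is not the project's code: the function names, the min/max scaling scheme, and the grid tiling layout are assumptions made for illustration, and the actual NVENC encode step is left out entirely.

```python
import numpy as np

def quantize_per_channel(x: np.ndarray):
    """Map a float (C, H, W) tensor to uint8 with one offset/scale pair per channel."""
    lo = x.min(axis=(1, 2), keepdims=True)
    hi = x.max(axis=(1, 2), keepdims=True)
    # Guard against constant channels, where hi == lo would divide by zero.
    scale = np.maximum(hi - lo, 1e-8) / 255.0
    q = np.clip(np.rint((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, lo, scale

def dequantize_per_channel(q: np.ndarray, lo: np.ndarray, scale: np.ndarray):
    """Invert the mapping on the receiving side (up to rounding error)."""
    return q.astype(np.float32) * scale + lo

def tile_channels(q: np.ndarray, cols: int) -> np.ndarray:
    """Lay the C channel planes out as a grid inside one 2-D grayscale frame."""
    c, h, w = q.shape
    rows = -(-c // cols)  # ceiling division
    frame = np.zeros((rows * h, cols * w), dtype=np.uint8)
    for i in range(c):
        r, col = divmod(i, cols)
        frame[r * h:(r + 1) * h, col * w:(col + 1) * w] = q[i]
    return frame

# Toy activation tensor standing in for a real FLUX.2 intermediate (assumption).
rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 16, 16)).astype(np.float32)
q, lo, scale = quantize_per_channel(acts)
frame = tile_channels(q, cols=4)            # what a video encoder would see
restored = dequantize_per_channel(q, lo, scale)
```

The per-channel offsets and scales have to travel alongside the bitstream, since the receiver cannot undo the quantization without them. The rounding error per element is bounded by half a quantization step for that channel, which is the part of the pipeline that is lossy before any codec setting is involved.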

That sounds cursed until you remember the hardware reality. Modern NVIDIA cards ship with dedicated NVENC/NVDEC silicon that is separate from CUDA cores and often idle during ML inference. NVIDIA’s own Video Codec SDK documentation explicitly lists video data compression and decompression for deep learning as a suitable application. ComfyUI Mesh turns that footnote into a working local-AI fabric.

The clever part is not compression. It is using idle silicon at the bottleneck.

The headline number is the hook: FLUX.2 Klein 9B at 1024×1024 in about 4.4 seconds per image across an RTX 5090 and RTX 4090 over plain gigabit Ethernet. The project reports roughly 0.5 seconds of total wire overhead across four sampler timesteps, with about 130 milliseconds per timestep for encode, LAN transfer, remote forward, LAN return, and decode. Its diagram shows around 10 MB crossing the wire per step.
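The reported figures can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes an ideal 1 Gb/s link with no protocol overhead, and it assumes the roughly 10 MB per step is the total traffic in both directions, which the source does not spell out. The variable and function names are mine.

```python
def ideal_transfer_ms(payload_mb: float, link_gbps: float) -> float:
    """Best-case time to push a payload through a link, ignoring TCP/IP overhead."""
    bits = payload_mb * 8e6          # decimal megabytes to bits
    return bits / (link_gbps * 1e9) * 1e3

STEPS = 4
PER_STEP_MS = 130.0     # reported round trip per sampler timestep
PAYLOAD_MB = 10.0       # reported traffic per step (direction split is an assumption)

total_overhead_s = STEPS * PER_STEP_MS / 1e3
wire_ms = ideal_transfer_ms(PAYLOAD_MB, link_gbps=1.0)
remaining_ms = PER_STEP_MS - wire_ms   # left for encode, remote forward, decode

print(f"{total_overhead_s:.2f} s total, {wire_ms:.0f} ms on the wire, "
      f"{remaining_ms:.0f} ms for everything else")
```

Four steps at 130 ms come to 0.52 seconds, which matches the "roughly 0.5 seconds" claim. Under these assumptions the wire alone accounts for about 80 ms of each step, so the link, not the codec, would still be the dominant cost per timestep.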

Those numbers need independent reproduction before anyone treats them as procurement-grade. But the shape of the result is plausible, and that is the important part. Diffusion activations are large, network links are weak, and consumer GPUs no longer give local builders a clean NVLink story. If compression can move the inter-GPU boundary from “unusable over gigabit” to “annoying but workable,” the economics of local image-generation rigs change.

The underlying torch-nvenc-compress repo makes the broader claim explicit: 6.1× lossless compression on FLUX activations, 3× lossless compression on LLM KV cache, and sub-millisecond codec timings on its MultiEngineDirectBackend, with 0.180 ms encode and 0.262 ms decode in the cited numbers. That does not mean every end-to-end workload becomes 6× faster. Systems never let you cash the whole benchmark check. Scheduling, synchronization, kernel overlap, and transport behavior take their cut. But it does mean the primitive is real enough to deserve attention.
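Why a 6.1× compression ratio does not turn into a 6.1× speed-up is the usual Amdahl-style argument: only the transfer portion of a step shrinks. The sketch below uses a hypothetical step with 50 ms of compute and 480 ms of raw transfer. Those two numbers are invented for illustration, and applying the cited codec timings to a payload of that size is also an assumption.

```python
def step_speedup(compute_ms: float, transfer_ms: float,
                 ratio: float, codec_ms: float) -> float:
    """End-to-end speedup of one step when only the transfer is compressed."""
    before = compute_ms + transfer_ms
    after = compute_ms + transfer_ms / ratio + codec_ms
    return before / after

# Hypothetical step: 50 ms compute, 480 ms raw transfer, 6.1x compression,
# plus the cited 0.180 ms encode + 0.262 ms decode.
speedup = step_speedup(50.0, 480.0, ratio=6.1, codec_ms=0.180 + 0.262)
print(f"{speedup:.2f}x")
```

Even in this transfer-dominated case the step gets roughly 4.1× faster rather than 6.1×, and the gap widens as compute takes a larger share of the step.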

Local AI is becoming a topology problem.

Supported models today are FLUX.2 Dev and FLUX.2 Klein 9B. The roadmap lists Wan, LTX-Video, FLUX.1, SD3/SD3.5, HunyuanVideo, Qwen-Image, and Chroma. That matters because image and video generation are where local builders most visibly hit the wall: huge models, high-resolution outputs, VRAM pressure, and hardware that exists in mismatched piles rather than clean datacenter racks.

The project’s first-day traction reflects that reality. At research time, shootthesound/comfyui-mesh had 48 GitHub stars and 4 forks, while the codec repo had 18 stars and 2 forks. More telling, the launch thread on r/StableDiffusion had 415 upvotes and 90 comments. That community is not abstractly excited about distributed systems. It has 4090s, 5090s, older 3080s, laptops, spare boxes, and a very concrete desire to run models that barely fit.

This is why the project is more than a ComfyUI novelty. It points at a missing abstraction in consumer AI runtimes: activation and KV transport should be a first-class concern. Today, local builders choose between buying one larger card, accepting slow CPU offload, manually sharding models, or giving up. Datacenter inference stacks have sophisticated routing, cache management, prefill/decode separation, and fabrics designed for the workload. The home and prosumer side gets PCIe, Ethernet, maybe Thunderbolt, and a prayer.

ComfyUI Mesh is not a clean universal answer. It is a proof that the transport layer can be workload-aware. Diffusion activations can tolerate certain lossy modes better than other tensors. LLM KV cache has a different failure profile. Gradients for training are different again. A wrong pixel is visible but often acceptable; a corrupted attention state can poison a generation; repeated gradient error can quietly degrade training. The engineering lesson is not “turn every tensor into a video.” It is “validate compression against the semantics of the tensor you are moving.”

The operational advice: test quality, not just speed.

If you are a ComfyUI user who is VRAM-bound on FLUX.2 and have another NVIDIA card reachable over LAN, this is worth testing in a sandbox. Use the project defaults first: NVENC mode, QP 18, matching remote block counts between Icarus and Daedalus, and fixed seeds for output comparison. Measure wall-clock time, server byte counts, per-step latency, GPU memory pressure, and image quality against a single-GPU baseline if you can run one.
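For the image-quality half of that checklist, one crude but repeatable starting point is PSNR between the single-GPU baseline and the Mesh output rendered with the same seed. The sketch below is my own helper, not part of the project, and it assumes both images are already loaded as uint8 arrays of identical shape.

```python
import numpy as np

def psnr(reference: np.ndarray, candidate: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = reference.astype(np.float64)
    cand = candidate.astype(np.float64)
    mse = np.mean((ref - cand) ** 2)
    if mse == 0.0:
        return float("inf")   # bit-identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

# Stand-in arrays; in practice these would be the two rendered images.
baseline = np.zeros((4, 4, 3), dtype=np.uint8)
mesh_out = baseline + 1       # every pixel off by one level
print(f"{psnr(baseline, mesh_out):.2f} dB")
```

PSNR only measures pixel-level drift, so it is a screening number rather than a verdict. Two images can score well and still differ in ways a viewer notices, which is why a side-by-side look at fixed-seed outputs belongs in the same test.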

Pay attention to the limitations. The server is single-tenant. The sampler path is sequential request/response. CUDA-stream overlap is not implemented in ComfyUI Mesh yet. Same-host dual-GPU support is prepared but not fully end-to-end tested by the author, and the README sensibly recommends raw mode rather than NVENC for same-host PCIe because PCIe bandwidth is already high enough that codec overhead can lose. This is early software. Treat it like early software.
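The raw-versus-NVENC recommendation follows from a simple break-even: compression pays only when the wire time it saves exceeds the codec time it adds. The sketch below is an illustration with made-up inputs (a 60 MB raw payload, 6× compression, 20 ms of codec overhead, and a 200 Gb/s figure for a PCIe-class link). The strings "nvenc" and "raw" are labels for this example, not the project's configuration values.

```python
def step_transfer_ms(payload_mb: float, link_gbps: float,
                     ratio: float = 1.0, codec_ms: float = 0.0) -> float:
    """One-way transfer time; ratio=1 and codec_ms=0 models raw mode."""
    wire_ms = payload_mb * 8e6 / ratio / (link_gbps * 1e9) * 1e3
    return wire_ms + codec_ms

def pick_mode(payload_mb: float, link_gbps: float,
              ratio: float, codec_ms: float) -> str:
    """Return whichever mode moves the payload sooner under this simple model."""
    raw = step_transfer_ms(payload_mb, link_gbps)
    coded = step_transfer_ms(payload_mb, link_gbps, ratio, codec_ms)
    return "nvenc" if coded < raw else "raw"

# Hypothetical inputs: 60 MB raw payload, 6x compression, 20 ms codec overhead.
print(pick_mode(60.0, 1.0, 6.0, 20.0))     # gigabit Ethernet
print(pick_mode(60.0, 200.0, 6.0, 20.0))   # PCIe-class link, figure assumed
```

With these inputs the codec wins comfortably on gigabit Ethernet and loses on the fast link, where the raw transfer finishes before the encoder would. The model ignores overlap between compute and transfer, which matches the sequential request/response path described above.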

The LoRA handling is also a useful reminder that distributed local AI is mostly details. The README warns that large LoRAs, especially a roughly 2.5 GB FLUX.2 turbo LoRA, should not be forwarded across the wire because doing so can push the smaller server GPU into memory pressure and kill the speed-up. The recommended pattern is to load the large LoRA on both sides and avoid transmitting it. That is not glamorous, but it is the difference between a clever demo and a usable workflow.

For NVIDIA, the irony is excellent. NVENC was a streamer and media feature. In local AI, it may become an escape hatch for activation movement on machines that were never supposed to behave like a cluster. That does not replace CUDA, TensorRT, NIM, or proper distributed runtimes. It complements them at the messy edge where builders own mismatched GPUs and weak links.

The bigger takeaway is that local AI is becoming less about a single heroic GPU and more about topology: where the model lives, where activations move, which silicon is idle, which links are slow, and which tensors can be compressed without breaking the result. ComfyUI Mesh is rough, narrow, and very much a practitioner hack. That is exactly why it is worth paying attention to. The next useful local-AI runtime may look less like a model loader and more like a tiny scheduler for all the weird hardware you already own.

Sources: shootthesound/comfyui-mesh GitHub, torch-nvenc-compress, NVIDIA Video Codec SDK, Black Forest Labs FLUX.2 Klein 9B model card, r/StableDiffusion launch thread
