
CompileIQ Turns the Compiler Into an Inference Optimization Surface

Anatoliy Kolodkin

27 May 2026 • 4 min read

CompileIQ is not exciting because NVIDIA has discovered that compilers matter. Anyone who has stared at PTXAS output at midnight already knew that. It is interesting because NVIDIA is turning compiler behavior into something closer to a reviewed, benchmarked, versioned production artifact, which is exactly where the last few percent of inference performance should live.

The tool is a Python-based auto-tuning framework that searches internal NVCC and PTXAS compiler-control configurations for a specific workload. Instead of asking developers to accept generic compiler heuristics, CompileIQ benchmarks candidate configurations against a user-defined objective and emits an advanced controls file, or ACF, that can be applied during compilation with --apply-controls. In plain English: the compiler gets knobs, the benchmark chooses among them, and the winning configuration becomes part of the build.

The useful part is not magic tuning

NVIDIA says CompileIQ uses evolutionary and genetic algorithms to explore decisions like register allocation strategies, instruction scheduling policies, and loop transformations that are not exposed as ordinary public flags. It ships with CUDA 13.3-era search spaces for PTXAS and NVCC and can be installed with pip install compileiq. The examples are intentionally low ceremony: compile with a candidate controls file, run a benchmark, extract a score, repeat until the search converges.
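To make that loop concrete, here is a minimal sketch of a generational search, not CompileIQ's implementation or API. The knob names, the `evolve` function, and the synthetic scoring function are all hypothetical. In a real setup the score function would compile with a candidate controls file, run the benchmark, and parse a time.

```python
import random

# Hypothetical search space: these knob names are illustrative only and do
# not correspond to real NVCC or PTXAS controls.
SEARCH_SPACE = {
    "sched_policy": [0, 1, 2, 3],
    "reg_strategy": [0, 1, 2],
    "unroll_level": [0, 1, 2, 3, 4],
}


def random_config(rng):
    """Pick one value per knob uniformly at random."""
    return {knob: rng.choice(values) for knob, values in SEARCH_SPACE.items()}


def crossover(a, b, rng):
    """Build a child by taking each knob from one of the two parents."""
    return {knob: rng.choice([a[knob], b[knob]]) for knob in SEARCH_SPACE}


def mutate(config, rng, rate=0.3):
    """Re-roll each knob with probability `rate`."""
    child = dict(config)
    for knob, values in SEARCH_SPACE.items():
        if rng.random() < rate:
            child[knob] = rng.choice(values)
    return child


def evolve(score_fn, generations=10, pool_size=15, elite=5, seed=0):
    """Minimize score_fn over SEARCH_SPACE; returns (best_config, best_score)."""
    rng = random.Random(seed)
    pool = [random_config(rng) for _ in range(pool_size)]
    best = None
    for _ in range(generations):
        # Score every candidate in the pool (lower is better).
        scored = [(score_fn(c), c) for c in pool]
        scored.sort(key=lambda pair: pair[0])
        if best is None or scored[0][0] < best[0]:
            best = scored[0]
        # Keep the top candidates and refill the pool with their offspring.
        parents = [c for _, c in scored[:elite]]
        children = []
        while len(children) < pool_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pool = parents + children
    return best[1], best[0]


def synthetic_score(config):
    """Stand-in for 'compile, run the benchmark, extract a time'. Made-up numbers."""
    return (1.0
            + 0.002 * abs(config["sched_policy"] - 2)
            + 0.001 * abs(config["reg_strategy"] - 1)
            + 0.0005 * abs(config["unroll_level"] - 3))


best_config, best_score = evolve(synthetic_score)
```

The defaults mirror the 10-generation, pool-15 shape of the post's example. Everything else, including the elite count and mutation rate, is an arbitrary choice for illustration.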

The post includes a simple NVCC reduction example using a 10-generation, pool-15 search. The baseline was 0.777 ms, the optimized run was 0.770 ms, roughly a 1.01x speedup, and the sample search took 9 minutes and 29 seconds. That is not a fireworks demo, and that is good. It makes the tool feel like engineering rather than a press-release teleportation device.

NVIDIA also says production validation and a GTC session showed up to 15% improvement on both TritonBench and Helion kernels. That is the number inference teams will notice, because a 15% improvement on a dominant kernel is real money. But the smaller example is more honest about the workflow: sometimes the gain is tiny, sometimes the search takes time, and sometimes the real win is knowing that the compiler’s default choice was already good enough.

Inference teams should care because GEMM and attention pay the bills

The economic hook is straightforward. NVIDIA says GEMMs in linear layers plus Q/K/V/output projections account for about 70% of LLM inference FLOPs, while scaled dot-product, fused, and flash attention account for another 25%. Together, GEMM and attention represent more than 90% of end-to-end inference compute.

That is where compiler tuning becomes more than a specialist hobby. If your service spends most of its GPU budget in a small set of hot kernels, improving those kernels by even a modest percentage compounds across every request. A team serving internal coding agents, multimodal chat, embeddings, rerankers, or long-context retrieval systems may not care about the compiler in the abstract. It cares when the monthly GPU bill drops or p99 latency stops missing the SLO.
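A back-of-the-envelope version of that argument is Amdahl's law. The sketch below assumes the hot kernels' share of wall-clock time roughly tracks the FLOP share NVIDIA quotes (a simplification, since FLOPs and time are not the same thing), and it reads "15% improvement" as a 1.15x kernel speedup, which is also an assumption.

```python
def end_to_end_speedup(kernel_time_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only part of the runtime gets faster."""
    if not 0.0 <= kernel_time_fraction <= 1.0:
        raise ValueError("kernel_time_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    untouched = 1.0 - kernel_time_fraction
    accelerated = kernel_time_fraction / kernel_speedup
    return 1.0 / (untouched + accelerated)


# Assumed scenario: GEMMs take 70% of runtime and get 1.15x faster.
overall = end_to_end_speedup(0.70, 1.15)  # roughly 1.10x end to end
```

Under those assumptions a 15% kernel win becomes roughly a 10% service-level win. The same formula also explains why tuning a kernel that accounts for 5% of runtime rarely pays for the search.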

The mistake would be treating CompileIQ as a replacement for profiling. It is not. If the kernel is bad, the compiler will not rescue it. If the benchmark is fake, the search will optimize the fake benchmark. If the objective ignores p99 latency, warmup behavior, power, batch mix, or shape distribution, the resulting controls file may be perfectly optimized for the wrong workload. Auto-tuning makes search cheaper. It does not make measurement honest.

That is why the objective function is the product surface. The team defines what “better” means: runtime, compile time, power, or a multi-objective trade-off across them. CompileIQ can compute a Pareto frontier of non-dominated configurations, which is more useful than a single “fastest” answer in production. The fastest kernel may burn too much power for an edge box. The lowest-power configuration may be unacceptable for a latency-sensitive agent loop. A CI build may care about compile time. A batch inference job may care about throughput per watt.
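For readers who have not worked with Pareto frontiers, the idea fits in a few lines. This is a generic sketch of non-dominated filtering, not CompileIQ's code. The configuration names and the (runtime in ms, power in W) numbers are invented for illustration, and lower is assumed to be better on every objective.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the sorted names of candidates that no other candidate dominates.

    `candidates` maps a name to a tuple of objectives, lower is better.
    """
    front = []
    for name, objectives in candidates.items():
        dominated = any(
            dominates(other, objectives)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical results: (runtime_ms, power_watts) per configuration.
results = {
    "cfg_a": (0.770, 310.0),  # fastest, but power-hungry
    "cfg_b": (0.790, 280.0),  # slowest of the front, lowest power
    "cfg_c": (0.800, 320.0),  # worse than cfg_a on both axes
    "cfg_d": (0.775, 305.0),  # a middle trade-off
}
front = pareto_front(results)  # ['cfg_a', 'cfg_b', 'cfg_d']
```

Only cfg_c drops out, because cfg_a beats it on both axes. Choosing among the remaining three is a product decision, not a compiler decision.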

Version the ACF, or enjoy your future incident

The output of CompileIQ is an advanced controls file applied during compilation: for example, ptxas --apply-controls config.acf my_kernel.ptx or nvcc --apply-controls reduction_best_config.bin. That file should not become a mysterious blob passed around in Slack. It should be versioned next to the kernel source, tied to compiler and driver versions, validated in CI, and invalidated when the benchmark or workload changes.

This is the part many teams will underinvest in. Compiler-control artifacts create governance work. Someone has to know which shapes were benchmarked, which GPU architecture was used, what driver and compiler versions were in play, whether the win holds across representative traffic, and whether the controls file is still valid after a kernel rewrite. If that sounds boring, congratulations, you have found the actual infrastructure work.
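One lightweight way to carry that context is a metadata sidecar committed next to the ACF. The sketch below is an assumption about how a team might do it, not a CompileIQ feature. The field names, `build_metadata`, and `stale_reasons` are hypothetical, and the environment values would come from whatever the team's build system already records.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_metadata(acf_path, kernel_path, environment, shapes):
    """Describe the conditions under which an ACF was produced.

    `environment` is a dict such as {"compiler": ..., "driver": ..., "gpu_arch": ...}.
    The keys are hypothetical; record whatever the team considers relevant.
    """
    return {
        "acf_sha256": sha256_of(acf_path),
        "kernel_sha256": sha256_of(kernel_path),
        "environment": dict(environment),
        "benchmark_shapes": [list(shape) for shape in shapes],
    }


def stale_reasons(metadata, acf_path, kernel_path, environment):
    """Return a list of reasons the ACF should be re-tuned; empty means still valid."""
    reasons = []
    if sha256_of(acf_path) != metadata["acf_sha256"]:
        reasons.append("ACF file changed")
    if sha256_of(kernel_path) != metadata["kernel_sha256"]:
        reasons.append("kernel source changed")
    for key, recorded in metadata["environment"].items():
        current = environment.get(key)
        if current != recorded:
            reasons.append(f"{key} changed: {recorded!r} -> {current!r}")
    return reasons


def write_sidecar(metadata, sidecar_path):
    """Persist the metadata as JSON with stable key order, so diffs stay readable."""
    Path(sidecar_path).write_text(json.dumps(metadata, indent=2, sort_keys=True))
```

A CI job can call `stale_reasons` and fail the build when the list is non-empty. That turns "is this controls file still valid?" from tribal memory into a check.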

The upside is that this makes a previously artisanal layer more reviewable. Today, performance work often lives in scattered benchmark scripts, tribal memory, profiler screenshots, and a few carefully chosen flags. CompileIQ gives teams a path to encode some of that last-mile tuning as a reproducible artifact. That does not remove expert judgment; it gives experts a cleaner loop.

The tool is also early. During research, the GitHub repository had 8 stars, 0 forks, and 0 open issues, which is less an adoption signal than a timestamp. HN and Reddit had no meaningful direct discussion yet. That is normal for a compiler-tuning framework launched inside a CUDA release. The users who need this are not farming engagement. They are trying to shave latency from a kernel that already survived the obvious optimizations.

So the practical rollout should be narrow. Identify the top two or three kernels that dominate production cost. Lock a representative benchmark matrix: real shapes, real batch distributions, warmups, p50/p95/p99 latency, power where relevant, and enough repetitions to avoid fooling yourself. Run CompileIQ against those kernels only. Commit the winning ACFs with metadata. Add CI checks that catch regression or staleness after compiler, driver, hardware, or kernel changes.
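The measurement side of that matrix can be sketched with the standard library alone. This is a minimal harness, not a recommended production one. It assumes the callable passed in blocks until the GPU work has finished, because wall-clock timing of asynchronous kernel launches without synchronization measures launch overhead, not kernel time. The warmup and repeat counts are arbitrary placeholders.

```python
import statistics
import time


def measure(fn, warmup=10, repeats=200):
    """Time `fn` in milliseconds after discarding warmup iterations.

    Assumes `fn` blocks until its work is complete (for GPU code, that means
    it synchronizes before returning).
    """
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return samples_ms


def summarize(samples_ms):
    """Report mean plus p50, p95, and p99 latency from raw samples."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Running `summarize(measure(run_kernel))` once per shape and batch size in the matrix, where `run_kernel` is a placeholder for the team's own benchmark entry point, gives the search an objective that reflects tail latency and not only the average.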

For teams using Triton, CUTLASS, Helion, or custom CUDA in inference paths, that workflow is worth experimenting with. For teams still guessing where time goes, it is premature. The profiler comes first. CompileIQ belongs after you know the bottleneck and before you decide the only solution is more hardware.

The larger story is that the compiler has entered the inference runtime budget. Not as a theoretical layer below the application, but as a tunable component with measurable operational consequences. That is where it belongs. The last few percent should not live in folklore.

Sources: NVIDIA Developer Blog, NVIDIA CompileIQ, CompileIQ documentation, NVIDIA CUTLASS, Triton
