What 24 Studies Actually Say About AI Code Quality: Prompt Design and Developer Expertise Dominate Outcomes
The industry has spent two years arguing about which AI model produces the best code. A new systematic literature review synthesizing 24 empirical studies suggests that is largely the wrong question. The research, led by Vahid Garousi and colleagues, finds that developer expertise and the quality of task specification are the strongest predictors of AI-generated code quality, outweighing model selection in nearly every analysis across the reviewed studies. Experienced engineers who provide detailed, well-structured task context consistently get better output than less experienced engineers using the same model.
The findings cluster into three tiers. Human factors dominate: how engineers specify tasks, the context they provide, and the constraints they include correlate more strongly with correctness, security, and maintainability than any model-level variable. Prompt design is the second major factor, with structured, specific prompts that include explicit constraints — not just requirements — producing measurably better results. AI system characteristics (model size, temperature, training approach) do matter, but their effects interact heavily with the first two categories, making them context-dependent rather than stable advantages that transfer across teams.
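The review does not prescribe a prompt template, but the constraint-bearing structure it associates with better outputs can be sketched. The helper name, field labels, and example content below are illustrative assumptions, not taken from the paper:

```python
def build_prompt(task: str, context: str, constraints: list[str]) -> str:
    """Assemble a structured prompt: the task, its surrounding context,
    and explicit constraints rather than a bare requirement."""
    lines = [
        f"Task: {task}",
        f"Context: {context}",
        "Constraints:",
    ]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

prompt = build_prompt(
    task="Implement a retry wrapper for HTTP GET requests.",
    context="Python 3.11 service; idempotent calls via the 'requests' library.",
    constraints=[
        "Retry at most 3 times with exponential backoff.",
        "Raise after the final failure; never swallow exceptions.",
        "No new third-party dependencies.",
    ],
)
print(prompt)
```

The point is less the exact layout than that constraints are stated explicitly and separately, so the model cannot satisfy the requirement while violating an unstated rule.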
Perhaps the most practically significant finding concerns measurement itself. The paper documents substantial fragmentation across studies: correctness, security, maintainability, and complexity all serve as quality proxies, yet studies using different metrics reach inconsistent conclusions about the same tools. This fragmentation explains why internal quality metrics rarely correlate with public benchmarks, and why engineering teams need to define their own quality dimensions explicitly rather than borrowing them from the literature. For organizations, the implication is that investments in hiring, onboarding, and prompt-design standards likely offer more leverage than model upgrades. For teams building coding agents, the measurement fragmentation is a direct argument for building an in-house evaluation harness rather than relying on published benchmarks.