Anthropic Just Put 10 AI Agents Into Production Finance Work — and Published the Benchmark That Shows What It Can't Do Yet

Anatoliy Kolodkin

06 May 2026

Anthropic published a benchmark number this week that tells you exactly where Claude fails — and that is the most honest thing about the announcement.

The company released 10 financial services agent templates for production use with Claude, covering everything from KYC screening to earnings review to month-end close — and paired them with an Opus 4.7 score on the Vals AI Finance Agent benchmark: 64.37%. Anthropic called it "industry leading." The Register correctly noted that score would get a human analyst fired. Both things are true, and the gap between them is the actual story.

The 10 templates — pitch builder, meeting preparer, earnings reviewer, model builder, market researcher, valuation reviewer, general ledger reconciler, month-end closer, statement auditor, and KYC screener — are packaged as reference architectures combining three components: skills (markdown workflow files encoding domain knowledge), connectors (external service integrations), and subagents (specialized Claude API calls with focused system prompts and tool definitions). This is not a research demo. It is an architecture meant to be deployed.

The KYC screener is the clearest example of what Anthropic is actually selling. It produces structured JSON output: {"risk_rating": "low | medium | high", "disposition": "clear | request-docs | escalate-EDD | decline-recommend", "missing_documents": [...], "escalation_reasons": [...], "rule_outcomes": [...]}. That schema is not an AI output format. It is a compliance workflow format — the kind of structured output that integrates with existing corporate systems rather than requiring a human to interpret freeform text. Anthropic explicitly states users will "stay firmly in the loop — reviewing, iterating on, and approving Claude's work before it goes to a client, gets filed, or is acted on." The human stays in the loop because the benchmark says they have to.
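To make the "compliance workflow format" point concrete, here is a minimal sketch of how a downstream system might check a screener payload before routing it. The field names and allowed values come from the schema quoted above; the function name `validate_kyc_output` and the error messages are hypothetical, and because the published schema elides the shape of the list entries, the sketch checks only that those fields are lists.

```python
import json

# Allowed values taken from the schema quoted in the article.
RISK_RATINGS = ("low", "medium", "high")
DISPOSITIONS = ("clear", "request-docs", "escalate-EDD", "decline-recommend")
LIST_FIELDS = ("missing_documents", "escalation_reasons", "rule_outcomes")


def validate_kyc_output(raw: str) -> list[str]:
    """Return a list of problems found; an empty list means the payload passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    if data.get("risk_rating") not in RISK_RATINGS:
        problems.append("risk_rating must be one of low, medium, high")
    if data.get("disposition") not in DISPOSITIONS:
        problems.append("disposition is not a recognized value")
    for field in LIST_FIELDS:
        # The element shape is elided in the source ([...]), so only the
        # container type is checked here.
        if not isinstance(data.get(field), list):
            problems.append(f"{field} must be a list")
    return problems
```

A payload that fails this kind of check can be bounced back to a human queue instead of flowing into a case-management system, which is the practical benefit of a fixed schema over freeform text.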

That 64.37% deserves more attention than it is getting. In most software contexts, 64% would be considered a strong result. In financial services compliance, it means Opus 4.7 fails roughly 36% of the benchmark's tasks, more than one in three, and the failures are likely to cluster in the highest-stakes cases, where the edge cases live. The interesting question for practitioners is not whether Claude is good enough to replace analysts. It clearly is not. The question is whether it is good enough to take a meaningful share of the analyst's workload while keeping humans in the loop for everything material. That is a different product: augmented intelligence, not autonomous intelligence.
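The arithmetic behind that failure rate is simple enough to spell out. The sketch below assumes, purely for illustration, that the benchmark score carries over unchanged to a firm's own task mix, which it may well not; the function name `expected_failures` and the 1,000-task batch are hypothetical.

```python
BENCHMARK_ACCURACY = 0.6437  # Opus 4.7 score reported in the article


def expected_failures(task_count: int, accuracy: float = BENCHMARK_ACCURACY) -> float:
    """Expected number of failed tasks if each task succeeds with probability `accuracy`."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return task_count * (1.0 - accuracy)


failure_rate = 1.0 - BENCHMARK_ACCURACY
print(f"failure rate: {failure_rate:.2%}")
print(f"expected failures per 1,000 tasks: {expected_failures(1000):.0f}")
```

Under that assumption, a batch of 1,000 tasks would produce roughly 356 that need to be caught by a reviewer, which is why the human-in-the-loop requirement is not optional.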

The concurrent $1.5 billion joint venture — $300 million each from Anthropic, Blackstone, and Hellman & Friedman, with participation from Goldman Sachs, Apollo, General Atlantic, GIC, Leonard Green, and Sequoia — is the bigger strategic signal, and it is separate from the agent templates announcement. But it belongs in the same picture: Anthropic is building the product architecture for real deployment in regulated industries, and Wall Street is funding it. JPMorgan CEO Jamie Dimon appeared at the May 5 announcement event alongside Anthropic's leadership. Dimon's presence is both a validation signal — JPMorgan's stamp of approval matters in financial services — and a reminder that Anthropic is building enterprise revenue ahead of a potential IPO.

The architecture is the more durable contribution. By separating skills (workflow knowledge), connectors (data access), and subagents (model calls), Anthropic has created something that looks like a deployable enterprise product rather than a research demo. The skills are markdown files that individual firms can customize. The connectors are governed integrations. The subagents are API calls with specialized prompts. Each piece can be audited, updated, and controlled independently — which is the pattern that makes AI agents actually deployable in regulated industries, where the compliance team needs to know exactly what the model did and why.
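The three-way separation described above can be sketched in a few lines. This is not Anthropic's implementation: every class and function name here (`Skill`, `Connector`, `Subagent`, `Agent`, `StaticConnector`, `fake_model`) is hypothetical, and the model call is a stub injected as a plain function so the example runs without any API access.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass(frozen=True)
class Skill:
    """Workflow knowledge: markdown text a firm can edit."""
    name: str
    markdown: str


class Connector(Protocol):
    """Data access: any object exposing fetch() can act as a connector."""
    def fetch(self, query: str) -> str: ...


@dataclass
class Subagent:
    """Model call: a focused system prompt plus an injected model function."""
    system_prompt: str
    call_model: Callable[[str, str], str]  # (system_prompt, user_input) -> reply


@dataclass
class Agent:
    skill: Skill
    connector: Connector
    subagent: Subagent
    audit_log: list[str] = field(default_factory=list)

    def run(self, query: str) -> str:
        # Each step is logged separately so a reviewer can see which piece did what.
        data = self.connector.fetch(query)
        self.audit_log.append(f"connector fetched {len(data)} chars for {query!r}")
        prompt = f"{self.skill.markdown}\n\nData:\n{data}"
        self.audit_log.append(f"skill applied: {self.skill.name}")
        reply = self.subagent.call_model(self.subagent.system_prompt, prompt)
        self.audit_log.append("subagent returned a reply")
        return reply


class StaticConnector:
    """Stand-in connector returning canned text instead of calling a real service."""
    def fetch(self, query: str) -> str:
        return f"ledger rows for {query}"


def fake_model(system_prompt: str, user_input: str) -> str:
    """Stub in place of a real model API call."""
    return f"[stub reply, {len(user_input)} chars of input]"


agent = Agent(
    skill=Skill(name="gl-reconciler", markdown="# Steps\n1. Match entries."),
    connector=StaticConnector(),
    subagent=Subagent(system_prompt="You reconcile ledgers.", call_model=fake_model),
)
reply = agent.run("Q1 cash account")
```

Because each component is a separate object, a firm could swap the skill text, replace the connector, or change the model function without touching the other two, and the audit log records one entry per component.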

The benchmark transparency is worth crediting. Anthropic did not have to publish the 64.37%. It could have released the templates without a benchmark number, or led with a different number. The fact that they published the honest score — and their own editorial described it correctly — is more useful than the usual benchmark theater. Practitioners can make deployment decisions with actual data rather than inferring capability from marketing language. Whether 64.37% is good enough for a specific use case is a question that depends on the use case; at least the number is on the table.

The Dimon joint appearance also surfaced Amodei's explicit warning about the vulnerability discovery window: Chinese frontier AI is "roughly six to 12 months" behind Anthropic's Mythos model, meaning the current period is a narrow window for organizations to fix what AI has found before adversarial AI finds it too. Mythos found nearly 300 vulnerabilities in Firefox alone vs. roughly 20 for prior-generation models. The total unpatched vulnerabilities across all software: "tens of thousands." The finance agent templates are being released into a world where the defensive use case — using Claude to find vulnerabilities before attackers do — may be as important as the productivity use case. That context does not appear in the finance agents announcement, but it is the environment in which financial services firms are now evaluating these tools.

For teams evaluating these templates: the GitHub repository (github.com/anthropic/financial-services) is worth studying as a reference architecture regardless of whether you deploy Anthropic's specific implementation. The separation of skills, connectors, and subagents is a design pattern that transfers. The JSON output format for the KYC screener is a template for how compliance outputs should be structured. The benchmark number is data to run your own evaluation against, not a verdict. The honest question is whether your organization can absorb a tool that gets roughly 64% of benchmark tasks right and needs a human reviewer to catch the remaining 36%, and whether that is better or worse than the alternative of not using AI for this work at all.
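Running your own evaluation can start very small. The sketch below scores any prediction function against a firm's own labelled cases; the function name `pass_rate`, the sample cases, and the always-"clear" stub are all hypothetical, and exact string match is an assumed scoring rule that a real evaluation would likely replace with something richer.

```python
from typing import Callable, Iterable


def pass_rate(cases: Iterable[tuple[str, str]], predict: Callable[[str], str]) -> float:
    """Fraction of (input, expected) cases where predict(input) equals expected."""
    results = [predict(text) == expected for text, expected in cases]
    if not results:
        raise ValueError("no cases supplied")
    return sum(results) / len(results)


# Hypothetical labelled cases: (case description, expected disposition).
sample_cases = [
    ("retail customer, full documents", "clear"),
    ("shell company, opaque ownership", "escalate-EDD"),
    ("small business, full documents", "clear"),
]


def always_clear(text: str) -> str:
    """Stub predictor standing in for a real agent call."""
    return "clear"


score = pass_rate(sample_cases, always_clear)
```

The stub illustrates the clustering concern: it passes two of the three cases while missing the only one that needed escalation, so a headline rate says little without a look at which cases failed.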

Anthropic published a benchmark number that tells you exactly where Claude fails. That is the starting point for a serious deployment conversation, not the end of one.

Sources: Anthropic, The Register, Vals AI Finance Agent Benchmark, GitHub: Anthropic Financial Services Agents
