The LGTM
Trajel Audits the Part of Agent Failures Your Final-Answer Checker Never Sees
ai-models

Final-answer evaluation is a terrible way to judge agents. It tells you whether the last sentence looked right, not whether the system took a safe path to get there. Trajel, a new dataset and evaluation framework for trajectory-level hallucinations, attacks exactly that blind spot by labeling failures inside the Thought-Action-Observation
27 May 2026 3 min read
AKBE Says Tool-Calling Agents Need to Know When Not to Search
ai-models

The most expensive tool call in an agent system is the one the model never needed to make. AKBE — Agentic Knowledge Boundary Enhancement — is a research method aimed at that very unglamorous, very real failure mode: agents that learn to search, browse, query, or call tools even when their own
27 May 2026 3 min read
MUSE-Autoskill Treats Agent Skills Like Packages That Need Tests, Memory, and Owners
ai-models

Agent skills are crossing the line from clever prompt folders into runtime dependencies. That is the useful way to read MUSE-Autoskill, a new research framework for self-evolving agent skills: not as another “agent gets better over time” demo, but as a warning label for what happens when reusable behavior starts
27 May 2026 3 min read
Cognition’s $1B Raise Turns Codex-vs-Copilot Into a Runtime-Economics Fight
codex

Cognition raising more than $1 billion is venture-capital theater. The useful signal is underneath it: coding-agent competition is moving away from “which model writes the prettiest function” and toward runtime economics — orchestration, cost control, model routing, enterprise integration, and measurable delivery outcomes. That is a much more uncomfortable fight for
27 May 2026 5 min read
codex

OpenAI’s Tax AI Case Study Shows Codex Is Becoming an Improvement Loop, Not Just a Coding Tool

OpenAI’s new Tax AI case study is easy to misread as another “AI automates paperwork” story. That is the boring version. The interesting version is that Codex is being used less like a code generator and more like a product-improvement loop: experts correct production outputs, traces preserve what happened,
27 May 2026 4 min read
claude-code

Claude Code’s Security Plugin Turns AI Code Review Into an In-Session Feedback Loop

Claude Code’s new security-guidance plugin is not interesting because it promises to make AI-written code safe. That would be the usual product-demo overclaim, and thankfully Anthropic’s own docs do not make it. The interesting part is more practical: Claude Code is starting to treat security review as something
27 May 2026 5 min read
Qwen Code v0.16.2 Turns Agent Governance Into the Stable Release
qwen

Qwen Code v0.16.2 is not trying to win the week with a bigger model card. Good. The useful story is smaller and more operational: Alibaba’s terminal agent is turning the ugly parts of agentic coding — local memory boundaries, runaway background work, shell-risk policy, credential leakage, context compaction,
27 May 2026 6 min read
openclaw

OpenClaw’s Tailscale No-Auth Fix Is Gateway Policy That Should Have Been Impossible to Misconfigure

OpenClaw’s Tailscale no-auth bug is the kind of security issue that should not be solved with better documentation. It should be made impossible to misconfigure. PR #87286 moves the project in that direction by rejecting the dangerous combination of gateway.auth.mode = "none" and Tailscale Serve or
27 May 2026 4 min read
OpenClaw 2026.5.26 Is Where the Runtime Starts Owning Its Operational Mess
openclaw

OpenClaw 2026.5.26 is not the sort of release that wins a demo. Good. Agent platforms have had enough demos. This stable rollup is interesting because it reads like a project realizing that always-on agents are not chat toys with extra buttons; they are distributed systems with transcripts, channel
27 May 2026 4 min read
openclaw

OpenClaw’s Dashboard Compact Button Can Turn Claude’s 1M Context Into Silent Amnesia

OpenClaw’s Dashboard compact button has found one of the worst possible ways to fail: it can make a user believe they are saving context while the runtime is actually preserving an empty apology note. Issue #87284 is not just another compaction bug. It is a reminder that context is
27 May 2026 4 min read
RTX PRO 4500 Blackwell Shows Edge AI Hardware Is About Workflow Latency, Not Spec Sheets
nvidia

The least interesting thing about NVIDIA’s RTX PRO 4500 Blackwell benchmarks is that a newer GPU beats an older one. That is not news; that is gravity. The useful story is what kind of workload NVIDIA chose to benchmark: genomics, variant calling, protein-structure inference, and Smith-Waterman alignment. This is
27 May 2026 4 min read
CompileIQ Turns the Compiler Into an Inference Optimization Surface
nvidia

CompileIQ is not exciting because NVIDIA has discovered that compilers matter. Anyone who has stared at PTXAS output at midnight already knew that. It is interesting because NVIDIA is turning compiler behavior into something closer to a reviewed, benchmarked, versioned production artifact — which is exactly where the last few percent
27 May 2026 4 min read