Grok Imagine Video Becomes a Real API Surface — Short Clips, Async Polling, and Costs Builders Can Model

Anatoliy Kolodkin

29 May 2026 • 5 min read

The useful version of AI video generation is not the viral clip. It is the boring API contract behind the clip: submit a job, poll for status, get a temporary asset, track cost, and decide what your product does when the model takes three minutes, fails moderation, or produces something the user immediately regenerates. xAI’s refreshed Grok Imagine Video docs are interesting because they move the feature from consumer toy territory into that developer surface.

The primary model is grok-imagine-video. The core REST flow is simple: POST https://api.x.ai/v1/videos/generations, receive a request_id, then poll GET https://api.x.ai/v1/videos/{request_id} until the job reports done, failed, or expired. Completed responses include a hosted video.url, the clip duration, a respect_moderation field, and the model used. xAI says generation is asynchronous and can take up to several minutes, depending on prompt complexity, duration, resolution, and whether the workflow is pure generation or editing.
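A minimal sketch of that submit-then-poll loop, in Python with only the standard library. The two endpoint URLs, request_id, and the done/failed/expired states come from the flow described above. Everything else is an assumption for illustration: the request body field names ("model", "prompt"), the "status" response key, the bearer-token header, and the polling interval and timeout values. Check the xAI docs before relying on any of them.

```python
import json
import time
import urllib.request

API_BASE = "https://api.x.ai/v1"
TERMINAL_STATUSES = {"done", "failed", "expired"}


def http_json(method, url, api_key, body=None):
    """Send a JSON request and decode the JSON response (stdlib only)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    # Assumption: bearer-token auth; confirm against the provider docs.
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def generate_video(prompt, api_key, *, transport=http_json, sleep=time.sleep,
                   poll_interval=5.0, max_wait=600.0):
    """Submit a generation job, then poll until a terminal status or timeout.

    `transport` and `sleep` are injectable so the loop can be exercised
    without network access or real waiting.
    """
    # Assumption: the body field names "model" and "prompt" are illustrative.
    job = transport("POST", f"{API_BASE}/videos/generations", api_key,
                    {"model": "grok-imagine-video", "prompt": prompt})
    request_id = job["request_id"]

    waited = 0.0
    while waited <= max_wait:
        result = transport("GET", f"{API_BASE}/videos/{request_id}", api_key)
        # Assumption: the job state is exposed under a "status" key.
        if result.get("status") in TERMINAL_STATUSES:
            return request_id, result
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"job {request_id} not finished after {max_wait}s")
```

The function returns on failed and expired as well as done, so the caller has to branch on the final status rather than assume a video exists. In a real product the request_id would be persisted right after the POST, so a crashed worker can resume polling instead of resubmitting.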

That is not glamorous. It is exactly what builders need. The app-store version of AI video is a button. The production version is a job system.

Short clips are a constraint, not a weakness

xAI’s current limits make the product legible. Text-to-video generation supports clips from 1 to 15 seconds. Resolution options are 480p by default and 720p for HD. Supported aspect ratios include 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, and 2:3, with 16:9 as the default. Video editing keeps the input clip’s duration, which is capped at 8.7 seconds, and outputs at the input resolution, up to a maximum of 720p.
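Because the limits are that explicit, they can be checked before a request is ever sent. The following Python sketch encodes the text-to-video numbers quoted above; the function name and the idea of returning a list of problems are illustrative choices, not part of xAI’s API.

```python
# Limits as quoted in this article; re-check them against the current docs.
MIN_SECONDS, MAX_SECONDS = 1, 15
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_ASPECT_RATIOS = {"1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3"}


def validate_generation_params(duration_s, resolution="480p", aspect_ratio="16:9"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not (MIN_SECONDS <= duration_s <= MAX_SECONDS):
        problems.append(
            f"duration {duration_s}s is outside {MIN_SECONDS}-{MAX_SECONDS}s")
    if resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution {resolution!r}")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio {aspect_ratio!r}")
    return problems
```

Rejecting a 20-second or 1080p request locally is cheaper than discovering the limit through a failed job, and the same table can drive the options shown in the UI.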

Those numbers say what the API is for: iteration, prototypes, short social creative, product mockups, onboarding snippets, game concept clips, internal design exploration, and programmatic media variants. This is not a long-form video production suite. That is fine. Most useful developer APIs start by being narrow enough to reason about. Fifteen seconds is enough to create a product shot, a character beat, a motion concept, or an ad variant. It is also short enough that teams can model latency, storage, moderation review, and unit economics without pretending they are running a studio pipeline.

The temporary URL behavior is another small detail with large product consequences. xAI-hosted output URLs are temporary, and the docs tell developers to access, download, or process them promptly if they need a durable copy. That means serious integrations need asset storage on their side. Do not hand the temporary URL to users and call the workflow done. Persist the request ID, prompt, model, duration, aspect ratio, resolution, status, final URL, download result, moderation flag, and error state. If the output becomes part of a customer project, move it into storage you control and attach provenance while you still can.
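One way to picture that record, as a Python sketch. The field list mirrors the paragraph above; the class name, the storage key layout, the .mp4 extension, and the fetch_bytes and store_bytes callables are all hypothetical stand-ins for whatever HTTP client and object store an application already uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VideoJobRecord:
    """One row per generation job (hypothetical schema)."""
    request_id: str
    prompt: str
    model: str
    duration_s: int
    aspect_ratio: str
    resolution: str
    status: str = "pending"
    temporary_url: Optional[str] = None    # provider-hosted, expires
    durable_key: Optional[str] = None      # location in storage we control
    download_ok: Optional[bool] = None
    moderation_flag: Optional[bool] = None
    error: Optional[str] = None


def archive_output(record: VideoJobRecord,
                   fetch_bytes: Callable[[str], bytes],
                   store_bytes: Callable[[str, bytes], str]) -> VideoJobRecord:
    """Copy the temporary asset into our own storage and update the record."""
    if record.status != "done" or not record.temporary_url:
        return record  # nothing to archive yet
    try:
        data = fetch_bytes(record.temporary_url)
        # Assumption: the asset is stored as .mp4 under a key we choose.
        record.durable_key = store_bytes(f"videos/{record.request_id}.mp4", data)
        record.download_ok = True
    except Exception as exc:
        # Keep the failure visible instead of losing the asset silently.
        record.download_ok = False
        record.error = f"download failed: {exc}"
    return record
```

Recording download_ok separately from status matters: a job can succeed at the provider and still be lost if the copy into durable storage fails before the temporary URL expires.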

Reference images turn generation into brand infrastructure

The most commercially interesting mode is reference-to-video. xAI lets developers provide one or more reference images to influence people, objects, clothing, or other visual elements in the generated clip. Unlike image-to-video, where the input image becomes the starting frame, reference images guide what appears in the video without locking the first frame. xAI’s own docs position the mode for virtual try-on, product placement, and character-consistent storytelling.

That is where the API stops being a novelty and starts touching real business process. Text prompts are enough for ideation. Reference images are how a team tries to preserve product identity, packaging, wardrobe, character continuity, or a campaign’s visual language. A retailer might generate short try-on clips. A game studio might test animation beats for a recurring character. A marketplace seller might create product-placement variants. A growth team might generate dozens of short ad concepts around the same object.

It also raises the governance bar. If users can upload or link reference images of people, products, brand assets, or private prototypes, the application needs rights management and consent rules. Public HTTPS URLs and base64 data URIs are convenient input formats, not permission models. Who was allowed to use that face? Was the product image licensed? Can a user generate a clip implying endorsement? How long do reference assets remain in your system? Can enterprise admins disable reference mode or require approval for certain asset classes? The API makes the workflow possible; the product still has to make it responsible.

Image-to-video is simpler but still operationally relevant. xAI accepts a public image URL or base64 data URI, and the Vercel AI SDK path accepts URL strings, base64 strings, Uint8Array, ArrayBuffer, or Buffer. The output defaults to the input image’s aspect ratio unless overridden; overriding stretches the image to the requested aspect ratio. That last clause is the sort of thing that creates support tickets if hidden. If a user uploads a square product image and the app silently forces 16:9, the result may look broken even if the API behaved correctly. Surface the tradeoff in the UI.
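A small Python sketch of how an app might detect that case before submitting. The function names, the 1% tolerance, and the warning wording are illustrative assumptions; the only behavior taken from the docs is that an override stretches the image to the requested ratio.

```python
from fractions import Fraction


def parse_ratio(text):
    """Turn a string such as '16:9' into an exact Fraction."""
    width, height = text.split(":")
    return Fraction(int(width), int(height))


def stretch_warning(image_width, image_height, requested_ratio=None,
                    tolerance=Fraction(1, 100)):
    """Return a warning string if an override would stretch the image, else None."""
    if requested_ratio is None:
        return None  # output follows the input image's own aspect ratio
    source = Fraction(image_width, image_height)
    target = parse_ratio(requested_ratio)
    # Treat ratios within the tolerance as equal (e.g. off-by-a-pixel crops).
    if abs(source / target - 1) <= tolerance:
        return None
    return (f"Image is {image_width}x{image_height}; forcing {requested_ratio} "
            f"will stretch it. Crop the image or keep its original ratio.")
```

For a 1000x1000 product shot with a 16:9 override, the function returns a warning the UI can show next to the aspect-ratio picker; for a 1920x1080 input it returns None.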

The price is simple enough to be dangerous

xAI’s pricing page lists grok-imagine-video at $0.050 per second. A ten-second clip is nominally $0.50. A maximum fifteen-second generation is $0.75. That sounds cheap until a product encourages endless regeneration, runs multiple variants per prompt, or lets a background job churn through prompt experiments. The same lesson from coding agents applies: AI systems get expensive when iteration is frictionless and failure is quiet.
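The arithmetic is easy to put in code, which is the point. This Python sketch uses the $0.050 per second figure quoted above; the variants and regenerations parameters are a hypothetical way of modeling how a product multiplies that base price, and the sketch assumes every attempt is billed for its full duration.

```python
from decimal import Decimal

# Per-second price as quoted in this article; verify on the pricing page.
PRICE_PER_SECOND = Decimal("0.050")
CENT = Decimal("0.01")


def estimate_cost(duration_s, variants=1, regenerations=0):
    """Estimated spend in USD for one user action.

    variants: clips generated per prompt.
    regenerations: how many extra times the user re-rolls the whole set.
    Assumes each attempt is billed for its full duration.
    """
    attempts = variants * (1 + regenerations)
    return (PRICE_PER_SECOND * duration_s * attempts).quantize(CENT)
```

A single ten-second clip comes out at 0.50 and a fifteen-second clip at 0.75, matching the figures above. Ten fifteen-second variations regenerated twice is thirty attempts, or 22.50, for what the user experiences as one prompt.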

Batch API support does not remove that concern. xAI says image and video generation are supported in Batch API but billed at standard rates, unlike text and language batch requests that can receive 20% to 50% discounts. The pricing page also says requests deemed usage-guideline violations may still be charged for generation, and violations caught before generation in the Responses API can incur a $0.05 fee. Whether those exact charges apply to every video path is a billing-policy question teams should verify, but the product lesson is already clear: failed or rejected attempts can still have economic weight.

Builders should put video behind explicit quotas from day one. Estimate cost before submission. Show duration and resolution as cost-bearing choices. Rate-limit regenerate buttons. Use queues rather than synchronous spinners. Add per-user, per-project, and per-organization budgets. Log failed, expired, and moderated jobs separately from successful jobs. If your product has a “make ten variations” button, that is not one feature; it is a bill multiplier wearing a nice label.
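As a rough illustration of layered budgets, here is an in-memory Python sketch. The class, the scope naming scheme ("user:42", "org:1"), and the all-or-nothing reservation rule are assumptions for the example; a production version would live in a database with proper locking.

```python
from decimal import Decimal


class BudgetExceeded(Exception):
    """Raised when a reservation would push a scope past its limit."""


class BudgetLedger:
    """Tracks spend per scope, e.g. 'user:42', 'project:7', 'org:1'."""

    def __init__(self, limits):
        self.limits = {scope: Decimal(limit) for scope, limit in limits.items()}
        self.spent = {scope: Decimal("0") for scope in limits}

    def reserve(self, scopes, amount):
        """Reserve `amount` against every scope, or against none of them."""
        amount = Decimal(amount)
        # First pass: check all scopes so a late failure leaves no partial spend.
        for scope in scopes:
            if self.spent[scope] + amount > self.limits[scope]:
                raise BudgetExceeded(scope)
        # Second pass: commit.
        for scope in scopes:
            self.spent[scope] += amount

    def remaining(self, scope):
        return self.limits[scope] - self.spent[scope]
```

Calling reserve with the estimated cost before submission means the "make ten variations" button fails fast with a clear message when any of the user, project, or organization budgets would be exceeded, instead of showing up later as an invoice.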

The Vercel AI SDK support is a smart adoption move. JavaScript teams can call experimental_generateVideo with xai.video("grok-imagine-video") and retrieve the xAI video URL through provider metadata. That makes experimentation easier for teams already using the AI SDK for text or image workflows. But provider-specific behavior still leaks through: polling timeouts, polling intervals, resolution, request modes, temporary URLs, and xAI-specific metadata all matter. Abstractions are useful at the edge. Operational records should preserve the raw provider details.

The correct mental model is not “call a function, get a video.” It is “enqueue a media generation job.” Persist the job. Poll intentionally. Handle timeout and cancellation. Store inputs and output metadata. Download assets that matter. Track prompt, reference asset IDs, model, duration, aspect ratio, resolution, estimated cost, actual status, and moderation result. Build a UX that survives several minutes of waiting without encouraging users to mash the button and create duplicate spend.
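The duplicate-spend problem in particular has a cheap guard: fingerprint the request and refuse to submit an identical one while the first is still in flight. The Python sketch below is one hypothetical way to do that; the fingerprint fields, the class name, and the in-memory dictionary are illustrative, and nothing here is an xAI feature.

```python
import hashlib
import json


def job_fingerprint(user_id, prompt, params):
    """Stable hash of who asked for what, used to spot repeat submissions."""
    payload = json.dumps(
        {"user": user_id, "prompt": prompt.strip(), "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class InFlightJobs:
    """Remembers jobs that have been submitted but not yet finished."""

    def __init__(self):
        self._by_fingerprint = {}

    def submit_once(self, fingerprint, submit):
        """Call submit() only if no identical job is in flight.

        Returns (request_id, created), where created is False when an
        existing job was reused instead of starting a new one.
        """
        if fingerprint in self._by_fingerprint:
            return self._by_fingerprint[fingerprint], False
        request_id = submit()
        self._by_fingerprint[fingerprint] = request_id
        return request_id, True

    def finish(self, fingerprint):
        """Forget a job once it reaches a terminal status."""
        self._by_fingerprint.pop(fingerprint, None)
```

A second click on the generate button with the same prompt and settings then returns the existing request ID, so the UI can attach to the running job rather than paying for a twin. Deliberate regeneration is still possible once finish has been called for the first job.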

Grok Imagine Video is useful precisely because the docs are constrained. Short clips. Explicit polling. Temporary URLs. Common aspect ratios. 480p and 720p. A price per second that product teams can calculate before the invoice arrives. The flashy part is the generated media; the valuable part is that xAI is turning it into a programmable primitive. That is how a demo becomes a feature teams can ship.

Sources: xAI Docs — Video Generation, Image-to-Video, Reference-to-Video, xAI Pricing, Vercel AI SDK video generation docs
