
Gemini File Search Now Does Multimodal RAG — and Google Is Betting You'll Stop Building Your Own Pipeline

Anatoliy Kolodkin

06 May 2026 • 5 min read

There is a version of this story that is just a product launch: Google added multimodal retrieval to the Gemini API's File Search tool, which now handles images and text in the same index, with custom metadata filtering and page-level citations. Those are real features. They are also not the most interesting thing about what Google announced on May 5.

The more interesting thing is the pricing model. Google is offering File Search with a free storage tier and free query embeddings: you pay only for the initial indexing embeddings and for standard Gemini input and output tokens. Storage within the free tier costs nothing. Query embeddings at retrieval time cost nothing. What remains is the indexing charge, paid once, plus the tokens the model reads and writes. For teams that have been running their own vector databases (Pinecone, Weaviate, Qdrant, whatever your preferred flavor), that pricing structure is a deliberate provocation. You are paying for infrastructure you have to operate, scale, and maintain. Google's pitch is that you can replace that operational burden with an API call, and the math is designed to make the trade look obviously attractive for most teams.
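To make that cost structure concrete, here is a minimal sketch of the arithmetic: a one-time indexing charge plus per-query token costs, with storage and query embeddings contributing nothing. Every name here is hypothetical and the prices are placeholders you would fill in from the current rate card, not Google's actual rates.

```python
from dataclasses import dataclass

TOKENS_PER_MILLION = 1_000_000


@dataclass(frozen=True)
class Prices:
    """Hypothetical USD prices per one million tokens (placeholders, not real rates)."""
    index_embedding: float
    model_input: float
    model_output: float


def one_time_indexing_cost(corpus_tokens: int, prices: Prices) -> float:
    """Cost of embedding the corpus at indexing time."""
    return corpus_tokens / TOKENS_PER_MILLION * prices.index_embedding


def monthly_query_cost(
    queries: int,
    input_tokens_per_query: int,
    output_tokens_per_query: int,
    prices: Prices,
) -> float:
    """Token cost of answering queries; query embeddings are assumed to be free."""
    input_cost = queries * input_tokens_per_query / TOKENS_PER_MILLION * prices.model_input
    output_cost = queries * output_tokens_per_query / TOKENS_PER_MILLION * prices.model_output
    return input_cost + output_cost


# Placeholder prices chosen only to keep the arithmetic easy to follow.
PLACEHOLDER = Prices(index_embedding=0.10, model_input=1.00, model_output=4.00)

if __name__ == "__main__":
    indexing = one_time_indexing_cost(50_000_000, PLACEHOLDER)
    monthly = monthly_query_cost(10_000, 2_000, 500, PLACEHOLDER)
    print(f"one-time indexing: ${indexing:.2f}, monthly queries: ${monthly:.2f}")
```

The number to compare against is what your self-hosted stack costs per month, including the engineering time to run it, and that figure is not something this sketch can supply.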

The citation feature is the most underrated part of this announcement, and it is not close. RAG systems often fail in production not because retrieval is wrong, but because users do not trust it. They cannot see why the model cited a passage, they cannot verify the source quickly, and they end up doing the original lookup manually anyway. Page-level citations that include the actual page number and a downloadable reference to the source material change that experience fundamentally. For compliance workflows, legal research, medical literature review, or any domain where accuracy is both checkable and consequential, citation granularity is not a nice-to-have. It is the difference between a system people trust and a system whose every answer they re-check by hand. The announcement frames citations as a trust-building feature, which is correct, but it undersells how much they change the product contract with end users.

The metadata filtering is the second underrated feature. The announcement frames it as a quality improvement that reduces noise from irrelevant documents by scoping retrieval to specific data slices, but it can also serve as an access control mechanism. If you index your data with metadata fields like department: Legal or confidential: true, you can scope query-time retrieval to the slice of the index a given user is allowed to see. That is row-level security applied to vector retrieval, without separate indexes per permission level, provided the filter is set by your own backend from the authenticated user's permissions and never taken from client input. For enterprise RAG deployments where different users should see different document subsets based on their role or clearance, this is a significant operational simplification. You no longer need to partition your vector store by permission level and route queries to the right partition: you can index everything once and filter at query time. Whether that tradeoff is right for your latency and scale requirements is a question you will have to answer with benchmarks. But the option now exists as a first-class API feature rather than an architectural workaround.
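A minimal sketch of that pattern follows: the backend derives a filter string from the authenticated user's permissions and attaches it to every query. The User type, the field names, and the filter syntax (key = "value" joined with AND and OR) are all assumptions for illustration; check the File Search documentation for the exact grammar before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    """Hypothetical user record as resolved by your own auth layer."""
    name: str
    departments: tuple[str, ...]
    can_view_confidential: bool


def _quote(value: str) -> str:
    """Quote a metadata value, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_metadata_filter(user: User) -> str:
    """Build a filter string limiting retrieval to documents the user may see.

    Assumes every document was indexed with a `department` field and a
    string-valued `confidential` field ("true" or "false").
    """
    if not user.departments:
        # Fail closed: a user with no departments must not query unfiltered.
        raise PermissionError(f"{user.name} has no readable departments")
    dept_clause = " OR ".join(
        f"department = {_quote(d)}" for d in sorted(user.departments)
    )
    clauses = [f"({dept_clause})"]
    if not user.can_view_confidential:
        clauses.append('confidential = "false"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    paralegal = User("dana", ("Legal",), can_view_confidential=False)
    print(build_metadata_filter(paralegal))
```

The important design choice is that the function fails closed: if permissions cannot be resolved, no query runs at all, rather than one that silently searches the whole index.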

The multimodal support is the headline, powered by Gemini Embedding 2, the same model Google put into general availability on April 22. To judge how much weight the multimodal claims can bear, it helps to look at the benchmarks Google published alongside Embedding 2's GA. Gemini Embedding 2 posted 69.9 on the multilingual MTEB mean task score and 84.0 on the MTEB code mean task score. Those are not cherry-picked single-task numbers: MTEB is a broad evaluation suite covering dozens of tasks across multiple languages and domains. A score of 84 on the code suite suggests the model represents code well enough to match queries against implementation files, test cases, documentation, and bug reports across a codebase. A score of 69.9 on the multilingual mean suggests that capability holds across a wide range of languages, not just English. Both scores measure text and code rather than images, so they do not validate image retrieval directly. What they do show is that the embedding model underneath the feature is competitive on public benchmarks, which makes "we process images natively" more than a bare product promise for teams making architectural decisions about which embedding provider to bet on.

One concrete example from the announcement illustrates what native multimodal retrieval actually means in practice: searching an image archive using natural language descriptions of visual tone or style, not filenames or keywords. You can ask "show me photos with a dark, moody atmosphere" and the model retrieves based on visual content understanding, not metadata. You can ask "find screenshots that show a loading error state" and get results based on what is actually on the screen. That is meaningfully different from a pipeline that captions images and then searches the captions as text, because captioning is a lossy operation: whatever visual detail the captioner does not put into words is gone before the index is built. Native multimodal embedding preserves more of the original signal in the retrieval index, which means the search operates closer to the raw visual content than to a model's summary of it.
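The mechanics behind that kind of search can be sketched in a few lines: the text query and the images are embedded into the same vector space, and retrieval is a nearest-neighbor ranking by cosine similarity. The three-dimensional vectors below are hand-written stand-ins, not output from Gemini Embedding 2, and the file names are hypothetical; a real index would hold model-produced vectors with far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(
    query_vec: list[float],
    image_index: dict[str, list[float]],
    top_k: int = 2,
) -> list[str]:
    """Return the names of the top_k images most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in image_index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy vectors: the axes loosely stand for "dark/moody", "bright", "error UI".
IMAGE_INDEX = {
    "alley_at_night.jpg": [0.9, 0.1, 0.0],
    "beach_noon.jpg": [0.1, 0.9, 0.0],
    "spinner_error.png": [0.0, 0.1, 0.9],
}
MOODY_QUERY = [0.8, 0.2, 0.1]   # stand-in for "dark, moody atmosphere"
ERROR_QUERY = [0.0, 0.2, 0.9]   # stand-in for "loading error state"

if __name__ == "__main__":
    print(rank_images(MOODY_QUERY, IMAGE_INDEX, top_k=1))
    print(rank_images(ERROR_QUERY, IMAGE_INDEX, top_k=1))
```

No caption is involved at any point. In a caption-based pipeline the image vectors would be replaced by embeddings of generated text, which is exactly where the information loss occurs.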

The vendor lock-in concern is real and the announcement does not address it. When you index your data in Google's managed File Search, you are storing it in Google's infrastructure. The announcement is silent on data residency, deletion guarantees, and export paths. For teams in legal, healthcare, or government sectors where data sovereignty requirements are real constraints, that silence is not reassuring. The managed service pitch works when the operational savings are large enough that the lock-in risk is acceptable. It is worth making that trade explicit in your architectural review before migrating production data to any managed retrieval service.

The comparison to building your own vector pipeline is where Google's positioning is sharpest. "No vector databases to provision or embedding pipeline to maintain" is a line from the announcement, and it is direct about what Google is arguing. The counterargument is not that managed services are always worse. It is that self-hosting gives you control over latency, data residency, and which embedding provider you use, including the freedom to switch and re-embed on your own schedule. The real answer for any given team depends on three things: how much operational overhead your current vector DB setup is actually costing you, how sensitive your use case is to the lock-in risk, and whether Google's API pricing at your expected query volume works out cheaper than running your own infrastructure. The announcement gives you Google's side of the third. The first two are still your problem.

The timing of this announcement, one day after the webhooks launch, suggests Google is running a developer tooling two-pack: infrastructure for async agentic workflows on one day, managed retrieval for grounded agent responses the next. Whether that is deliberate sequencing or coincidence, it tells you something about where Google is investing: not only in frontier model capabilities, but in the operational infrastructure that makes agentic systems reliable enough to run in production. That is a coherent product strategy. It is also the kind of investment that matters more to builders than another benchmark leaderboard position.

Sources: Google Developer Blog, Gemini API File Search documentation, DEV Community Guide
