OmniRetrieval Says RAG Should Stop Flattening Every Data Source Into the Same Vector Soup

Most RAG stacks are built around a convenient fiction: every useful source of knowledge can be flattened into chunks, embedded, and ranked by semantic similarity.

That works often enough to become a default, and badly enough to become a recurring production tax. Real organizations do not keep knowledge in one tidy corpus. They have documents, relational databases, RDF graphs, property graphs, logs, tickets, dashboards, schemas, APIs, and a decade of “temporary” systems that became load-bearing. OmniRetrieval is interesting because it pushes against the vector-soup reflex. Instead of forcing every source into one embedding interface, it routes a natural-language query to candidate knowledge bases, formulates native queries for each backend, executes them, and selects evidence across the returned results.
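The four stages can be sketched in a few lines of Python. This is a minimal illustration, not OmniRetrieval's implementation: the `Backend` and `Evidence` types, the keyword-overlap router, and the keep-non-empty-results selection step are all hypothetical stand-ins for what would be model calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Backend:
    """Hypothetical wrapper around one knowledge base."""
    name: str
    kind: str                         # e.g. "relational", "rdf", "graph", "text"
    description: str                  # what the router sees
    formulate: Callable[[str], str]   # question -> native query (SQL, SPARQL, ...)
    execute: Callable[[str], list]    # native query -> result rows


@dataclass
class Evidence:
    source: str
    native_query: str
    rows: list


def keyword_score(question: str, backend: Backend) -> int:
    # Toy router: count words shared by the question and the description.
    return len(set(question.lower().split()) & set(backend.description.split()))


def retrieve(question, backends, score=keyword_score, top_k=2):
    # 1. Route: rank candidate knowledge bases and keep the top_k.
    candidates = sorted(backends, key=lambda b: score(question, b), reverse=True)[:top_k]
    evidence = []
    for backend in candidates:
        # 2. Formulate a native query for this backend.
        native_query = backend.formulate(question)
        # 3. Execute it.
        rows = backend.execute(native_query)
        # 4. Select evidence (assumption: keep any non-empty result).
        if rows:
            evidence.append(Evidence(backend.name, native_query, rows))
    return evidence


tickets = Backend(
    name="tickets_sql", kind="relational",
    description="customer tickets severity plan",
    formulate=lambda q: "SELECT customer_id FROM tickets WHERE severity = 1",
    execute=lambda query: [("c1",), ("c2",)],
)
policies = Backend(
    name="policy_docs", kind="text",
    description="policy handbook refund language",
    formulate=lambda q: q.lower(),
    execute=lambda query: [],
)

demo_evidence = retrieve("Which customer opened severity one tickets", [policies, tickets])
```

Both backends are queried here because `top_k` is 2, but only the SQL backend returns rows, so only it contributes evidence. The point of the shape is that each stage leaves an inspectable artifact: the ranked candidates, the native query, and the rows.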

The benchmark is not small. OmniRetrieval spans 13 datasets and 309 distinct knowledge bases across four backend types: unstructured text corpora, relational databases, RDF knowledge graphs, and labeled property graphs. The dataset families include BEIR corpora such as NFCorpus, SciFact, FiQA, MS MARCO, FEVER, NQ, and HotpotQA; SQL datasets Spider and BIRD; RDF datasets LC-QuAD 2.0, QALD-10, and SimpleQuestions; and Text2Cypher. Models tested as orchestration backbones include GPT-5.4, Gemini-3.1 Pro, Sonnet-4.6, Qwen-3.5 27B, and Gemma-4 31B, with open models served locally through vLLM.

The headline numbers are modest but meaningful. Average source-selection accuracy rises from 61.65 for KB Routing to 65.71 for OmniRetrieval, with Oracle at 100. Average LLM-as-a-Judge answer correctness rises from 57.99 to 65.88, while Oracle reaches 74.55. The paper notes that the gap to Oracle narrows as the pipeline moves from source selection to judged answer quality (roughly 34 points at source selection, 17.51 at retrieval, and 8.67 at final judgment), which suggests evidence selection can recover when routing is imperfect.

Structure is not noise

The useful critique here is that embeddings are not a universal solvent. SQL is not just text with commas. It encodes joins, aggregation, constraints, and typed columns. SPARQL carries graph semantics and ontology relationships. Cypher traverses property graphs. A document corpus supports passage retrieval. These sources have different affordances because they answer different kinds of questions. Flattening all of them into chunks can erase the exact structure that makes the source valuable.

If the question is “which customers opened more than three severity-one tickets after upgrading to plan X,” semantic similarity is the wrong primitive. You want schema understanding, filters, joins, and aggregation. If the question asks for relationships across entities in a knowledge graph, you want traversal. If the question asks for policy language, text retrieval may be correct. A serious retrieval system should decide which interface matches the question, not pretend every backend is a paragraph waiting to be embedded.
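To make the ticket example concrete, here is a small sketch using Python's built-in `sqlite3`. The `upgrades` and `tickets` tables, their columns, and the sample rows are invented for illustration, and the query assumes one upgrade row per customer with ISO-formatted dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE upgrades (customer_id TEXT, plan TEXT, upgraded_at TEXT);
CREATE TABLE tickets  (customer_id TEXT, severity INTEGER, opened_at TEXT);
""")

# Hypothetical data: acme and globex upgraded to plan X, initech to plan Y.
conn.executemany("INSERT INTO upgrades VALUES (?, ?, ?)", [
    ("acme", "X", "2026-01-10"),
    ("globex", "X", "2026-01-10"),
    ("initech", "Y", "2026-01-01"),
])
sev1 = (
    [("acme", 1, f"2026-01-{d:02d}") for d in (5, 11, 12, 13, 14)]   # 4 after upgrade
    + [("globex", 1, f"2026-01-{d:02d}") for d in (11, 12, 13)]      # only 3 after
    + [("initech", 1, f"2026-01-{d:02d}") for d in (2, 3, 4, 5, 6)]  # wrong plan
)
conn.executemany("INSERT INTO tickets VALUES (?, ?, ?)", sev1)
conn.execute("INSERT INTO tickets VALUES ('acme', 2, '2026-01-15')")  # not severity one

# Join, filter, and aggregate: the three things similarity search cannot express.
QUERY = """
SELECT t.customer_id, COUNT(*) AS sev1_tickets
FROM tickets AS t
JOIN upgrades AS u ON u.customer_id = t.customer_id
WHERE u.plan = ?
  AND t.severity = 1
  AND t.opened_at > u.upgraded_at
GROUP BY t.customer_id
HAVING COUNT(*) > 3
ORDER BY t.customer_id
"""
rows = conn.execute(QUERY, ("X",)).fetchall()
print(rows)  # [('acme', 4)]
```

An embedding index over ticket text could surface tickets that mention upgrades, but it has no way to count them per customer or compare timestamps against an upgrade date.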

This is where OmniRetrieval fits the agent conversation. Retrieval is tool use. The model is not merely answering; it is selecting sources, writing SQL/SPARQL/Cypher or search queries, executing them, interpreting results, and consolidating evidence. That means the intermediate contracts matter. A final answer score alone cannot tell you whether the system chose the right source, wrote a safe query, used stale schema information, or got lucky with a semantically equivalent source.

The enterprise version is governance-heavy

The anti-vector-soup stance is right, but it is not free. Native retrieval increases operational surface area. SQL queries need permission boundaries, row-level security, execution limits, and audit logs. Graph queries can be expensive or reveal relationships the user should not see. Schema prompts can leak sensitive structure. Generated queries can be wrong in ways that a top-k chunk retrieval never is. A vector index can be sloppy and merely disappointing; a generated query against a live database can be dangerous.
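One way to enforce a read-only boundary and a result cap at the database layer, rather than trusting the generated query, is sketched below with Python's `sqlite3` authorizer hook. The table, the allow-list, and the `run_guarded` helper are illustrative assumptions; a production system would more likely rely on database roles, read-only replicas, or row-level security in the actual backend.

```python
import sqlite3

# Only plain reads are allowed. Everything else (INSERT, DELETE, DROP,
# PRAGMA, even SQL functions) is denied in this deliberately strict sketch.
READ_ONLY_ACTIONS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}


def read_only_authorizer(action, arg1, arg2, db_name, trigger_name):
    if action in READ_ONLY_ACTIONS:
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


def guarded_connection(setup_sql: str) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)              # trusted setup runs first
    conn.set_authorizer(read_only_authorizer)  # then lock the connection down
    return conn


def run_guarded(conn, sql, params=(), max_rows=100):
    # Cap the number of rows handed back to the model.
    return conn.execute(sql, params).fetchmany(max_rows)


SETUP = """
CREATE TABLE tickets (customer_id TEXT, severity INTEGER);
INSERT INTO tickets VALUES ('acme', 1), ('globex', 1), ('initech', 1);
"""
guarded = guarded_connection(SETUP)

allowed_rows = run_guarded(
    guarded,
    "SELECT customer_id FROM tickets WHERE severity = 1 ORDER BY customer_id",
    max_rows=2,
)

try:
    run_guarded(guarded, "DELETE FROM tickets")
    delete_blocked = False
except sqlite3.DatabaseError:
    delete_blocked = True
```

The guard sits below the model, so a badly generated statement fails at preparation time instead of depending on prompt instructions to behave.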

That does not argue against native retrieval. It argues for treating it like production infrastructure. Every generated query should be logged with source, schema version, user context, policy state, execution time, result size, and final evidence usage. Dangerous query classes should be blocked or routed through read-only views. High-cost traversals should have budgets. If a model is allowed to query multiple backends, the runtime should record which sources were considered, rejected, executed, and cited. “The model said so” is not a lineage system.
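A lineage record along those lines might look like the following sketch. The field names mirror the list above, but the `QueryAuditRecord` and `RetrievalTrace` types and the `audited_execute` helper are hypothetical, not part of OmniRetrieval.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class QueryAuditRecord:
    source: str
    native_query: str
    schema_version: str
    user_id: str
    policy_state: str
    execution_ms: float
    result_size: int
    used_as_evidence: bool = False   # set later, after evidence selection


@dataclass
class RetrievalTrace:
    question: str
    considered: list = field(default_factory=list)
    rejected: list = field(default_factory=list)
    executed: list = field(default_factory=list)   # QueryAuditRecord entries
    cited: list = field(default_factory=list)


def audited_execute(trace, source, native_query, execute, *,
                    schema_version, user_id, policy_state):
    """Run one native query and append an audit record to the trace."""
    start = time.perf_counter()
    rows = execute(native_query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    record = QueryAuditRecord(
        source=source, native_query=native_query,
        schema_version=schema_version, user_id=user_id,
        policy_state=policy_state, execution_ms=elapsed_ms,
        result_size=len(rows),
    )
    trace.executed.append(record)
    return rows, record


trace = RetrievalTrace(
    question="Which customers opened severity-one tickets?",
    considered=["tickets_sql", "policy_docs"],
    rejected=["policy_docs"],
)
rows, record = audited_execute(
    trace, "tickets_sql", "SELECT customer_id FROM tickets",
    lambda query: [("acme",), ("globex",)],   # stand-in for a real backend call
    schema_version="v7", user_id="u-42", policy_state="read_only",
)
record.used_as_evidence = True
trace.cited.append("tickets_sql")

log_line = json.dumps(asdict(trace))   # one JSON line per question
```

Because the trace records considered and rejected sources as well as executed ones, a wrong answer can be traced to a routing mistake, a bad query, or a bad selection step rather than to "the model".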

Practitioners should also resist the opposite overcorrection: not every source deserves native access. Some knowledge really is document-shaped. Some tables are better exposed through curated views. Some graphs are too messy for direct agent queries. The right architecture classifies sources by structure, sensitivity, query semantics, and operational risk. Use embeddings where similarity search is the right tool. Use native queries where structure is the point.
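That classification can start as a plain decision function. The sketch below is one possible policy under invented criteria; the `SourceProfile` fields and the three access modes are assumptions chosen to mirror the paragraph above, and a real deployment would tune them per organization.

```python
from dataclasses import dataclass
from enum import Enum


class AccessMode(Enum):
    EMBEDDING_SEARCH = "embedding_search"
    CURATED_VIEW = "curated_view"
    NATIVE_QUERY = "native_query"


@dataclass(frozen=True)
class SourceProfile:
    name: str
    structured: bool                # has a schema or ontology
    needs_structured_queries: bool  # filters, joins, aggregation, traversal
    sensitive: bool                 # contains data with access restrictions
    clean_schema: bool              # safe for a model to query directly


def choose_access_mode(profile: SourceProfile) -> AccessMode:
    # Document-shaped knowledge: similarity search is the right tool.
    if not profile.structured or not profile.needs_structured_queries:
        return AccessMode.EMBEDDING_SEARCH
    # Structure matters, but direct access is too risky or too messy.
    if profile.sensitive or not profile.clean_schema:
        return AccessMode.CURATED_VIEW
    # Structure is the point and the source is safe to query natively.
    return AccessMode.NATIVE_QUERY


handbook = SourceProfile("policy_handbook", False, False, False, False)
billing = SourceProfile("billing_db", True, True, True, True)
catalog = SourceProfile("product_catalog", True, True, False, True)

modes = {p.name: choose_access_mode(p).value for p in (handbook, billing, catalog)}
```

In this toy policy the handbook stays on embeddings, the billing database is exposed only through curated views, and the product catalog gets native queries.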

Local backbones make this more than a closed-model demo

OmniRetrieval’s inclusion of Qwen-3.5 27B and Gemma-4 31B served locally through vLLM is a small but important detail. Retrieval orchestration may become one of the places where local or BYOK models are useful even when the final answer uses a stronger closed model. Source routing, schema inspection, query drafting, and evidence selection are repeatable internal operations. Some teams will want those steps close to their data plane for cost, latency, privacy, or compliance reasons.

That does not mean open models automatically win. It means retrieval architecture should be decomposed enough to route different substeps to different models. A local model might classify the source and draft a safe query template; a stronger model might handle ambiguous synthesis; a deterministic checker might validate SQL shape; a policy engine might redact results before answer generation. The more heterogeneous the knowledge layer, the less credible the one-model, one-vector-index story becomes.
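A decomposed pipeline of that kind could be wired up roughly as follows. Everything here is a hypothetical sketch: the handler names, the stubbed model calls, and the regex-based `check_sql_shape`, which is a crude shape check rather than a SQL parser and will misjudge queries whose string literals contain semicolons or blocked keywords.

```python
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b", re.IGNORECASE
)


def check_sql_shape(sql: str):
    """Deterministic pre-execution check: one read-only statement."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False, "multiple statements"
    if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
        return False, "not a read query"
    if FORBIDDEN.search(statement):
        return False, "forbidden keyword"
    return True, "ok"


def answer(question: str, handlers: dict) -> dict:
    # Each substep can be backed by a different model or service.
    draft = handlers["draft_query"](question)          # e.g. a local model
    ok, reason = check_sql_shape(draft)                # deterministic checker
    if not ok:
        return {"status": "rejected", "reason": reason, "query": draft}
    rows = handlers["execute"](draft)                  # the database itself
    safe_rows = handlers["redact"](rows)               # policy engine
    text = handlers["synthesize"](question, safe_rows) # e.g. a stronger model
    return {"status": "ok", "answer": text, "query": draft}


executed = []


def fake_execute(sql):
    executed.append(sql)
    return [{"customer": "acme", "email": "ops@acme.example"}]


handlers = {
    "draft_query": lambda q: "SELECT customer, email FROM tickets LIMIT 10;",
    "execute": fake_execute,
    "redact": lambda rows: [{k: v for k, v in r.items() if k != "email"} for r in rows],
    "synthesize": lambda q, rows: f"{len(rows)} matching customer(s): "
                                  + ", ".join(r["customer"] for r in rows),
}

ok_result = answer("Who opened tickets?", handlers)
bad_result = answer(
    "Who opened tickets?",
    {**handlers, "draft_query": lambda q: "DROP TABLE tickets"},
)
```

Swapping the model behind any one handler does not change the others, and the rejected draft never reaches the database.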

When this piece was researched, the project was early but tangible: MIT licensed, created on 2026-05-27, last pushed on 2026-05-28, updated on 2026-05-29, and sitting at 14 GitHub stars. There was no Hacker News discussion, while Hugging Face Daily Papers showed 53 votes, the strongest public signal among the arXiv candidates reviewed in the same batch. That tracks. Retrieval infrastructure is not mass-market spectacle. It is the thing teams notice when their “enterprise brain” cannot answer a question because the answer lives across a table, a graph, and a PDF.

The editorial take: unified retrieval should mean one user experience over many native backends, not flattening every useful structure until the model can no longer exploit it. Vector search is a good tool. It is not a data strategy.

Sources: arXiv, OmniRetrieval GitHub repository, BEIR benchmark, Spider text-to-SQL benchmark.