The Lakehouse began as a revolt.
Traditional data warehouses gave us speed—but only through rigidity: fixed schemas, costly hardware, and inflexible scale-up pricing. Data lakes countered with openness and cheap, limitless storage—at the expense of performance and manageability. For a while, we had to choose between control and chaos.
Then open table formats (Delta, Iceberg, and Hudi) offered a third path: Lakehouse architectures that could run warehouse-grade analytics on top of object storage. This was a turning point. Analysts gained schema evolution without ceremony, engineers stopped tuning clusters, and every data engine from Spark to DuckDB could plug into the same lake. For a moment, it felt like the future had arrived.
But the real test was never onboarding. It was staying fast at scale.
At Qbeast, we’ve worked with dozens of teams scaling Lakehouses in the real world. They all hit the same pattern: flexibility starts strong, then quietly erodes performance. Ingestion pipelines create thousands of tiny, unsorted files. Statistics go stale. Filters that once skipped 90% of files now scan almost every file. Clustering jobs fall behind; compactions overload compute; partitioning strategies become unmanageable across high-cardinality keys.
Lakehouses today ingest data with no concern for query patterns. As long as the schema is valid and the file lands, the write is a success. But this “append first, optimize later” model drifts over time, making every filter, join, and aggregate more expensive.
Most teams resort to brute force: hourly Z-order clustering, hot/cold table pairs, custom UDFs, and materialized views. These techniques can work for a while, but they introduce complexity, duplicate storage, and still can’t keep up with fast-changing workloads.
It’s no surprise that the Lakehouse is taking over from traditional warehouses as the default choice, built around its key advantage: flexibility through brutal simplicity.
You can write anything and read it with anything, but that power comes at a cost.
At Qbeast, we believe that data layout is a first-class infrastructure concern. And that layout should adapt automatically—not through batch jobs or expensive rewrites, but intelligently and dynamically.
Inspired by Cubist art, we don’t see data as a linear stream to be sorted once and scanned forever. We see it as a multidimensional canvas: timestamp, user, region, event type—all forming the axes of a space to be navigated, not flattened.
Our multi-dimensional indexing system lays out data as a spatial tree. Hot regions split recursively. Cold ones stay broad. Each new record lands in the right “cube”, incrementally preserving locality. The result? Queries only touch the slices of the Lakehouse that matter.
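To make the idea concrete, here is a toy sketch in Python, purely illustrative rather than Qbeast’s actual implementation: a record is assigned to a cube by recursively halving each normalized dimension, so nearby records share a long cube prefix and end up stored together.

```python
# Toy sketch (not Qbeast's actual implementation): assign a record to a "cube"
# by recursively halving each normalized dimension. Nearby points share a long
# cube prefix, so they land in the same region of the physical layout.

def cube_path(point, depth):
    """point: per-dimension values normalized to [0, 1); returns one child
    index per tree level (each node has 2^d children for d dimensions)."""
    path = []
    lo = [0.0] * len(point)
    hi = [1.0] * len(point)
    for _ in range(depth):
        child = 0
        for d, value in enumerate(point):
            mid = (lo[d] + hi[d]) / 2
            if value >= mid:        # upper half of this dimension
                child |= 1 << d
                lo[d] = mid
            else:                   # lower half of this dimension
                hi[d] = mid
        path.append(child)
    return tuple(path)

# e.g. a (timestamp, device_id, event_type) record, already normalized:
print(cube_path((0.12, 0.80, 0.33), depth=3))  # -> (2, 6, 0)
```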
The gain is not subtle. In a real-world telemetry workload, our approach reduced file scans by 99.9% and delivered a 6.3× speedup—without adding compaction jobs or disrupting ingestion.
A single table can serve both real-time and historical queries—no splitting, copying, or stitching required.
Query planning becomes more efficient, with smarter pruning and fewer full scans.
Qbeast brings intelligence to layout—so performance scales without duct tape.
Qbeast isn’t a new engine or a closed runtime. We extend open table formats like Delta Lake, Iceberg, and Hudi by embedding lightweight spatial metadata alongside existing structures. Your data stays where it is. Your tools don’t change. But your queries run faster—often dramatically so.
We handle high-cardinality columns. We support concurrent writers. We maintain layout quality without global locks or coordination overhead. And we enable efficient sampling for approximate queries with minimal extra cost.
At the core of Qbeast is an n-dimensional tree-based index that maps the distribution of values across selected columns. This index captures the density and structure of the data space and is persisted as metadata alongside the table’s open-format metadata. When the table changes, such as during an update or insert, Qbeast updates the index and writes new records directly into the appropriate spatial region, ensuring that the physical layout of the table remains well-organized over time. This incremental layout preservation dramatically reduces the need for expensive post-hoc clustering and improves the efficiency of compaction jobs.
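As a rough illustration of that incremental behavior, the toy sketch below (again Python, and not Qbeast’s real algorithm, which also tracks weights and persists metadata) splits a cube only when it overflows a capacity threshold, so dense regions grow deeper while sparse regions stay coarse.

```python
# Toy sketch of incremental layout maintenance: each cube holds records up to
# a capacity; on overflow it splits, so hot regions get finer-grained cubes
# while cold regions stay broad. Illustrative only.

CAPACITY = 4  # records per cube before it splits (illustrative value)

class Cube:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi    # bounding box in normalized space
        self.records = []            # records stored at this node
        self.children = {}           # child index -> Cube

    def insert(self, point):
        if self.children:                     # already split: route downward
            self._child_for(point).insert(point)
        elif len(self.records) < CAPACITY:    # room left: keep locality here
            self.records.append(point)
        else:                                 # overflow: split this cube once
            for p in self.records + [point]:
                self._child_for(p).insert(p)
            self.records = []

    def _child_for(self, point):
        idx, lo, hi = 0, list(self.lo), list(self.hi)
        for d, v in enumerate(point):
            mid = (self.lo[d] + self.hi[d]) / 2
            if v >= mid:
                idx |= 1 << d
                lo[d] = mid
            else:
                hi[d] = mid
        if idx not in self.children:
            self.children[idx] = Cube(tuple(lo), tuple(hi))
        return self.children[idx]

root = Cube((0.0, 0.0), (1.0, 1.0))
for p in [(0.1, 0.1), (0.12, 0.11), (0.11, 0.13), (0.13, 0.1), (0.14, 0.12), (0.9, 0.9)]:
    root.insert(p)   # the dense corner splits recursively; the sparse corner stays broad
```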
Importantly, this index is not queried directly. A Qbeast-indexed Delta Lake table, for example, can be queried with any standard Delta-compatible engine. The benefit comes from the physical layout optimization: by organizing records spatially at insert time, Qbeast lets existing engines take better advantage of built-in optimizations such as file-level data skipping and more aggressive pruning during query planning.
No engine modification or runtime coordination is required.
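For example, the file-level data skipping that Delta- and Iceberg-compatible engines already perform looks roughly like the sketch below; Qbeast’s contribution is that a spatial layout keeps the per-file min/max ranges tight, so far more files fail the overlap test. The file statistics here are made up for illustration.

```python
# Simplified sketch of the data skipping modern table-format engines already do:
# compare a query predicate against per-file min/max column statistics and
# skip files that cannot possibly match. Statistics below are illustrative.

files = [
    {"path": "part-000.parquet", "min": {"device_id": 1,   "ts": 100}, "max": {"device_id": 40,  "ts": 180}},
    {"path": "part-001.parquet", "min": {"device_id": 35,  "ts": 150}, "max": {"device_id": 90,  "ts": 260}},
    {"path": "part-002.parquet", "min": {"device_id": 400, "ts": 900}, "max": {"device_id": 480, "ts": 990}},
]

def may_match(f, column, lo, hi):
    """True if the file's [min, max] range for `column` overlaps [lo, hi]."""
    return f["min"][column] <= hi and f["max"][column] >= lo

# Predicate: device_id BETWEEN 10 AND 20 AND ts BETWEEN 120 AND 160
candidates = [f["path"] for f in files
              if may_match(f, "device_id", 10, 20) and may_match(f, "ts", 120, 160)]
print(candidates)  # ['part-000.parquet'] -- the other files are never read
```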
There is one exception where Qbeast can use the index at query time: sampling. For workloads that tolerate approximate results, Qbeast enables efficient sampling by identifying the smallest set of cubes (i.e., data files) that satisfies the requested sampling fraction.
While the sample may include more records than strictly required, it is often one or two orders of magnitude smaller than scanning the full dataset—enabling fast, low-cost exploration or model training over representative subsets.
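As a usage sketch, here is what fraction-based sampling might look like from PySpark, assuming the open-source qbeast-spark connector; the format name is taken from its public documentation, while the path, column names, and values are illustrative assumptions.

```python
# Hedged usage sketch (PySpark + qbeast-spark connector, details may differ):
# a fraction-based sample is resolved against the index, so only the smallest
# set of cube files covering that fraction is read instead of the whole table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("qbeast-sampling-sketch").getOrCreate()

events = spark.read.format("qbeast").load("s3://bucket/telemetry")  # hypothetical path

# Approximate exploration or feature extraction over ~1% of the data:
approx = (events.sample(fraction=0.01)                      # pushed down to cube selection
                .where("measurement_type = 'temperature'"))  # example predicate (assumed values)
approx.groupBy("device_id").avg("value").show()              # 'value' column is assumed
```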
Now that we’ve outlined the theory, how does Qbeast perform in practice? In this section, we explore a real-world telemetry workload to illustrate the benefits of a multi-dimensional spatial indexing model.
We worked with a customer managing large-scale time-series measurements, ingested continuously from connected devices. The dataset consisted of roughly 830 GiB of telemetry data stored in a Lakehouse table. We compared two layout strategies: traditional sort-based clustering and Qbeast multi-dimensional indexing.
For both configurations, the table was indexed or sorted on three columns: timestamp, device_id, and measurement_type.
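A hedged sketch of what the two write paths might look like in PySpark is shown below; the qbeast-spark option names and the Delta OPTIMIZE ... ZORDER BY syntax follow their public documentation, and the paths are placeholders rather than the customer’s actual pipeline.

```python
# Hedged sketch of the two layout strategies compared (option names and SQL
# syntax are assumptions based on qbeast-spark and Delta Lake docs).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/raw_telemetry")  # incoming batch (hypothetical path)

# (a) Qbeast: index the three columns at write time; layout is maintained on append.
(df.write.format("qbeast")
   .option("columnsToIndex", "timestamp,device_id,measurement_type")
   .mode("append")
   .save("s3://bucket/telemetry_qbeast"))

# (b) Sort-based clustering: plain append, then a periodic Z-order rewrite.
df.write.format("delta").mode("append").save("s3://bucket/telemetry_delta")
spark.sql("""
  OPTIMIZE delta.`s3://bucket/telemetry_delta`
  ZORDER BY (timestamp, device_id, measurement_type)
""")
```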
We profiled three representative queries across both layouts. As shown in Figure 3, Qbeast delivered speedups ranging from 1.8× to 6.3×, depending on the filter selectivity and data access patterns.
Drilling into Query 3, the performance gain came primarily from file pruning:
With sort-based clustering, the engine was able to skip ~60% of files, reading 333 GiB.
With Qbeast indexing, the same query skipped 99.9% of files, reading only 26 GiB.
This highlights how a multi-dimensional layout allows for far more effective pruning across complex filter predicates—not just along a single sort axis.
Faster queries don’t just improve responsiveness; they directly reduce compute costs. In the case of Query 1, a relatively modest 1.8× speedup led to a 55% reduction in executor runtime, thanks to the cumulative effect of opening fewer files, reading fewer bytes, and doing less filtering work per task.
These savings compound across production workloads, delivering meaningful reductions in compute spend over time.
One of the most surprising findings is that these gains in query performance don’t come at the cost of write-time complexity. In fact, Qbeast indexing is more efficient than sort-based clustering during data ingestion. As shown in Figure 4, we measured the compute time required for different write strategies:
A standard append (no optimization) was the baseline.
Qbeast append operations took just 30% more compute time than the baseline.
By contrast, a traditional append + optimize using sort-based clustering took 60% more time than the baseline, or 45% more than Qbeast.
Figure 4: Append + Optimize Execution Time
This shows that Qbeast maintains layout quality without the ordering burden typical of traditional clustering strategies, making layout optimization at ingestion time and frequent updates at scale practical.
Indexing isn't new. But it’s been missing from the cloud-native stack. As companies build larger data platforms and AI-powered applications, especially Retrieval-Augmented Generation (RAG) systems, every unnecessary scan wastes latency, money, and model tokens. Model training and fine-tuning are excellent use cases for Qbeast sampling, saving precious GPU cycles and reducing training time.
By pruning aggressively at the layout level, Qbeast shrinks candidate sets before they reach the engine. Your LLM gets cleaner context; your costs stay low; your data remains governed and secure.
Storage is cheap. Compute is elastic. Query engines are interchangeable. The last strategic lever left in modern data infrastructure is layout intelligence—ensuring that only the bytes a query actually needs are read.
Qbeast reintroduces indexing into the Lakehouse with a design that adapts to data density, scales across petabytes, and remains compatible with open standards and concurrent writers. It’s not just a faster way to run dashboards; it’s a structural upgrade for real-time analytics, AI pipelines, and high-concurrency workloads.
In that light, a 99%-skip index is not just an optimization—it’s the next foundational building block. Just as columnar storage changed the game in the early 2010s, adaptive indexing will define the next decade of analytics and AI infrastructure.
The Lakehouse unlocked adoption.
Qbeast solves performance.
In a follow-up article, we'll dive deeper into the technical details of how Qbeast delivers these results for any Lakehouse. In the meantime, don't hesitate to reach out to us for more information at info@qbeast.io.