The Lakehouse began as a revolt.
Traditional data warehouses gave us speed—but only through rigidity: fixed schemas, costly hardware, and inflexible scale-up pricing. Data lakes countered with openness and cheap, limitless storage—at the expense of performance and manageability. For a while, we had to choose between control and chaos.
Then open table formats (Delta, Iceberg, and Hudi) offered a third path: Lakehouse architectures that could run warehouse-grade analytics on top of object storage. This was a turning point. Analysts gained schema evolution without ceremony, engineers stopped tuning clusters, and every data engine from Spark to DuckDB could plug into the same lake. For a moment, it felt like the future had arrived.
But the real test was never onboarding. It was staying fast at scale.
At Qbeast, we’ve worked with dozens of teams scaling Lakehouses in the real world. They all hit the same pattern: flexibility starts strong, then quietly erodes performance. Ingestion pipelines create thousands of tiny, unsorted files. Statistics go stale. Filters that once skipped 90% of files now scan almost every file. Clustering jobs fall behind; compactions overload compute; partitioning strategies become unmanageable across high-cardinality keys.
Lakehouses today ingest data with no concern for query patterns. As long as the schema is valid and the file lands, the write is a success. But this “append first, optimize later” model drifts over time, making every filter, join, and aggregate more expensive.
Most teams resort to brute force: hourly Z-order clustering, hot/cold table pairs, custom UDFs, and materialized views. These techniques can work for a while, but they introduce complexity, duplicate storage, and still can’t keep up with fast-changing workloads.
It’s no surprise that the Lakehouse is taking over from traditional warehouses as the default choice, built around its key advantage: flexibility through brutal simplicity.
You can write anything and read it with anything, but that power comes at a cost.
At Qbeast, we believe that data layout is a first-class infrastructure concern. And that layout should adapt automatically—not through batch jobs or expensive rewrites, but intelligently and dynamically.
Inspired by Cubist art, we don’t see data as a linear stream to be sorted once and scanned forever. We see it as a multidimensional canvas: timestamp, user, region, event type—all forming the axes of a space to be navigated, not flattened.
Our multi-dimensional indexing system lays out data as a spatial tree. Hot regions split recursively. Cold ones stay broad. Each new record lands in the right “cube”, incrementally preserving locality. The result? Queries only touch the slices of the Lakehouse that matter.
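To make the idea concrete, here is a toy sketch in Python, purely illustrative rather than Qbeast’s actual implementation: a record is assigned to a cube by recursively halving each normalized dimension, so nearby records share a long cube prefix and end up stored together.

```python
# Toy sketch (not Qbeast's actual implementation): assign a record to a "cube"
# by recursively halving each normalized dimension. Nearby points share a long
# cube prefix, so they land in the same region of the physical layout.

def cube_path(point, depth):
    """point: per-dimension values normalized to [0, 1); returns one child
    index per tree level (each node has 2^d children for d dimensions)."""
    path = []
    lo = [0.0] * len(point)
    hi = [1.0] * len(point)
    for _ in range(depth):
        child = 0
        for d, value in enumerate(point):
            mid = (lo[d] + hi[d]) / 2
            if value >= mid:        # upper half of this dimension
                child |= 1 << d
                lo[d] = mid
            else:                   # lower half of this dimension
                hi[d] = mid
        path.append(child)
    return tuple(path)

# e.g. a (timestamp, device_id, event_type) record, already normalized:
print(cube_path((0.12, 0.80, 0.33), depth=3))  # -> (2, 6, 0)
```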
The gain is not subtle. In a real-world telemetry workload, our approach reduced file scans by 99.9% and delivered a 6.3× speedup—without adding compaction jobs or disrupting ingestion.
A single table can serve both real-time and historical queries—no splitting, copying, or stitching required.
Query planning becomes more efficient, with smarter pruning and fewer full scans.
Qbeast brings intelligence to layout—so performance scales without duct tape.
Qbeast isn’t a new engine or a closed runtime. We extend open table formats like Delta Lake, Iceberg, and Hudi by embedding lightweight spatial metadata alongside existing structures. Your data stays where it is. Your tools don’t change. But your queries run faster—often dramatically so.
We handle high-cardinality columns. We support concurrent writers. We maintain layout quality without global locks or coordination overhead. And we enable efficient sampling for approximate queries with minimal extra cost.
At the core of Qbeast is an n-dimensional tree-based index that maps the distribution of values across selected columns. This index captures the density and structure of the data space and is persisted as metadata alongside the table’s open-format metadata. When the table changes, such as during an update or insert, Qbeast updates the index and writes new records directly into the appropriate spatial region, ensuring that the physical layout of the table remains well-organized over time. This incremental layout preservation dramatically reduces the need for expensive post-hoc clustering and improves the efficiency of compaction jobs.
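As a rough illustration of that incremental behavior, the toy sketch below (again Python, and not Qbeast’s real algorithm, which also tracks weights and persists metadata) splits a cube only when it overflows a capacity threshold, so dense regions grow deeper while sparse regions stay coarse.

```python
# Toy sketch of incremental layout maintenance: each cube holds records up to
# a capacity; on overflow it splits, so hot regions get finer-grained cubes
# while cold regions stay broad. Illustrative only.

CAPACITY = 4  # records per cube before it splits (illustrative value)

class Cube:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi    # bounding box in normalized space
        self.records = []            # records stored at this node
        self.children = {}           # child index -> Cube

    def insert(self, point):
        if self.children:                     # already split: route downward
            self._child_for(point).insert(point)
        elif len(self.records) < CAPACITY:    # room left: keep locality here
            self.records.append(point)
        else:                                 # overflow: split this cube once
            for p in self.records + [point]:
                self._child_for(p).insert(p)
            self.records = []

    def _child_for(self, point):
        idx, lo, hi = 0, list(self.lo), list(self.hi)
        for d, v in enumerate(point):
            mid = (self.lo[d] + self.hi[d]) / 2
            if v >= mid:
                idx |= 1 << d
                lo[d] = mid
            else:
                hi[d] = mid
        if idx not in self.children:
            self.children[idx] = Cube(tuple(lo), tuple(hi))
        return self.children[idx]

root = Cube((0.0, 0.0), (1.0, 1.0))
for p in [(0.1, 0.1), (0.12, 0.11), (0.11, 0.13), (0.13, 0.1), (0.14, 0.12), (0.9, 0.9)]:
    root.insert(p)   # the dense corner splits recursively; the sparse corner stays broad
```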
Importantly, this index is not queried directly. A Qbeast-indexed Delta Lake table, for example, can be queried with any standard Delta-compatible engine. The benefit comes from the physical layout optimization: by organizing records spatially at insert time, Qbeast lets existing engines take better advantage of built-in optimizations such as file-level data skipping and more aggressive pruning during query planning.
No engine modification or runtime coordination is required.
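For example, the file-level data skipping that Delta- and Iceberg-compatible engines already perform looks roughly like the sketch below; Qbeast’s contribution is that a spatial layout keeps the per-file min/max ranges tight, so far more files fail the overlap test. The file statistics here are made up for illustration.

```python
# Simplified sketch of the data skipping modern table-format engines already do:
# compare a query predicate against per-file min/max column statistics and
# skip files that cannot possibly match. Statistics below are illustrative.

files = [
    {"path": "part-000.parquet", "min": {"device_id": 1,   "ts": 100}, "max": {"device_id": 40,  "ts": 180}},
    {"path": "part-001.parquet", "min": {"device_id": 35,  "ts": 150}, "max": {"device_id": 90,  "ts": 260}},
    {"path": "part-002.parquet", "min": {"device_id": 400, "ts": 900}, "max": {"device_id": 480, "ts": 990}},
]

def may_match(f, column, lo, hi):
    """True if the file's [min, max] range for `column` overlaps [lo, hi]."""
    return f["min"][column] <= hi and f["max"][column] >= lo

# Predicate: device_id BETWEEN 10 AND 20 AND ts BETWEEN 120 AND 160
candidates = [f["path"] for f in files
              if may_match(f, "device_id", 10, 20) and may_match(f, "ts", 120, 160)]
print(candidates)  # ['part-000.parquet'] -- the other files are never read
```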
There is one exception where Qbeast can use the index at query time: sampling. For workloads that tolerate approximate results, Qbeast enables efficient sampling by identifying the smallest set of cubes (i.e., data files) that satisfies the requested sampling fraction.
While the sample may include more records than strictly required, it is often one or two orders of magnitude smaller than scanning the full dataset—enabling fast, low-cost exploration or model training over representative subsets.
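As a usage sketch, here is what fraction-based sampling might look like from PySpark, assuming the open-source qbeast-spark connector; the format name is taken from its public documentation, while the path, column names, and values are illustrative assumptions.

```python
# Hedged usage sketch (PySpark + qbeast-spark connector, details may differ):
# a fraction-based sample is resolved against the index, so only the smallest
# set of cube files covering that fraction is read instead of the whole table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("qbeast-sampling-sketch").getOrCreate()

events = spark.read.format("qbeast").load("s3://bucket/telemetry")  # hypothetical path

# Approximate exploration or feature extraction over ~1% of the data:
approx = (events.sample(fraction=0.01)                      # pushed down to cube selection
                .where("measurement_type = 'temperature'"))  # example predicate (assumed values)
approx.groupBy("device_id").avg("value").show()              # 'value' column is assumed
```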
Now that we’ve outlined the theory, how does Qbeast perform in practice? In this section, we explore a real-world telemetry workload to illustrate the benefits of a multi-dimensional spatial indexing model.
We worked with a customer managing large-scale time-series measurements, ingested continuously from connected devices. The dataset consisted of roughly 830 GiB of telemetry data stored in a Lakehouse table. We compared two layout strategies: traditional sort-based clustering and Qbeast multi-dimensional indexing.
For both configurations, the table was indexed or sorted on three columns: timestamp, device_id, and measurement_type.
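A hedged sketch of what the two write paths might look like in PySpark is shown below; the qbeast-spark option names and the Delta OPTIMIZE ... ZORDER BY syntax follow their public documentation, and the paths are placeholders rather than the customer’s actual pipeline.

```python
# Hedged sketch of the two layout strategies compared (option names and SQL
# syntax are assumptions based on qbeast-spark and Delta Lake docs).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/raw_telemetry")  # incoming batch (hypothetical path)

# (a) Qbeast: index the three columns at write time; layout is maintained on append.
(df.write.format("qbeast")
   .option("columnsToIndex", "timestamp,device_id,measurement_type")
   .mode("append")
   .save("s3://bucket/telemetry_qbeast"))

# (b) Sort-based clustering: plain append, then a periodic Z-order rewrite.
df.write.format("delta").mode("append").save("s3://bucket/telemetry_delta")
spark.sql("""
  OPTIMIZE delta.`s3://bucket/telemetry_delta`
  ZORDER BY (timestamp, device_id, measurement_type)
""")
```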
We profiled three representative queries across both layouts. As shown in Figure 3, Qbeast delivered speedups ranging from 1.8× to 6.3×, depending on the filter selectivity and data access patterns.
Drilling into Query 3, the performance gain came primarily from file pruning:
With sort-based clustering, the engine was able to skip ~60% of files, reading 333 GiB.
With Qbeast indexing, the same query skipped 99.9% of files, reading only 26 GiB.
This highlights how a multi-dimensional layout allows for far more effective pruning across complex filter predicates—not just along a single sort axis.
Faster queries don’t just improve responsiveness; they directly reduce compute costs. In the case of Query 1, a relatively modest 1.8× speedup led to a 55% reduction in executor runtime, thanks to the cumulative effect of opening fewer files, reading fewer bytes, and doing less filtering work per task.
These savings compound across production workloads, delivering meaningful reductions in compute spend over time.
One of the most surprising findings is that these gains in query performance don’t come at the cost of write-time complexity. In fact, Qbeast indexing is more efficient than sort-based clustering during data ingestion. As shown in Figure 4, we measured the compute time required for different write strategies:
A standard append (no optimization) was the baseline.
Qbeast append operations took just 30% more compute time than the baseline.
By contrast, a traditional append + optimize using sort-based clustering took 60% more time than the baseline, or 45% more than Qbeast.
Figure 4: Append + Optimize Execution Time
This shows that Qbeast maintains layout quality without the ordering burden typical of traditional clustering strategies, making layout optimization at ingestion time and frequent updates at scale practical.
Indexing isn't new. But it’s been missing from the cloud-native stack. As companies build larger data platforms and AI-powered applications, especially Retrieval-Augmented Generation (RAG) systems, every unnecessary scan wastes latency, money, and model tokens. Model training and fine-tuning are excellent use cases for Qbeast sampling, saving precious GPU cycles and reducing training time.
By pruning aggressively at the layout level, Qbeast shrinks candidate sets before they reach the engine. Your LLM gets cleaner context; your costs stay low; your data remains governed and secure.
Storage is cheap. Compute is elastic. Query engines are interchangeable. The last strategic lever left in modern data infrastructure is layout intelligence—ensuring that only the bytes a query actually needs are read.
Qbeast reintroduces indexing into the Lakehouse with a design that adapts to data density, scales across petabytes, and remains compatible with open standards and concurrent writers. It’s not just a faster way to run dashboards; it’s a structural upgrade for real-time analytics, AI pipelines, and high-concurrency workloads.
In that light, a 99%-skip index is not just an optimization—it’s the next foundational building block. Just as columnar storage changed the game in the early 2010s, adaptive indexing will define the next decade of analytics and AI infrastructure.
The Lakehouse unlocked adoption.
Qbeast solves performance.
In a follow-up article, we'll dive deeper into the technical details of how Qbeast delivers these results for any Lakehouse. In the meantime, don't hesitate to reach out to us for more information at info@qbeast.io.