Volcano Iterator
Navigating the Volcano: Efficient Query Processing in Graph Databases
The Volcano iterator model, a cornerstone of relational database query execution since its introduction in the early 1990s, operates on a simple yet powerful principle: each operator in a query plan acts as an iterator that produces a stream of tuples one at a time, invoked via methods like `open()`, `next()`, and `close()`. This pipelined approach allows operators to lazily pull data from their children, enabling dynamic query plans and extensibility in systems like Volcano itself.
As explained in CMU 15-445 query execution notes and modern overviews, it transforms a query into a directed acyclic graph (DAG) where data flows upward from leaves (e.g., scans) to the root (output), processing tuples incrementally to avoid full materialization. Note: The “materialization model” discussed in the CMU notes refers to an inefficient approach that processes all input at once, distinct from the “materialization walk” paradigm proposed here, which lazily materializes only what’s needed by walking a generator tree.
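To make the pull-based contract concrete, here is a minimal Python sketch—with illustrative names, not any particular engine's API—of a filter operator pulling tuples one at a time from a scan:

```python
# Minimal Volcano-style operators: each exposes open()/next()/close()
# and pulls tuples from its child on demand (names are illustrative).

class Scan:
    """Leaf operator: iterates over an in-memory table."""
    def __init__(self, rows):
        self.rows = rows
        self.it = None

    def open(self):
        self.it = iter(self.rows)

    def next(self):
        return next(self.it, None)  # None signals end-of-stream

    def close(self):
        self.it = None


class Filter:
    """Pulls one tuple at a time from its child, emitting only matches."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def open(self):
        self.child.open()

    def next(self):
        while (tup := self.child.next()) is not None:
            if self.predicate(tup):
                return tup
        return None

    def close(self):
        self.child.close()


# Pull-based execution: the root drives the whole pipeline, so no
# intermediate result is materialized up front.
plan = Filter(Scan([(1, "ada"), (2, "bob")]), lambda t: t[0] > 1)
plan.open()
while (row := plan.next()) is not None:
    print(row)  # (2, 'bob')
plan.close()
```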
In practice, this builds a flat list of tuples—essentially rows of values—through iterative calls. For instance, in the LadybugDB graph database, the physical operator hierarchy in `physical_operator.h` relies on `getNextTuple()` to fetch the next row from child operators, populating a `ResultSet` (a container for query output) with `ValueVector`s that store column-wise data chunks of up to 2048 elements. These vectors hold homogeneous types like integers or strings in byte arrays, augmented with null masks and auxiliary buffers for variable-length data, ensuring efficient, cache-friendly processing without redundant copies.
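As a rough illustration of that layout—class and field names below are hypothetical, not LadybugDB's actual code—a vector pairs a fixed-capacity, homogeneous payload with a null mask:

```python
# Hypothetical sketch of a column vector with a null mask; a real engine
# would back `data` with a typed byte array plus auxiliary buffers for
# variable-length values, not a Python list.
VECTOR_CAPACITY = 2048  # elements per chunk, as described above

class ColumnVector:
    def __init__(self):
        self.data = [None] * VECTOR_CAPACITY        # homogeneous values
        self.null_mask = [False] * VECTOR_CAPACITY  # True where value is NULL
        self.size = 0                               # filled entries

    def append(self, value):
        assert self.size < VECTOR_CAPACITY
        self.data[self.size] = value
        self.null_mask[self.size] = value is None
        self.size += 1
```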
Similarly, Kùzu's `ResultSet` extends this vectorized paradigm, aggregating `DataChunk`s—bundles of aligned `ValueVector`s sharing a `SelectionState` for zero-copy filtering—into a columnar stream that implicitly represents the Cartesian product of intermediate results, streaming tuples lazily to minimize memory overhead. Graph databases like LadybugDB thus produce a flat list of tuples, denormalizing nested graph structures (e.g., paths or trees) into repeated rows, which explodes output size for complex traversals.
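A toy version of the shared-selection idea (again, illustrative names rather than Kùzu's classes): filtering shrinks one shared index list while the column payloads stay in place.

```python
# Aligned columns share a single selection vector; filtering rewrites the
# indices, never the payloads, which is what makes it "zero-copy".
class DataChunkSketch:
    def __init__(self, columns, size):
        self.columns = columns              # e.g. {"id": [...], "name": [...]}
        self.selection = list(range(size))  # shared selection state

    def filter(self, column, predicate):
        col = self.columns[column]
        self.selection = [i for i in self.selection if predicate(col[i])]

    def rows(self):
        for i in self.selection:
            yield {name: col[i] for name, col in self.columns.items()}


chunk = DataChunkSketch({"id": [1, 2, 3], "name": ["ada", "bob", "eve"]}, 3)
chunk.filter("id", lambda v: v != 2)
print(list(chunk.rows()))  # [{'id': 1, 'name': 'ada'}, {'id': 3, 'name': 'eve'}]
```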
For those following similar work in other projects, `ValueVector` is analogous to DataFusion's open-variant and iterlib's `dynamic`. Note that much of this code was written circa 2014, long before `std::variant` was available. It is not clear whether any database uses `std::variant` in a hot path, but `iterlib::dynamic` was used in a hot path for a couple of years.
A key innovation in query languages came with GraphQL, which decouples input from output paradigms by making the response shape predictable and isomorphic to the query itself. Unlike REST endpoints that return fixed JSON blobs, GraphQL's schema-defined queries—e.g., `{ user(id: "1") { name friends { name } } }`—yield hierarchical JSON mirroring the nesting, allowing clients to request precisely shaped data without over- or under-fetching. This declarative structure, validated against the type system at runtime, ensures tools can introspect and predict responses, fostering efficient frontend integration. Reproducing this efficiency in a Volcano model, however, proves challenging: the iterator's tuple-at-a-time pull forces early, full expansion of joins into flat lists, materializing massive Cartesian products for graph queries with high fan-out (e.g., social-network paths). Even with vectorization, downstream operators must process redundant data, bloating intermediate `ResultSet`s and straining memory and CPU, especially for m-n relationships where output cardinality grows exponentially.
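To see the mismatch concretely, compare the nested response that query implies with the flat tuples a Volcano pipeline would stream for the same data (hand-written values for illustration):

```python
# GraphQL returns a hierarchy isomorphic to the query...
graphql_response = {
    "user": {"name": "Ada", "friends": [{"name": "Bob"}, {"name": "Eve"}]}
}

# ...whereas a flat tuple stream repeats the parent once per friend; with
# deeper traversals and higher fan-out this Cartesian product explodes.
flat_tuples = [
    ("Ada", "Bob"),
    ("Ada", "Eve"),
]
```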
To mitigate this, factorized joins in the Kùzu database represent a pivotal advance, compressing intermediate representations rather than fully expanding them. Traditional hash or nested-loop joins compute entire cross-products before filtering; Kùzu's factorized approach—leveraging recursive encodings for multi-valued attributes—stores shared prefixes once (e.g., a user's details, instead of repeating them for each friend) and expands only on demand via novel operators like ASP-Join (Accumulate, Semi-probe, Probe). This shrinks intermediate `ResultSet`s dramatically even when final outputs are large, avoids redundant computation in aggregations and projections, and preserves sequential scans while skipping unneeded portions of files, scaling to 100GB+ graphs. Benchmarks show it outperforming the flat plans of baselines like Neo4j on both cyclic and acyclic queries, aligning Volcano's pipelining with graph-native compression.
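The core idea can be shown in a few lines—this is a toy encoding of factorization, not Kùzu's actual representation or its ASP-Join operator:

```python
# Shared prefixes are stored once; the multi-valued extension stays
# compressed as a list until something actually needs flat tuples.
factorized = [
    {"prefix": ("Ada", "NYC"), "friends": ["Bob", "Eve", "Joe"]},
    {"prefix": ("Bob", "LDN"), "friends": ["Ada"]},
]

def expand(rows):
    """Lazily flatten into the equivalent Cartesian-product tuples."""
    for row in rows:
        for friend in row["friends"]:
            yield (*row["prefix"], friend)

# Some aggregates can run on the compressed form without expanding:
count = sum(len(r["friends"]) for r in factorized)  # COUNT(*) == 4
print(count, list(expand(factorized)))
```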
Building further on this, a promising evolution materializes query results as trees of nested async generators, deferring expansion until consumption. In Python-based libraries like fquery, tests such as `test_walk.py` and `test_walk_obj.py` illustrate this: generators yield hierarchical structures lazily (e.g., a root node spawning async sub-generators for child traversals), enabling GraphQL-like nesting without upfront flattening. This is straightforward in scripting languages—`async def` generators delegating to children (`yield from` in synchronous generators; nested `async for` in async ones, which do not support `yield from`)—but extending factorized joins to arbitrary nested graph queries (e.g., variable-depth paths) remains elusive in core DBMS engines. Factorization excels at fixed-depth m-n joins but struggles with dynamic recursion, risking recomputation or schema mismatches in Volcano pipelines, which highlights the need for hybrid models blending vectorized iterators with generator hierarchies.
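Here is a minimal sketch of the generator-tree pattern, using a hypothetical in-memory adjacency map in place of a real graph store (fquery's actual tests differ in detail):

```python
import asyncio

# Hypothetical adjacency data standing in for a graph store.
FRIENDS = {"ada": ["bob", "eve"], "bob": [], "eve": ["ada"]}

async def walk(user, depth):
    """Yield one node whose children are themselves unconsumed async
    generators, so nothing below this node materializes yet."""
    children = (walk(f, depth - 1) for f in FRIENDS[user]) if depth else ()
    yield {"name": user, "friends": children}

async def materialize(gen):
    """Walk the generator tree, expanding only what the caller touches."""
    async for node in gen:
        node["friends"] = [await materialize(c) for c in node["friends"]]
        return node

print(asyncio.run(materialize(walk("ada", 2))))
# {'name': 'ada', 'friends': [{'name': 'bob', 'friends': []},
#                             {'name': 'eve', 'friends': [{'name': 'ada', 'friends': []}]}]}
```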