Skip to content

Support selectivity metrics (for Explain Analyze) in Nested Loop Join #18407

@2010YOUY01

Description

@2010YOUY01

Is your feature request related to a problem or challenge?

In a Nested Loop Join, the selectivity metric is defined as output_rows / possible_combinations. I believe this provides useful application-level insight.

Example

In the below query (run in datafusion-cli)

> set datafusion.explain.analyze_level = summary;
0 row(s) fetched.
Elapsed 0.000 seconds.

> explain analyze select *
from generate_series(10) as t1(a)
join generate_series(10) as t2(b)
on (t1.a + t2.b) = 20;

> explain analyze select *
from generate_series(10) as t1(a)
join generate_series(10) as t2(b)
on (t1.a + t2.b) = 20;
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                                                               |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | NestedLoopJoinExec: join_type=Inner, filter=a@0 + b@1 = 20, metrics=[output_rows=1, elapsed_compute=709.665µs, output_bytes=128.0 KB]                                              |
|                   |   ProjectionExec: expr=[value@0 as a], metrics=[output_rows=11, elapsed_compute=1.958µs, output_bytes=64.0 KB]                                                                     |
|                   |     LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10, batch_size=8192], metrics=[output_rows=11, elapsed_compute=13.083µs, output_bytes=64.0 KB]   |
|                   |   RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[]                                                                                                |
|                   |     ProjectionExec: expr=[value@0 as b], metrics=[output_rows=11, elapsed_compute=584ns, output_bytes=64.0 KB]                                                                     |
|                   |       LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10, batch_size=8192], metrics=[output_rows=11, elapsed_compute=14.875µs, output_bytes=64.0 KB] |
|                   |                                                                                                                                                                                    |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.

Logically NLJ will do

for left_row in t1:
    for right_row in t2:
        if left_row.a + right_row.b = 20:
            output(left_row, right_row)

The selectivity will be calculated as output_size / (left_size * right_size) => 1 / (10 * 10) => 1%

Describe the solution you'd like

Add a selectivity metrics to NestedLoopJoinExec.
Reference PR: #18406

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions