-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Closed
Copy link
Labels
enhancementNew feature or requestNew feature or request
Description
Is your feature request related to a problem or challenge?
In a Nested Loop Join, the selectivity metric is defined as output_rows / possible_combinations. I believe this provides useful application-level insight.
Example
In the below query (run in datafusion-cli)
> set datafusion.explain.analyze_level = summary;
0 row(s) fetched.
Elapsed 0.000 seconds.
> explain analyze select *
from generate_series(10) as t1(a)
join generate_series(10) as t2(b)
on (t1.a + t2.b) = 20;
> explain analyze select *
from generate_series(10) as t1(a)
join generate_series(10) as t2(b)
on (t1.a + t2.b) = 20;
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | NestedLoopJoinExec: join_type=Inner, filter=a@0 + b@1 = 20, metrics=[output_rows=1, elapsed_compute=709.665µs, output_bytes=128.0 KB] |
| | ProjectionExec: expr=[value@0 as a], metrics=[output_rows=11, elapsed_compute=1.958µs, output_bytes=64.0 KB] |
| | LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10, batch_size=8192], metrics=[output_rows=11, elapsed_compute=13.083µs, output_bytes=64.0 KB] |
| | RepartitionExec: partitioning=RoundRobinBatch(14), input_partitions=1, metrics=[] |
| | ProjectionExec: expr=[value@0 as b], metrics=[output_rows=11, elapsed_compute=584ns, output_bytes=64.0 KB] |
| | LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10, batch_size=8192], metrics=[output_rows=11, elapsed_compute=14.875µs, output_bytes=64.0 KB] |
| | |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.
Logically NLJ will do
for left_row in t1:
for right_row in t2:
if left_row.a + right_row.b = 20:
output(left_row, right_row)
The selectivity will be calculated as output_size / (left_size * right_size) => 1 / (10 * 10) => 1%
Describe the solution you'd like
Add a selectivity metrics to NestedLoopJoinExec.
Reference PR: #18406
Describe alternatives you've considered
No response
Additional context
No response
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request