how to run tpch benchmark datafusion #16598

l1t1 · 2025-06-28T04:05:32Z

I built it from source. When I run
/par/datafusion-main/target/release/tpch benchmark datafusion --path /par/tpch/sf4-parquet --format parquet
I get the following error message:

Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 3, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/par/tpch/sf4-parquet", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, prefer_hash_join: true, sorted: false }
Error: ObjectStore(NotFound { path: "/par/tpch/sf4-parquet/part", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } })

The data files are as follows:

ls -l /par/tpch/sf4-parquet
total 1456872
-rw-r--r-- 1 root root  53985688 Jun 27 05:54 customer.parquet
-rw-r--r-- 1 root root 975758540 Jun 27 05:54 lineitem.parquet
-rw-r--r-- 1 root root      2966 Jun 27 05:54 nation.parquet
-rw-r--r-- 1 root root 262169281 Jun 27 05:54 orders.parquet
-rw-r--r-- 1 root root  27412078 Jun 27 05:54 part.parquet
-rw-r--r-- 1 root root 168886901 Jun 27 05:54 partsupp.parquet
-rw-r--r-- 1 root root      1474 Jun 27 05:54 region.parquet
-rw-r--r-- 1 root root   3581714 Jun 27 05:54 supplier.parquet

Answered by zhuqi-lucas


zhuqi-lucas · 2025-06-28T06:53:46Z

zhuqi-lucas
Jun 28, 2025
Collaborator

This isn’t a bug in DataFusion; it comes from how the TPC-H benchmark runner expects your data to be laid out. By default it looks under your --path for one directory per table (named exactly after the table) and expects one or more Parquet files inside that directory. What you have today is a flat directory of files:

/par/tpch/sf4-parquet/
├─ customer.parquet
├─ lineitem.parquet
├─ nation.parquet
├─ orders.parquet
├─ part.parquet
├─ partsupp.parquet
├─ region.parquet
└─ supplier.parquet

When it tries to read the table part, it literally does a list() on /par/tpch/sf4-parquet/part (i.e. a directory). That directory doesn’t exist, hence the “NotFound … path: …/part” error.

An easy way to fix it:

cd /par/tpch/sf4-parquet

for tbl in customer lineitem nation orders part partsupp region supplier; do
  mkdir -p "$tbl"
  mv "${tbl}.parquet" "$tbl/"
done
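
If you may need to run the move more than once, or some files have already been moved, the same loop can be wrapped in a small function that skips missing files. This is only a sketch: nest_tpch_tables is a hypothetical helper name, not something shipped with DataFusion, and it assumes the eight table names listed above.

```shell
# Hypothetical helper (not part of DataFusion): move each flat
# <table>.parquet file in the given directory into <table>/.
nest_tpch_tables() {
  nest_dir="$1"
  for nest_tbl in customer lineitem nation orders part partsupp region supplier; do
    # Only act on files that are still in the flat layout, so that
    # running the function a second time does nothing and does not fail.
    if [ -f "$nest_dir/$nest_tbl.parquet" ]; then
      mkdir -p "$nest_dir/$nest_tbl"
      mv "$nest_dir/$nest_tbl.parquet" "$nest_dir/$nest_tbl/"
    fi
  done
}
```

For the directory in the question, the call would be nest_tpch_tables /par/tpch/sf4-parquet.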

Or you can use the DataFusion benchmark script to generate the TPC-H data:

https://github.com/apache/datafusion/blob/main/benchmarks/README.md

./bench.sh data tpch
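
Whichever route you take, it can help to confirm the layout before re-running the benchmark. The sketch below checks that every table has its own directory containing at least one file ending in .parquet. check_tpch_layout is a hypothetical helper name, and the .parquet extension is an assumption based on the files shown in the question.

```shell
# Hypothetical helper (not part of DataFusion): report every table whose
# directory is missing or contains no *.parquet file.
check_tpch_layout() {
  check_dir="$1"
  check_status=0
  for check_tbl in customer lineitem nation orders part partsupp region supplier; do
    # ls fails if the directory does not exist or the glob matches nothing.
    if ! ls "$check_dir/$check_tbl"/*.parquet >/dev/null 2>&1; then
      echo "missing: $check_dir/$check_tbl/*.parquet"
      check_status=1
    fi
  done
  # Returns 0 only when all eight tables were found.
  return "$check_status"
}
```

Calling check_tpch_layout /par/tpch/sf4-parquet prints one line per table that still needs fixing and returns a non-zero status if any are missing.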
