how to run tpch benchmark datafusion #16598
I built it from source. When I run the benchmark, I get:
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 3, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/par/tpch/sf4-parquet", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, prefer_hash_join: true, sorted: false }
Error: ObjectStore(NotFound { path: "/par/tpch/sf4-parquet/part", source: Os { code: 2, kind: NotFound, message: "No such file or directory" } })
The data files are as follows:
ls -l /par/tpch/sf4-parquet
total 1456872
-rw-r--r-- 1 root root 53985688 Jun 27 05:54 customer.parquet
-rw-r--r-- 1 root root 975758540 Jun 27 05:54 lineitem.parquet
-rw-r--r-- 1 root root 2966 Jun 27 05:54 nation.parquet
-rw-r--r-- 1 root root 262169281 Jun 27 05:54 orders.parquet
-rw-r--r-- 1 root root 27412078 Jun 27 05:54 part.parquet
-rw-r--r-- 1 root root 168886901 Jun 27 05:54 partsupp.parquet
-rw-r--r-- 1 root root 1474 Jun 27 05:54 region.parquet
-rw-r--r-- 1 root root 3581714 Jun 27 05:54 supplier.parquet
Replies: 1 comment
It isn’t a bug in DataFusion so much as in how the TPCH benchmark runner expects your data laid out. By default it will look under your --path for one directory per table (named exactly after the table), and then inside that directory expect one or more Parquet files. What you have today is a flat directory of files:
/par/tpch/sf4-parquet/
├─ customer.parquet
├─ lineitem.parquet
├─ nation.parquet
├─ orders.parquet
├─ part.parquet
├─ partsupp.parquet
├─ region.parquet
└─ supplier.parquet
When it tries to read the table part, it literally does a list() on /par/tpch/sf4-parquet/part (i.e. a directory), which doesn’t exist, hence the “NotFound … path: …/part” error.
An easy way to fix it:
cd /par/tpch/sf4-parquet
for tbl in customer lineitem nation orders part partsupp region supplier; do
mkdir -p "$tbl"
mv "${tbl}.parquet" "$tbl/"
done
Or you can use the DataFusion benchmark script to generate the TPC-H data (see https://github.com/apache/datafusion/blob/main/benchmarks/README.md):
./bench.sh data tpch
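If you would rather not have the move fail halfway when a file is missing, the same idea can be wrapped in a small function. This is only a sketch: `to_table_dirs` is a hypothetical helper name, not part of DataFusion or `bench.sh`, and it assumes the flat files are named `<table>.parquet` as in the listing in the question.

```shell
# Hypothetical helper (not part of DataFusion): move each <table>.parquet
# found in a flat directory into its own <table>/ subdirectory, which is the
# one-directory-per-table layout described in this reply.
to_table_dirs() {
  local dir="$1" tbl
  for tbl in customer lineitem nation orders part partsupp region supplier; do
    if [ -f "$dir/$tbl.parquet" ]; then
      mkdir -p "$dir/$tbl"
      mv "$dir/$tbl.parquet" "$dir/$tbl/"
    else
      # Report and skip instead of aborting the whole loop.
      echo "skipping $tbl: $dir/$tbl.parquet not found" >&2
    fi
  done
}
```

You would call it as `to_table_dirs /par/tpch/sf4-parquet` and then re-run the benchmark with the same --path. Running it a second time just reports every table as skipped, since the files have already been moved.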