Replies: 1 comment
-
This is great. I've also run into very similar situations, and being able to handle this out of the box would be cool. It would probably be best as an opt-in option, though, since the metadata extraction can add significant overhead with many partitions in queries that wouldn't benefit from it. @aborruso, I just wanted to mention an alternative strategy we've been using in cases where you have control over how the target data is partitioned: we use the DuckDB h3 extension to partition the GeoParquet by H3 cell (at an appropriate resolution). This also reduces the number of files we have to touch, but without needing to look at the metadata.
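For anyone curious, here is a minimal sketch of that partitioning step. It assumes the h3 community extension's h3_latlng_to_cell function and a polygon column named geom; the paths, resolution, and column names are illustrative, not from our actual setup:

```sql
INSTALL h3 FROM community;
LOAD h3;
INSTALL spatial;
LOAD spatial;

-- Rewrite the dataset hive-partitioned by H3 cell, keyed on each
-- parcel's centroid. Resolution 5 is just an example; pick one that
-- matches your data density.
COPY (
    SELECT *,
           h3_latlng_to_cell(ST_Y(ST_Centroid(geom)),
                             ST_X(ST_Centroid(geom)),
                             5) AS h3_cell
    FROM read_parquet('parcels/**/*.parquet')
) TO 'parcels_by_h3' (FORMAT PARQUET, PARTITION_BY (h3_cell));
```

At query time you compute the cell for your point with the same function and read only the matching partition.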
-
Hello,
First off, thank you for this incredible tool and extension. I'm using it for large-scale geospatial analysis with a partitioned GeoParquet dataset, and it's been a game-changer.
It's probably already possible, but I don't know how to do it.
The Scenario
I'm working with a large dataset of cadastral parcels for Italy, stored as a hive-partitioned GeoParquet dataset. A common task is to find which specific parcel contains a given point (longitude, latitude).
The "Naive" Query
My initial approach was a straightforward spatial query. With this approach DuckDB has to scan all the data, and on the full national dataset the query takes about 35 seconds.
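For reference, the query looked roughly like this (the path, the geom column name, and the coordinates are illustrative):

```sql
INSTALL spatial;
LOAD spatial;

-- Find the parcel containing a point of interest (lon, lat)
SELECT *
FROM read_parquet('parcels/**/*.parquet')
WHERE ST_Contains(geom, ST_Point(12.4924, 41.8902));
```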
The Manually Optimized Approach
I realized I could dramatically speed this up by manually pre-filtering the files using the geo metadata stored in each Parquet file, which includes that file's bounding box (bbox).
Step 1: Identify candidate files by querying metadata. This is extremely fast.
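A sketch of this step, assuming the standard GeoParquet layout where each file's geo key is JSON carrying a per-file bbox for a column named geometry (the path, column name, and coordinates are illustrative):

```sql
-- Pull the GeoParquet 'geo' key out of each file's key-value metadata
-- and keep only the files whose bbox contains the point of interest.
WITH geo_meta AS (
    SELECT file_name,
           json_extract(decode(value), '$.columns.geometry.bbox') AS bbox
    FROM parquet_kv_metadata('parcels/**/*.parquet')
    WHERE decode(key) = 'geo'
)
SELECT file_name
FROM geo_meta
WHERE CAST(json_extract_string(bbox, '$[0]') AS DOUBLE) <= 12.4924  -- xmin <= lon
  AND CAST(json_extract_string(bbox, '$[2]') AS DOUBLE) >= 12.4924  -- xmax >= lon
  AND CAST(json_extract_string(bbox, '$[1]') AS DOUBLE) <= 41.8902  -- ymin <= lat
  AND CAST(json_extract_string(bbox, '$[3]') AS DOUBLE) >= 41.8902; -- ymax >= lat
```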
Step 2: Run the spatial query only on the candidate files.
By restricting the main query to the handful of files from Step 1, it becomes incredibly fast.
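Concretely, something like this, where the file list stands in for the (hypothetical) output of Step 1:

```sql
SELECT *
FROM read_parquet([
    'parcels/part-0042.parquet',  -- candidate files from Step 1
    'parcels/part-0113.parquet'
])
WHERE ST_Contains(geom, ST_Point(12.4924, 41.8902));
```

An alternative is to keep the original glob with read_parquet(..., filename = true) and add a filename IN (...) predicate to the WHERE clause, which DuckDB should also be able to use to skip files.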
This two-step process reduces the total query time from ~35 seconds to less than 2 seconds — a massive improvement of nearly 18x.
Feature Idea
It would be a fantastic feature if DuckDB's query planner could perform this optimization automatically: when it sees a spatial predicate like ST_Contains(geom, ...) on a Parquet dataset, it would perform this "metadata pruning" step itself.
It would inspect the geo key in the Parquet metadata, parse the bbox, and filter out any files whose bounding box doesn't contain the point of interest before scanning the actual geometry columns in those files.
This would make DuckDB's geospatial capabilities even more powerful and intuitive, essentially providing a built-in spatial index that leverages the metadata already present in the GeoParquet standard.
Thanks for your consideration and for all your hard work on this project!