Replies: 1 comment
-
This is great. I've also run into very similar situations, and being able to handle this out of the box would be cool. It would probably be best as an opt-in option, though, since the metadata extraction can add significant overhead with many partitions in queries that wouldn't benefit from it. @aborruso, I just wanted to mention an alternative strategy we've been using in cases where you have control over how the target data is partitioned: we use the DuckDB h3 extension to partition the GeoParquet by H3 cell (at an appropriate resolution). This also reduces the number of files we have to touch, but without needing to look at the metadata.
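For anyone curious, here is a minimal sketch of that partitioning step. It assumes the h3 community extension's h3_latlng_to_cell function and a polygon column named geom; the paths, resolution, and column names are illustrative, not from our actual setup:

```sql
INSTALL h3 FROM community;
LOAD h3;
INSTALL spatial;
LOAD spatial;

-- Rewrite the dataset hive-partitioned by H3 cell, keyed on each
-- parcel's centroid. Resolution 5 is just an example; pick one that
-- matches your data density.
COPY (
    SELECT *,
           h3_latlng_to_cell(ST_Y(ST_Centroid(geom)),
                             ST_X(ST_Centroid(geom)),
                             5) AS h3_cell
    FROM read_parquet('parcels/**/*.parquet')
) TO 'parcels_by_h3' (FORMAT PARQUET, PARTITION_BY (h3_cell));
```

At query time you compute the cell for your point with the same function and read only the matching partition.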
-
Hello,
First off, thank you for this incredible tool and extension. I'm using it for large-scale geospatial analysis with a partitioned GeoParquet dataset, and it's been a game-changer.
It's probably already possible, but I don't know how to do it.
The Scenario
I'm working with a large dataset of cadastral parcels for Italy, stored as a hive-partitioned GeoParquet dataset. A common task is to find which specific parcel contains a given point (longitude, latitude).
The "Naive" Query
My initial approach was a straightforward spatial query. With this approach DuckDB has to scan all the data, and on the full national dataset the query takes about 35 seconds.
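For reference, the query looked roughly like this (the path, the geom column name, and the coordinates are illustrative):

```sql
INSTALL spatial;
LOAD spatial;

-- Find the parcel containing a point of interest (lon, lat)
SELECT *
FROM read_parquet('parcels/**/*.parquet')
WHERE ST_Contains(geom, ST_Point(12.4924, 41.8902));
```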
The Manually Optimized Approach
I realized I could dramatically speed this up by manually pre-filtering the files using the geo metadata stored in each Parquet file, which includes that file's bounding box (bbox).
Step 1: Identify candidate files by querying metadata. This is extremely fast.
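A sketch of this step, assuming the standard GeoParquet layout where each file's geo key is JSON carrying a per-file bbox for a column named geometry (the path, column name, and coordinates are illustrative):

```sql
-- Pull the GeoParquet 'geo' key out of each file's key-value metadata
-- and keep only the files whose bbox contains the point of interest.
WITH geo_meta AS (
    SELECT file_name,
           json_extract(decode(value), '$.columns.geometry.bbox') AS bbox
    FROM parquet_kv_metadata('parcels/**/*.parquet')
    WHERE decode(key) = 'geo'
)
SELECT file_name
FROM geo_meta
WHERE CAST(json_extract_string(bbox, '$[0]') AS DOUBLE) <= 12.4924  -- xmin <= lon
  AND CAST(json_extract_string(bbox, '$[2]') AS DOUBLE) >= 12.4924  -- xmax >= lon
  AND CAST(json_extract_string(bbox, '$[1]') AS DOUBLE) <= 41.8902  -- ymin <= lat
  AND CAST(json_extract_string(bbox, '$[3]') AS DOUBLE) >= 41.8902; -- ymax >= lat
```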
Step 2: Run the spatial query only on the candidate files.
By restricting the main query to the handful of files from Step 1, it becomes incredibly fast.
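Concretely, something like this, where the file list stands in for the (hypothetical) output of Step 1:

```sql
SELECT *
FROM read_parquet([
    'parcels/part-0042.parquet',  -- candidate files from Step 1
    'parcels/part-0113.parquet'
])
WHERE ST_Contains(geom, ST_Point(12.4924, 41.8902));
```

An alternative is to keep the original glob with read_parquet(..., filename = true) and add a filename IN (...) predicate to the WHERE clause, which DuckDB should also be able to use to skip files.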
This two-step process reduces the total query time from ~35 seconds to less than 2 seconds — a massive improvement of nearly 18x.
Feature Idea
It would be a fantastic feature if DuckDB's query planner could perform this optimization automatically: when it sees a spatial predicate like ST_Contains(geom, ...) on a Parquet dataset, it would perform this "metadata pruning" step itself.
It would inspect the geo key in the Parquet metadata, parse the bbox, and filter out any files whose bounding box doesn't contain the point of interest before scanning the actual geometry columns in those files.
This would make DuckDB's geospatial capabilities even more powerful and intuitive, essentially providing a built-in spatial index that leverages the metadata already present in the GeoParquet standard.
Thanks for your consideration and for all your hard work on this project!