Spatial joins #113

Merged: 12 commits, Apr 23, 2024
30 changes: 27 additions & 3 deletions docs/src/tutorials/spatial_joins.md
@@ -1,10 +1,12 @@
# Spatial joins

Spatial joins are joins which are based not on equality, but on some predicate ``p(x, y)``, which takes two geometries, and returns a value of either `true` or `false`. For geometries, the [`DE-9IM`](https://en.wikipedia.org/wiki/DE-9IM) spatial relationship model is used to determine the spatial relationship between two geometries.
Spatial joins are [table joins](https://www.geeksforgeeks.org/sql-join-set-1-inner-left-right-and-full-joins/) which are based not on equality, but on some predicate ``p(x, y)``, which takes two geometries, and returns a value of either `true` or `false`. For geometries, the [`DE-9IM`](https://en.wikipedia.org/wiki/DE-9IM) spatial relationship model is used to determine the spatial relationship between two geometries.
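
To make the idea of a predicate-based join concrete, here is a minimal sketch in plain Julia. It does not use GeometryOps or FlexiJoins: `Point2`, `Box`, `within`, and `predicate_join` are hypothetical names invented for this illustration, and the nested loop is only the simplest possible strategy, not how any particular library implements joins.

```julia
# Hypothetical toy types: a point and an axis-aligned box.
struct Point2
    x::Float64
    y::Float64
end

struct Box
    xmin::Float64
    ymin::Float64
    xmax::Float64
    ymax::Float64
end

# Predicate p(x, y): is the point inside (or on the edge of) the box?
within(p::Point2, b::Box) = b.xmin <= p.x <= b.xmax && b.ymin <= p.y <= b.ymax

# Naive inner join: keep every (left, right) index pair for which the predicate holds.
# This evaluates the predicate length(left) * length(right) times.
function predicate_join(pred, left, right)
    return [(i, j) for i in eachindex(left) for j in eachindex(right) if pred(left[i], right[j])]
end

points = [Point2(0.5, 0.5), Point2(2.5, 0.5), Point2(5.0, 5.0)]
boxes = [Box(0.0, 0.0, 1.0, 1.0), Box(2.0, 0.0, 3.0, 1.0)]

# The third point lies in no box, so an inner join drops it.
matches = predicate_join(within, points, boxes)
```

Swapping `within` for any other two-argument function returning `true` or `false` changes the join condition without touching the join logic.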

Spatial joins can be done between any geometry types (from geometrycollections to points), just as geometrical predicates can be evaluated on any geometries.

In this tutorial, we will perform a spatial join first on a toy dataset and then on two Natural Earth datasets, to show how this can be used in the real world.

In order to perform the spatial join, we use [FlexiJoins.jl](https://github.com/JuliaAPlavin/FlexiJoins.jl) to perform the join, specifically using its `by_pred` joining method. This allows the user to specify a predicate in the following manner:
In order to perform the spatial join, we use **[FlexiJoins.jl](https://github.com/JuliaAPlavin/FlexiJoins.jl)** to perform the join, specifically using its `by_pred` joining method. This allows the user to specify a predicate in the following manner:
```julia
[inner/left/right/outer/...]join((table1, table2),
by_pred(:table1_column, predicate_function, :table2_column) # & add other conditions here
@@ -108,4 +110,26 @@ innerjoin((state_compact_df, view(country_df, 1:1, :)), by_pred(:geom, GO.withi
```

!!! warning
This is how you would do this, but it doesn't work yet, since the GeometryOps predicates are quite slow on large polygons. If you try this, the code will continue to run for a very, very long time (it took 12 hours on my laptop, but with minimal CPU usage).
This is how you would do this, but it doesn't work yet, since the GeometryOps predicates are quite slow on large polygons. If you try this, the code will continue to run for a very, very long time (it took 12 hours on my laptop, but with minimal CPU usage).

## Enabling custom predicates

In case you want to use a custom predicate, you only need to define a method to tell FlexiJoins how to use it.

For example, suppose you wanted to perform a spatial join on geometries which are within some distance of each other:

```julia
my_predicate_function = <(5) ∘ abs ∘ GO.distance
Contributor

If this is an actual distance, it's probably already supported by FlexiJoins :)

Member Author

I don't think this is supported by Distances.jl, and there are a bunch of other GO functions one might want to use :D - for example, testing whether centroids are close to each other, or something. So I figured a general approach would be best.

Just out of curiosity, is there a reason that NestedLoopFast isn't supported by default?

Contributor

> for example, testing whether centroids are close to each other

Isn't it just `by_distance(x -> centroid(x.geometry), Euclidean(), <=(3))`? Or whatever other distance you need instead of `Euclidean`.

Contributor

> I don't think this is supported by Distances.jl

Wonder why is that the case? Does the function break some distance properties?

Contributor @aplavin, Apr 19, 2024

> is there a reason that NestedLoopFast isn't supported by default?

I consider falling back to an n^2 join without the user explicitly requesting it to be a footgun.
NestedLoopFast is really intended for cheap filtering operations on top of the "main" join predicate. Such as NotSame in FlexiJoins itself.

Member Author

For centroids yes, but computing the distance between geometries (basically the distance between the closest line segments) is not a Distances.jl thing. The centroid comparison would be, though, and I should probably add an example of that syntax to the docs as well!

```
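
As a rough illustration of how a composed predicate like this behaves, the sketch below swaps in a hypothetical `toy_distance` (plain Euclidean distance between coordinate tuples) for `GO.distance`, so it runs without GeometryOps. The names `toy_distance` and `toy_predicate` are assumptions made for this example only.

```julia
# Stand-in for `GO.distance`: Euclidean distance between two coordinate tuples.
# (Assumption: used here only so the example runs without GeometryOps.)
toy_distance(a, b) = sqrt((a[1] - b[1])^2 + (a[2] - b[2])^2)

# `<(5)` is a partially applied comparison, equivalent to `d -> d < 5`.
# `∘` applies the rightmost function first, so both arguments go to `toy_distance`.
toy_predicate = <(5) ∘ abs ∘ toy_distance

far = toy_predicate((0.0, 0.0), (3.0, 4.0))   # distance is exactly 5.0, which is not < 5
near = toy_predicate((0.0, 0.0), (1.0, 1.0))  # distance is about 1.41, which is < 5
```

The result is an ordinary two-argument function returning a `Bool`, which is the shape a join predicate needs.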

You would need to define `FlexiJoins.supports_mode` on your predicate:

```julia{3}
FlexiJoins.supports_mode(
::FlexiJoins.Mode.NestedLoopFast,
::FlexiJoins.ByPred{typeof(my_predicate_function)},
datas
) = true
```

This enables FlexiJoins to use your custom function when it is passed to `by_pred(:geometry, my_predicate_function, :geometry)`.
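
The method above works by dispatching on the type of the predicate function. The following sketch shows that mechanism in isolation, using hypothetical stand-ins (`ToyByPred`, `toy_supports`, `close_enough`, `far_apart`) rather than FlexiJoins' own types, so it only illustrates the Julia dispatch pattern and makes no claim about FlexiJoins internals.

```julia
# Hypothetical stand-in for a join condition that wraps a predicate function.
struct ToyByPred{F}
    pred::F
end

# Default: a predicate is not supported unless a more specific method opts in.
toy_supports(::ToyByPred) = false

close_enough(a, b) = abs(a - b) < 5
far_apart(a, b) = abs(a - b) > 100

# Opt in for exactly one function by dispatching on its type.
toy_supports(::ToyByPred{typeof(close_enough)}) = true

supported = toy_supports(ToyByPred(close_enough))
unsupported = toy_supports(ToyByPred(far_apart))

# A composed predicate's type records its component functions, not the value 5,
# so these two compositions share one type.
same_type = typeof(<(5) ∘ abs) == typeof(<(10) ∘ abs)
```

One consequence worth keeping in mind: because `<(5)` and `<(10)` have the same type, a method written for `typeof(my_predicate_function)` would also match the same composition built with a different integer threshold.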