Skip to content

Spatial joins #113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 23, 2024
Merged
1 change: 1 addition & 0 deletions Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ CoordinateTransformations = "150eb455-5306-5404-9cee-2592286d6298"
GeoInterface = "cf35fbd7-0cd7-5166-be24-54bfbe79505f"
GeometryBasics = "5c1252a2-5f33-56bf-86c9-59e7332b4326"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
SortTileRecursiveTree = "746ee33f-1797-42c2-866d-db2fce69d14d"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

Expand Down
50 changes: 42 additions & 8 deletions docs/src/tutorials/spatial_joins.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,8 @@ In this tutorial, we will show how to perform a spatial join on first a toy data

In order to perform the spatial join, we use [FlexiJoins.jl](https://github.com/JuliaAPlavin/FlexiJoins.jl) to perform the join, specifically using its `by_pred` joining method. This allows the user to specify a predicate in the following manner:
```julia
[inner/left/outer/...]join((table1, table1),
by_pred(:table1_column, predicate_function, :table2_column)
[inner/left/right/outer/...]join((table1, table1),
by_pred(:table1_column, predicate_function, :table2_column) # & add other conditions here
)
```

Expand All @@ -17,6 +17,9 @@ We have enabled the use of all of GeometryOps' boolean comparisons here. These
GO.contains, GO.within, GO.intersects, GO.touches, GO.crosses, GO.disjoint, GO.overlaps, GO.covers, GO.coveredby, GO.equals
```

!!! tip
Always place the dataframe with more complex geometries second, as that is the one which will be sorted into a tree.

## Simple example

This example demonstrates how to perform a spatial join between two datasets: a set of polygons and a set of randomly generated points.
Expand Down Expand Up @@ -46,27 +49,28 @@ Here, the upper polygon is blue, and the lower polygon is red. Keep this in min
Now, we generate the points.

```@example spatialjoins
points = tuple.(rand(100), rand(100))
points = tuple.(rand(1000), rand(1000))
points_df = DataFrame(geometry = points)
scatter!(points_df.geometry)
f
```

You can see that they are evenly distributed around the box. But how do we know which points are in which polygons?

The answer here is to perform a spatial join.
We have to join the two dataframes based on which polygon (if any) each point lies within.

Now, we can perform the "spatial join" using FlexiJoins. We are performing an outer join here

```@example spatialjoins
joined_df = FlexiJoins.innerjoin(
(poly_df, points_df),
by_pred(:geometry, GO.contains, :geometry)
@time joined_df = FlexiJoins.innerjoin(
(points_df, poly_df),
by_pred(:geometry, GO.within, :geometry)
)
```

```@example spatialjoins
scatter(joined_df.geometry_1; color = joined_df.color)
scatter!(a, joined_df.geometry; color = joined_df.color)
f
```

Here, you can see that the colors were assigned appropriately to the scattered points!
Expand All @@ -75,3 +79,33 @@ Here, you can see that the colors were assigned appropriately to the scattered p

Suppose I have a list of polygons representing administrative regions (or mining sites, or what have you), and I have a list of polygons for each country. I want to find the country each region is in.

```julia real
import GeoInterface as GI, GeometryOps as GO
using FlexiJoins, DataFrames, GADM # GADM gives us country and sublevel geometry

using CairoMakie, GeoInterfaceMakie

country_df = GADM.get.(["JPN", "USA", "IND", "DEU", "FRA"]) |> DataFrame
country_df.geometry = GI.GeometryCollection.(GO.tuples.(country_df.geom))

state_doublets = [
("USA", "New York"),
("USA", "California"),
("IND", "Karnataka"),
("DEU", "Berlin"),
("FRA", "Grand Est"),
("JPN", "Tokyo"),
]

state_full_df = (x -> GADM.get(x...)).(state_doublets) |> DataFrame
state_full_df.geom = GO.tuples.(only.(state_full_df.geom))
state_compact_df = state_full_df[:, [:geom, :NAME_1]]
```

```julia real
innerjoin((state_compact_df, country_df), by_pred(:geom, GO.within, :geometry))
innerjoin((state_compact_df, view(country_df, 1:1, :)), by_pred(:geom, GO.within, :geometry))
```

!!! warning
This is how you would do this, but it doesn't work yet, since the GeometryOps predicates are quite slow on large polygons. If you try this, the code will continue to run for a very, very long time (it took 12 hours on my laptop, but with minimal CPU usage).
45 changes: 29 additions & 16 deletions ext/GeometryOpsFlexiJoinsExt/GeometryOpsFlexiJoinsExt.jl
Original file line number Diff line number Diff line change
@@ -1,28 +1,41 @@
module GeometryOpsFlexiJoinsExt

using GeometryOps
import GeometryOps as GO
using FlexiJoins

# This module defines the FlexiJoins APIs for GeometryOps' boolean comparison functions, taken from DE-9IM.
import GeometryOps as GO, GeoInterface as GI
using SortTileRecursiveTree, Tables

# For now, we only allow regular n^2 loops for the predicates, primarily because I'm not entirely sure how to do the "fast" mode for these.
# TODO: fix that and allow faster joins.

# First, we define that FlexiJoins supports the "NestedLoopFast" mode for the predicates.
# This module defines the FlexiJoins APIs for GeometryOps' boolean comparison functions, taken from DE-9IM.

FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.contains)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.within)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.intersects)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.disjoint)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.touches)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.crosses)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.overlaps)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.covers)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.coveredby)}, datas) = true
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{typeof(GO.equals)}, datas) = true
# First, we define the joining modes (Tree, NestedLoopFast) that the GO DE-9IM functions support.
const GO_DE9IM_FUNCS = Union{typeof(GO.contains), typeof(GO.within), typeof(GO.intersects), typeof(GO.disjoint), typeof(GO.touches), typeof(GO.crosses), typeof(GO.overlaps), typeof(GO.covers), typeof(GO.coveredby), typeof(GO.equals)}
# NestedLoopFast is the naive fallback method
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{F}, datas) where F <: GO_DE9IM_FUNCS = true
# This method allows you to cache a tree, which we do by using an STRtree.
# TODO: wrap GO predicate functions in a `TreeJoiner` struct or something, to indicate that we want to use trees,
# since they can be slower in some situations.
FlexiJoins.supports_mode(::FlexiJoins.Mode.Tree, ::FlexiJoins.ByPred{F}, datas) where F <: GO_DE9IM_FUNCS = true

# Nested loop support is simple, and needs no further support.
# However, for trees, we need to define how the tree is prepared and how it is used.
# This is done by defining the `prepare_for_join` function to return an STRTree,
# and by defining the `findmatchix` function as querying that tree before checking
# intersections.

# In theory, one could extract the tree from e.g a GeoPackage or some future GeoDataFrame.

FlexiJoins.prepare_for_join(::FlexiJoins.Mode.Tree, X, cond::FlexiJoins.ByPred{<: GO_DE9IM_FUNCS}) = (X, SortTileRecursiveTree.STRtree(cond.Rf(X)))
function FlexiJoins.findmatchix(::FlexiJoins.Mode.Tree, cond::FlexiJoins.ByPred{F}, ix_a, a, (B, tree)::Tuple, multi::typeof(identity)) where F <: GO_DE9IM_FUNCS
idxs = SortTileRecursiveTree.query(tree, a)
intersecting_idxs = filter(idxs) do idx
cond.pred(a, cond.Rf(B)[idx])
end
return intersecting_idxs
end

# Next, just in case, we define the `swap_sides` function for those predicates which are defined as inversions.
# Finally, for completeness, we define the `swap_sides` function for those predicates which are defined as inversions.

FlexiJoins.swap_sides(::typeof(GO.contains)) = GO.within
FlexiJoins.swap_sides(::typeof(GO.within)) = GO.contains
Expand Down