-
Notifications
You must be signed in to change notification settings - Fork 6
Spatial joins #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Spatial joins #113
Changes from 10 commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
a54c937
Hook spatial predicates up to FlexiJoins.jl by an extension
asinghvi17 a720e96
Add a tutorial on how to use spatial joins via FlexiJoin
asinghvi17 d37642a
Add tests
asinghvi17 c69208f
Allow `GO.equals` as a predicate
asinghvi17 9bfb769
Improve docs significantly
asinghvi17 be8b0e2
Fix example name
asinghvi17 a34f904
Hide theming without causing global effects
asinghvi17 15098bc
Clean up the implementation
asinghvi17 76e068e
Implement tree based joins for speed on larger datasets
asinghvi17 da6fa4c
Apply suggestions from code review
asinghvi17 9639b44
Add instructions on how to add custom predicates
asinghvi17 2cffb43
Merge branch 'main' into as/flexijoins
asinghvi17 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,111 @@ | ||
# Spatial joins | ||
|
||
Spatial joins are joins which are based not on equality, but on some predicate ``p(x, y)``, which takes two geometries, and returns a value of either `true` or `false`. For geometries, the [`DE-9IM`](https://en.wikipedia.org/wiki/DE-9IM) spatial relationship model is used to determine the spatial relationship between two geometries. | ||
|
||
In this tutorial, we will show how to perform a spatial join on first a toy dataset and then two Natural Earth datasets, to show how this can be used in the real world. | ||
|
||
In order to perform the spatial join, we use [FlexiJoins.jl](https://github.com/JuliaAPlavin/FlexiJoins.jl) to perform the join, specifically using its `by_pred` joining method. This allows the user to specify a predicate in the following manner: | ||
```julia | ||
[inner/left/right/outer/...]join((table1, table1), | ||
by_pred(:table1_column, predicate_function, :table2_column) # & add other conditions here | ||
) | ||
``` | ||
|
||
We have enabled the use of all of GeometryOps' boolean comparisons here. These are: | ||
|
||
```julia | ||
GO.contains, GO.within, GO.intersects, GO.touches, GO.crosses, GO.disjoint, GO.overlaps, GO.covers, GO.coveredby, GO.equals | ||
``` | ||
|
||
!!! tip | ||
Always place the dataframe with more complex geometries second, as that is the one which will be sorted into a tree. | ||
|
||
## Simple example | ||
|
||
This example demonstrates how to perform a spatial join between two datasets: a set of polygons and a set of randomly generated points. | ||
|
||
The polygons are represented as a DataFrame with geometries and colors, while the points are stored in a separate DataFrame. | ||
|
||
The spatial join is performed using the `contains` predicate from GeometryOps, which checks if each point is contained within any of the polygons. The resulting joined DataFrame is then used to plot the points, colored according to the containing polygon. | ||
|
||
First, we generate our data. We create two triangle polygons which, together, span the rectangle (0, 0, 1, 1), and a set of points which are randomly distributed within this rectangle. | ||
|
||
```@example spatialjoins | ||
import GeoInterface as GI, GeometryOps as GO | ||
using FlexiJoins, DataFrames | ||
|
||
using CairoMakie, GeoInterfaceMakie | ||
|
||
pl = GI.Polygon([GI.LinearRing([(0, 0), (1, 0), (1, 1), (0, 0)])]) | ||
pu = GI.Polygon([GI.LinearRing([(0, 0), (0, 1), (1, 1), (0, 0)])]) | ||
poly_df = DataFrame(geometry = [pl, pu], color = [:red, :blue]) | ||
f, a, p = Makie.with_theme(Attributes(; Axis = (; aspect = DataAspect()))) do # hide | ||
f, a, p = poly(poly_df.geometry; color = tuple.(poly_df.color, 0.3)) | ||
end # hide | ||
``` | ||
|
||
Here, the upper polygon is blue, and the lower polygon is red. Keep this in mind! | ||
|
||
Now, we generate the points. | ||
|
||
```@example spatialjoins | ||
points = tuple.(rand(1000), rand(1000)) | ||
points_df = DataFrame(geometry = points) | ||
scatter!(points_df.geometry) | ||
f | ||
``` | ||
|
||
You can see that they are evenly distributed around the box. But how do we know which points are in which polygons? | ||
|
||
We have to join the two dataframes based on which polygon (if any) each point lies within. | ||
|
||
Now, we can perform the "spatial join" using FlexiJoins. We are performing an outer join here | ||
|
||
```@example spatialjoins | ||
@time joined_df = FlexiJoins.innerjoin( | ||
(points_df, poly_df), | ||
by_pred(:geometry, GO.within, :geometry) | ||
) | ||
``` | ||
|
||
```@example spatialjoins | ||
scatter!(a, joined_df.geometry; color = joined_df.color) | ||
f | ||
``` | ||
|
||
Here, you can see that the colors were assigned appropriately to the scattered points! | ||
|
||
## Real-world example | ||
|
||
Suppose I have a list of polygons representing administrative regions (or mining sites, or what have you), and I have a list of polygons for each country. I want to find the country each region is in. | ||
|
||
```julia real | ||
import GeoInterface as GI, GeometryOps as GO | ||
using FlexiJoins, DataFrames, GADM # GADM gives us country and sublevel geometry | ||
|
||
using CairoMakie, GeoInterfaceMakie | ||
|
||
country_df = GADM.get.(["JPN", "USA", "IND", "DEU", "FRA"]) |> DataFrame | ||
country_df.geometry = GI.GeometryCollection.(GO.tuples.(country_df.geom)) | ||
|
||
state_doublets = [ | ||
("USA", "New York"), | ||
("USA", "California"), | ||
("IND", "Karnataka"), | ||
("DEU", "Berlin"), | ||
("FRA", "Grand Est"), | ||
("JPN", "Tokyo"), | ||
] | ||
|
||
state_full_df = (x -> GADM.get(x...)).(state_doublets) |> DataFrame | ||
state_full_df.geom = GO.tuples.(only.(state_full_df.geom)) | ||
state_compact_df = state_full_df[:, [:geom, :NAME_1]] | ||
``` | ||
|
||
```julia real | ||
innerjoin((state_compact_df, country_df), by_pred(:geom, GO.within, :geometry)) | ||
innerjoin((state_compact_df, view(country_df, 1:1, :)), by_pred(:geom, GO.within, :geometry)) | ||
``` | ||
|
||
!!! warning | ||
This is how you would do this, but it doesn't work yet, since the GeometryOps predicates are quite slow on large polygons. If you try this, the code will continue to run for a very, very long time (it took 12 hours on my laptop, but with minimal CPU usage). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
module GeometryOpsFlexiJoinsExt | ||
|
||
using GeometryOps | ||
using FlexiJoins | ||
|
||
import GeometryOps as GO, GeoInterface as GI | ||
using SortTileRecursiveTree, Tables | ||
|
||
|
||
# This module defines the FlexiJoins APIs for GeometryOps' boolean comparison functions, taken from DE-9IM. | ||
|
||
# First, we define the joining modes (Tree, NestedLoopFast) that the GO DE-9IM functions support. | ||
const GO_DE9IM_FUNCS = Union{typeof(GO.contains), typeof(GO.within), typeof(GO.intersects), typeof(GO.disjoint), typeof(GO.touches), typeof(GO.crosses), typeof(GO.overlaps), typeof(GO.covers), typeof(GO.coveredby), typeof(GO.equals)} | ||
# NestedLoopFast is the naive fallback method | ||
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{F}, datas) where F <: GO_DE9IM_FUNCS = true | ||
# This method allows you to cache a tree, which we do by using an STRtree. | ||
# TODO: wrap GO predicate functions in a `TreeJoiner` struct or something, to indicate that we want to use trees, | ||
# since they can be slower in some situations. | ||
FlexiJoins.supports_mode(::FlexiJoins.Mode.Tree, ::FlexiJoins.ByPred{F}, datas) where F <: GO_DE9IM_FUNCS = true | ||
|
||
# Nested loop support is simple, and needs no further support. | ||
# However, for trees, we need to define how the tree is prepared and how it is used. | ||
# This is done by defining the `prepare_for_join` function to return an STRTree, | ||
# and by defining the `findmatchix` function as querying that tree before checking | ||
# intersections. | ||
|
||
# In theory, one could extract the tree from e.g a GeoPackage or some future GeoDataFrame. | ||
|
||
FlexiJoins.prepare_for_join(::FlexiJoins.Mode.Tree, X, cond::FlexiJoins.ByPred{<: GO_DE9IM_FUNCS}) = (X, SortTileRecursiveTree.STRtree(map(cond.Rf, X))) | ||
function FlexiJoins.findmatchix(::FlexiJoins.Mode.Tree, cond::FlexiJoins.ByPred{F}, ix_a, a, (B, tree)::Tuple, multi::typeof(identity)) where F <: GO_DE9IM_FUNCS | ||
idxs = SortTileRecursiveTree.query(tree, cond.Lf(a)) | ||
intersecting_idxs = filter!(idxs) do idx | ||
cond.pred(a, cond.Rf(B[idx])) | ||
end | ||
return intersecting_idxs | ||
end | ||
|
||
# Finally, for completeness, we define the `swap_sides` function for those predicates which are defined as inversions. | ||
|
||
FlexiJoins.swap_sides(::typeof(GO.contains)) = GO.within | ||
FlexiJoins.swap_sides(::typeof(GO.within)) = GO.contains | ||
FlexiJoins.swap_sides(::typeof(GO.coveredby)) = GO.covers | ||
FlexiJoins.swap_sides(::typeof(GO.covers)) = GO.coveredby | ||
|
||
# That's a wrap, folks! | ||
|
||
end | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
import GeometryOps as GO, GeoInterface as GI | ||
using FlexiJoins, DataFrames | ||
|
||
pl = GI.Polygon([GI.LinearRing([(0, 0), (1, 0), (1, 1), (0, 0)])]) | ||
pu = GI.Polygon([GI.LinearRing([(0, 0), (0, 1), (1, 1), (0, 0)])]) | ||
poly_df = DataFrame(geometry = [pl, pu], color = [:red, :blue]) | ||
|
||
points = tuple.(rand(100), rand(100)) | ||
points_df = DataFrame(geometry = points) | ||
|
||
@testset "Basic integration" begin | ||
|
||
@test_nowarn joined_df = FlexiJoins.innerjoin((poly_df, points_df), by_pred(:geometry, GO.contains, :geometry)) | ||
# Test that the join happened correctly | ||
joined_df = FlexiJoins.innerjoin((poly_df, points_df), by_pred(:geometry, GO.contains, :geometry)) | ||
@test all(GO.contains.((pl,), joined_df.geometry_1[joined_df.color .== :red])) | ||
@test all(GO.contains.((pu,), joined_df.geometry_1[joined_df.color .== :blue])) | ||
# Test that within also works | ||
@test_nowarn joined_df = FlexiJoins.innerjoin((points_df, poly_df), by_pred(:geometry, GO.within, :geometry)) | ||
|
||
end | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to wrap this and avoid the
:geometry
:geometry
partLike this maybe:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you're saying here now - but until we resolve the DataFrames/GI.geometrycolumn/ArchGDAL mess, it's probably best to keep this explicit IMO :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it actually a mess? It mostly just works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case yes - GADM for example only outputs tables with
geom
columns, as do many other ArchGDAL-based loaders.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ArchGDAL tables are fine, GADM returns a NamedTuple. Theres an issue for that