-
Notifications
You must be signed in to change notification settings - Fork 1
Issue 256 - move camsl to geo_point table #263
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a high level review, focusing mostly on the protobuf and postgres migrations. I have added some comments around these, and will try to get to the golang code later.
A general question is how we test this, as there is no ingestion and/or API support to see if inserting and querying actually works (unless I am missing something).
Ideally, we would have the ingestion and API in the same PR, so that we can add some camsl
's to the test dataset, and confirm that everything works by testing that the data shows up in the API, preferable using the camsl
range query.
string processing_level = 9 [json_name = "processing_level"]; | ||
int64 quality_code = 10 [json_name = "quality_code"]; | ||
int64 camsl = 11; // centimeters above mean sea level | ||
optional int32 camsl = 3; // centimeters above mean sea level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we okay with re-ordering all the indices (also the fields below)?
It makes the messages incompatible (the docs say you should never do this):
https://protobuf.dev/programming-guides/proto3/#consequences
We do tend to change all the components at once, but in the EWC setup they are on separate machines, so there might still be some issues.
I think it might actually lead to storing garbage.
Why just not leave caml at 11, and the rest untouched? You can leave the order in the file as it is now, I don't think it is used for anything.
// '10', '1*', '*0', '*'. | ||
// | ||
map<string, Strings> filter = 5; | ||
map<string, Strings> filter = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid change in field number.
repeated string included_response_fields = 7; | ||
|
||
// repeated string excluded_response_fields = 7; // TODO | ||
// repeated string excluded_response_fields = 8; // TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid change in field number.
string camsl_range = 4; | ||
map<string, Strings> filter = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid change in field number.
@@ -1 +1,9 @@ | |||
ALTER TABLE observation ADD COLUMN camsl BIGINT; | |||
ALTER TABLE geo_point ADD COLUMN camsl INTEGER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I start on master
branch, load data, switch to your branch, and run the migrations (just up
), nothing happens. This is because you reused an existing migration file, which according to the DB table schema_migrations
was already applied.
Now we can just assume that production systems never ran the original version of the migration... but it is much safer to make a new one where you drop the column in observations
, and add a new on to geo_point
.
The only potential issue is then that any camsl
data that was already there is ignore... but I don't thing the ingest
was changed to put anything in, so this is fine.
ALTER TABLE geo_point ADD COLUMN camsl INTEGER; | ||
|
||
-- drop UNIQUE constraint of 'point' column | ||
-- WARNING: we assume that the constraint name is the correct one (it was never explicitly set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original table definition was this:
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
point GEOGRAPHY(Point, 4326) NOT NULL UNIQUE
);
CREATE INDEX geo_point_idx ON geo_point USING GIST(point);
This gave the following constraint and indieces:
After forcing the migration to your branch, I have the following constraint and indices:
This corresponds to the indices that you would get with the following table definition (without any migrations):
CREATE TABLE geo_point(
id BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
point GEOGRAPHY(Point, 4326),
camsl INTEGER,
CONSTRAINT geo_point_point_camsl_key UNIQUE NULLS NOT DISTINCT (point, camsl)
);
CREATE INDEX geo_point_idx ON geo_point USING GIST(point);
So I guess the question is, do we want to have a GIST
index that includes camsl
? This depends on the query, which I haven't looked at.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The upserting 2 points
example should also change, correct?
lat float64 | ||
lon float64 | ||
lat float64 | ||
camsl *int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a pointer? So we can use nil
for not set/NULL
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to also fix just load
. Gives concurrency errors...
SELECT c.id, point, camsl FROM input_rows | ||
JOIN geo_point c USING (point, camsl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT c.id, point, camsl FROM input_rows | |
JOIN geo_point c USING (point, camsl) | |
SELECT c.id, c.point, c.camsl FROM input_rows i | |
JOIN geo_point c ON i.camsl IS NOT DISTINCT FROM c.camsl AND i.point=c.point; -- Workaround for JOIN on NULL's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this fixes the "concurrency" issue on the point table during data load, and the integration test.
Fixes Issue 256.