Sharing 1.5 - Single Whitelist Persistence #351

jbee · 2024-10-29T10:34:05Z

jbee
Oct 29, 2024
Collaborator

Motivation

The current sharing system is multi-dimensional, making it unnecessarily hard to understand and work with.
This impacts both complexity and performance.
However, for practical reasons it is hard to switch to a fundamentally different system.
Ideally, the user experience and interaction with sharing should not change, to avoid the costs of transitioning.

Practical Example

With the current Sharing model there are multiple fields to check.
Two are maps of UID => access pattern.
A SQL query that evaluates this has multiple parts because there are multiple fields to check.
For the maps, each group a user is a member of adds another expression to the query, making the SQL long and complex, and thus, it is fair to assume, costly in terms of performance.

A filter looks something like this

{sharing.external matches X} 
OR {sharing.public matches Y} 
OR {sharing.user matches Z} 
OR {sharing.group matches G1} 
OR {sharing.group matches G2}
...
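
To make the growth concrete, here is a minimal sketch that assembles the pseudo-filter above for one user. The class and method names are hypothetical and the clause syntax is copied from the example, not from any real query builder; the point is only that the number of OR branches is 3 plus the number of groups the user is in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: builds the pseudo-filter from the example above.
final class CurrentSharingFilter {

    /** One clause per fixed field, plus one clause per user group. */
    static List<String> clauses(String userUid, List<String> groupUids) {
        List<String> clauses = new ArrayList<>();
        clauses.add("{sharing.external matches X}");
        clauses.add("{sharing.public matches Y}");
        clauses.add("{sharing.user matches " + userUid + "}");
        for (String groupUid : groupUids) {
            clauses.add("{sharing.group matches " + groupUid + "}");
        }
        return clauses;
    }

    /** Joins the clauses with OR, one per line, as in the example. */
    static String filter(String userUid, List<String> groupUids) {
        return String.join("\nOR ", clauses(userUid, groupUids));
    }
}
```

A user in 50 groups would end up with 53 branches under this scheme.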

Proposal

This proposal only changes how the sharing information is stored and processed. It should still be possible to reconstruct the current API layout from the new structure.

The sharing is a whitelist of UIDs and special tokens for read and write.
Empty sets are omitted.

{
 "r": ["{uid1}", "{uid2}", ...], 
 "w": ["{uid1}", "{uid2}", ...]
}

In Java

import java.util.Set;

class Sharing {
    Set<String> r; // UIDs and tokens with read access
    Set<String> w; // UIDs and tokens with write access
}

The proposal is also to drop the data vs metadata distinction and to always imply both.
If this should be maintained more sets could be used or UIDs can be extended with a prefix character (see tokens).

Lookup

For a lookup it is always known upfront which of the 2 lists (or 4, if the data vs metadata distinction is kept) to check.
So the check itself is always a check for set intersection.
The user's set of associated UIDs and tokens is checked against the whitelist of the sharing.
If at least one element is contained in both, access is allowed.
This way the SQL needed to perform a sharing check is a single "X intersects Y" expression, where X and Y are JSON arrays of strings.
For an in-memory check this is equally a Set<String> containsAny check.

A filter would always look like this

{sharing.r intersectsWith [UID1, UID2, ...]}
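
As a minimal sketch of the in-memory variant, the containsAny check can be written with the standard `Collections.disjoint`. The class and method names are hypothetical; the null handling assumes that an omitted (empty) set is represented as `null`, which the proposal does not specify.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical helper: the lookup is a plain set intersection test.
final class SharingLookup {

    /**
     * Returns true if at least one of the user's UIDs/tokens
     * is contained in the given whitelist.
     */
    static boolean containsAny(Set<String> whitelist, Set<String> userTokens) {
        // Assumption: an omitted (empty) whitelist is represented as null.
        if (whitelist == null) {
            return false;
        }
        // disjoint == no common element, so access is allowed when NOT disjoint.
        return !Collections.disjoint(whitelist, userTokens);
    }
}
```

The caller picks the list up front, for example `containsAny(sharing.r, userTokens)` for a read check.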

Tokens

In the whitelist sets, UIDs of users and user groups would be mixed.
The set would also allow non-UID tokens with special meaning. For example, to allow public access, instead of putting a UID in the set, the token "p" could be added to symbolize that public access is allowed.

UIDs could also be prefixed with a token to mark them as read/write/dataread/datawrite instead of having multiple lists. E.g. r{uid} to give metadata read access to the UID.
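
A minimal sketch of both ideas, under assumptions: `lookupSet` builds the user's side of the intersection check and always includes the "p" token, so public sharing needs no extra condition; `prefixed` shows the r{uid} style. The class and method names are hypothetical, and how the public token would combine with access prefixes is not specified in the proposal, so that case is left out.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for building the user's token set.
final class SharingTokens {

    /** Special non-UID token from the proposal: public access. */
    static final String PUBLIC = "p";

    /** The user's own UID, the UIDs of the user's groups, and the public token. */
    static Set<String> lookupSet(String userUid, Collection<String> groupUids) {
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(userUid);
        tokens.addAll(groupUids);
        tokens.add(PUBLIC); // every user matches a whitelist that contains "p"
        return tokens;
    }

    /** Prefixes each UID with an access character, e.g. 'r' + uid. */
    static Set<String> prefixed(char access, Collection<String> uids) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String uid : uids) {
            tokens.add(access + uid);
        }
        return tokens;
    }
}
```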

Performance

From a performance standpoint it makes most sense to use an actual JSON array as the sole structure as that can be indexed in postgres AFAIK.

In such a form it is clear that tokens need to be used, e.g. r{uid} (user/group can read), w{uid} (user/group can write), etc.
As JSON:

[token1, token2]
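
As a rough sketch of what the single-array lookup could look like in PostgreSQL, the snippet below builds a filter using the jsonb `?|` operator (true if any of the given strings is an element of the array), which a GIN index on the column can support. The table name, column name and index name are hypothetical, and the snippet only produces SQL text; it has not been benchmarked.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical SQL text builder; assumes a jsonb column "sharing"
// holding a flat array such as ["ruid1", "wuid2"].
final class SharingSql {

    /** Assumed index definition; table and index names are made up. */
    static final String INDEX_DDL =
        "CREATE INDEX idx_sharing ON some_table USING gin (sharing)";

    /** Builds: sharing ?| array['t1', 't2', ...] */
    static String anyTokenFilter(List<String> tokens) {
        if (tokens.isEmpty()) {
            return "false"; // nothing to match, and array[] needs a cast
        }
        return tokens.stream()
            .map(SharingSql::quote)
            .collect(Collectors.joining(", ", "sharing ?| array[", "]"));
    }

    /** Only allows alphanumeric tokens so the text can be inlined safely. */
    private static String quote(String token) {
        if (!token.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("unexpected token: " + token);
        }
        return "'" + token + "'";
    }
}
```

In real code the token list would more likely be bound as a single array parameter instead of being inlined, which also keeps the query text short for users with many groups.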

Open Issues

The issues are not related to the representation but to the sharing concept itself. These are good to keep in mind when changing the design to maybe solve some of them in the process.

ATM a user can see information that should not be accessible according to sharing, as long as the user can start from another root object which references the object. If that is the case, the user can use fields=ref[a,b,c] to see a,b,c of the reference from the other root.
Does data sharing always imply metadata sharing? If not, this is another loophole that is hard to get right. For example, when filtering metadata objects using sharing, users with data read access could accidentally get access to metadata that they do not have metadata read access to. So data always implying metadata by definition would greatly help to avoid falling into that trap. If this semantic is chosen, the split into data and metadata read/write might better be represented as access levels that build upon each other on a shared axis. Again, that would also help to represent them in a single list which represents that axis.
Applying sharing consistently is hard as there are many code paths that read (e.g. single object vs list views).

stian-sandvold · 2025-02-12T08:36:21Z

stian-sandvold
Feb 12, 2025
Maintainer

Thanks for exploring this, @jbee. I'm adding my comments below:

For the maps with each group a user is a member of the query gets another expression making the SQL long, complex and thus it is fair to assume costly in terms of performance.

I am not sure this is correct. I would assume a simple join between a user and its user groups would be enough to simplify the query if that is the case. Regardless, I do see the point about the overall complexity.

The sharing is a whitelist of UIDs and special tokens for read and write.

A couple of comments:
I think using special tokens would increase the complexity of this solution, and I much prefer separate columns for the type of access. Using the token would make the value non-atomic, and I feel that, in that case, indexing the values would lock us in to always needing to specify the token, even when we might not care about it. It also means that if we want to join on these values at some point, we need to perform expensive string operations to remove the token.

Regardless of the previous comment, we now end up with data duplication. For each level of access a user or user group has, it will have either 1 (r), 2 (rw/w), 3 (dr) or 4 (dw) representations, one for each type of access, provided we assume a level of access implies the lower levels of access. An alternative solution here could be that each UID is only present at its highest level of access, and the query then checks all 4 columns, exiting at the first hit.

With regard to both comments above, I think storing all the access as jsonb in the form of tuples ({uid: , access: }) would allow us to check for the presence of any UIDs without regard to access, while at the same time being able to get all access of a specific type and avoiding duplicate UIDs. In this case I do think we would need an access format that can denote all levels in one string, like we have today.
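
A minimal in-memory sketch of the tuple idea, to show the two kinds of query it is meant to support. All names are hypothetical, and `access` is treated as an opaque string because the exact format is left open above.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical model of the ({uid, access}) tuple representation.
final class TupleSharing {

    /** One entry per UID; access is an opaque string encoding all levels. */
    static final class Grant {
        final String uid;
        final String access;

        Grant(String uid, String access) {
            this.uid = uid;
            this.access = access;
        }
    }

    /** Presence check that ignores the access value entirely. */
    static boolean hasAnyAccess(List<Grant> grants, Set<String> userUids) {
        return grants.stream().anyMatch(g -> userUids.contains(g.uid));
    }

    /** All UIDs that carry exactly the given access value. */
    static Set<String> uidsWithAccess(List<Grant> grants, String access) {
        return grants.stream()
            .filter(g -> g.access.equals(access))
            .map(g -> g.uid)
            .collect(Collectors.toSet());
    }
}
```

Each UID appears once, so there is no duplication across access levels, at the cost of the lookup no longer being a plain array intersection.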

The hardest problem I foresee, is how we deal with updates - When a user group is deleted, when a user is deleted, etc. Would this be something we could easily switch into today, or do we need a new process for handling this?

Finally, how do we represent the special access "owner"? Do we keep it at the top level like today, or will this also be moved into the jsonb?

The proposal is also to drop the data vs metadata distinction and to always imply both.

No, we cannot do that. Being able to write data does not imply you can change the metadata. And reading metadata definitely does not mean you can read data. This is a much bigger concern in tracker though, since it's individual-level data. There are also convoluted sharing setups in tracker to allow partial access to data (like a hospital being able to create an event for a lab test, and a lab being able to fill out the event without accessing the patient information; campaigns also have a convoluted setup, but I can't describe that as easily).

If there is at least 1 contained in both access is allowed.

I'm not sure if I follow, I assume you mean in your example if it's contained in either of the columns, access is allowed (In the case of checking READ access and UID is in either READ or WRITE column?)

A filter would always look like this
{sharing.r intersectsWith [UID1, UID2, ...]}

How would this look in extreme cases? I assume you might easily hit the max length supported for a query if you add a significant list of usergroup uids. This also does not handle the case of owner or public, so you would still be left with 2 more conditions?

In the whitelists sets UIDs of users and user groups would be mixed.

Thinking about the sharing dialogue in the UI: When I want to show the user who currently has what access, how do I get that information? With this setup of mixed UIDs, there is no way to identify which is a user and which is a usergroup, without manually checking each one?

The set would also allow non-UID tokens with special meaning. For example, a token to allow public access, instead of putting an UID in the set "p" could be added to symbolize that public access is allowed.

That is fine, but again, mixing different types of values smells like trouble. The alternative I suggested earlier might work fine here as well, by adding a new field "type" which can be "user", "group" or "public" (or something else).

From a performance standpoint it makes most sense to use an actual JSON array as the sole structure as that can be indexed in postgres AFAIK.

I agree, and without testing, I would expect it to be pretty good for most of our use cases, where we know either the uid or both the uid and the access. I'm not sure if that requires 1 or 2 indexes though. I am not entirely sure how jsonb gets indexed.

The r{uid} idea, though, means we would need string operations to extract the uid, and string operations are incredibly expensive in postgres.

ATM a user can see information that should not be accessible according to sharing as long as the user can use another root object to start from which references the object. If that is the case the user can use fields=ref[a,b,c] to see a,b,c of the reference from another root.

This is definitely a problem we want to address, but I am not sure this is the right side to start from. I think we need to look into the metadata filtering first. It might be a good idea to identify the performance requirements there before we decide on an approach here.

does data sharing always imply metadata sharing? if not this is another loophole that is hard to get right. for example when filtering metadata objects using sharing users with data read access could accidentally get access to metadata that they do not have metadata read access to. so data always implying metadata by definition would greatly help in falling into that trap. if this semantic is chosen the split into data and metadata read/write might better be represented as access levels that build upon each other but that all are on a shared axis. Again that would also help to represent them in a single list which represents such axis.

Data sharing (data read or data write) implies metadata read access, i.e. users need to see the metadata and know of its existence to be able to write data to it. But data read or data write does not mean they can change the metadata.

Overall, I like the idea, and I think we should continue exploring, but I think there are a few things about the proposed solution we need to figure out first!
