Skip to content

Adds ability to search keys and their corresponding values #309

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: staging
Choose a base branch
from

Conversation

duckduckgrayduck
Copy link
Contributor

Limited to keys/values that appear on documents viewable to the user.
Endpoints are:
/api/documents/data/keys/
/api/documents/data/keys/{key}/values/
Chose this demonstrate that we are searching across several documents' data fields. Can change this.

Example of all keys visible to this user:
AllKeysVisibletoUser

Example of all values for a particular key (color) for this user:
AllValuesforKeyColor

Can restrain further by providing a project ID.
FilterKeysbyProject

Similarly with values
AllValuesforKeyColorFilteredByProject

I had concerns about user provided key names so I used a regular expression to validate that the input is an alphanumeric.

@duckduckgrayduck
Copy link
Contributor Author

Once tested and merged in master would close #270

@duckduckgrayduck
Copy link
Contributor Author

I have added the suggestions made, and also added the ability to filter results by partial key or value name
PartialMatchKeyName
PartialMatchValueName

@allanlasser
Copy link
Member

allanlasser commented May 8, 2025

Two quick things:

  1. Just flagging that this PR is based against staging, not master. Careful on merging.
  2. I like the URL structure and using the key as a part of the path. But I'm wondering if we need the static key and values parts of the path. Are there other possible subpaths under data or {key}? If not, I think documents/data/{key} could simplify the URL pattern:
    • documents/data/ for all keys
    • documents/data/?project=x for all project keys
    • documents/data/?key_name=y for filtered keys
    • documents/data/{key}/ for all values for a specific key
    • documents/data/{key}/?value_name=z for filtered values

@duckduckgrayduck
Copy link
Contributor Author

I've set to merge this onto staging to test performance before opening another PR to merge into master.

kv_search should be based on master. When I do:
s@s:/dev/documentcloud$ git merge-base kv_search master
4bfd7c5
s@s:
/dev/documentcloud$ git show 4bfd7c5
commit 4bfd7c5 (master)
Merge: 3610526 5c06aad
Author: Sanjin [email protected]
Date: Thu May 1 17:53:12 2025 +0000

Merge pull request #304 from MuckRock/OwnerDocumentation

Fix documentation for updates and partial updates

That's the last common ancestor, which is a PR onto master that was the last point I had interacted with before working on this branch. It's just behind master a bit.

@duckduckgrayduck
Copy link
Contributor Author

duckduckgrayduck commented May 8, 2025

Regarding url paths we have a bit of a mixed bag here.
/api/documents/{document_pk}/data/ shows not the keys, but the full key/value pair combinations on a single document.
Example: https://api.www.documentcloud.org/api/documents/25921102/data/

To retrieve the values of the cid key on this single document it then goes
/api/documents/{document_pk}/data/key_name/, which would follow your suggestion.
https://api.www.documentcloud.org/api/documents/25921102/data/cid/

If we stayed consistent it would logically follow then that /api/documents/data/ then should show all of the full key/value pair combinations across all documents visible to the user, but that isn't what we want with this, but can be added.

If we want to stay consistent we can either add an endpoint for a single document /api/documents/data/keys/ that shows us only the keys present on a single document, or make it so that /document/{document_pk}/data/ works by returning only keys. That would be a backwards incompatible change.

@allanlasser
Copy link
Member

allanlasser commented May 8, 2025

Thanks for the clarification!

Regarding url paths we have a bit of a mixed bag here. /api/documents/{document_pk}/data/ shows not the keys, but the full key/value pair combinations on a single document. Example: https://api.www.documentcloud.org/api/documents/25921102/data/

If we stayed consistent it would logically follow then that /api/documents/data/ then should show all of the full key/value pair combinations across all documents visible to the user, but that isn't what we want with this, but can be added.

The question for me here is the performance cost of returning all the values for all the documents, but if it's not high then I really like that as an enhancement of what currently exists—it's consistent with the other behavior. Then, it's up to me as a developer if I just want to read all the keys, or all the keys and their values.

If there's a high performance cost, then I agree with your approach: placing the list of just keys under the keys subpath feels correct.

I'm focused on performance because returning suggestions for search and autocomplete is a big user case, and we'd want responses as fast as possible.

One middle ground to consider would be a "keys-only" option on the list to tune performance based on the request's needs.

To retrieve the values of the cid key on this single document it then goes /api/documents/{document_pk}/data/key_name/, which would follow your suggestion. https://api.www.documentcloud.org/api/documents/25921102/data/cid/

I like this consistency between the API for keys between single documents and all documents.

If we want to stay consistent we can either add an endpoint for a single document /api/documents/data/keys/ that shows us only the keys present on a single document, or make it so that /document/{document_pk}/data/ works by returning only keys. That would be a backwards incompatible change.

Given my question about performance, I think my order of preference for consistency would be:

  1. Returning all values for all keys for /documents/data/ if the performance cost is low.
  2. Adding the api/documents/{id}/data/keys/ endpoint for single documents if the performance cost for returning values is high.

@duckduckgrayduck
Copy link
Contributor Author

I will save this PR for approach 1. I will open a 2nd PR that encapsulates this different approach to see if it is performant.
The second approach will be:

  • /documents/data/ should return all key/value pairs for all documents similar to the single document endpoint
  • /documents/data/{key_name}/ will return all of the values for a particular key name.

There will be no endpoint specifically for keys, rather that will be extracted from /documents/data/ in post-processing

@allanlasser allanlasser mentioned this pull request May 15, 2025
@allanlasser allanlasser linked an issue May 16, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot search for KVs or tags
3 participants