Skip to content

Conversation

AAraKKe
Copy link
Contributor

@AAraKKe AAraKKe commented Sep 10, 2025

What does this PR do?

Original implementation

The implementation outlined here is the original idea when the PR was opened. After review, the implementation has changed quite a bit.

This PR adds support for customizable cache keys to be used with the agent persistent cache. This is implemented in the following way:

  • A CacheKey abstract class that can be extended to implement any cache invalidation logic needed.
  • The AgentCheck has been modified to rely on a CacheKey implementation to provide the key for the persistent cache internally. Using check_id is now the default behavior through the FullConfigCacheKey implementation but can be overriden when necessary.
  • The logs_persistent_cache_key can be overriden by a developer that wants to modify what is the CacheKey used to handle the logs cursor.
  • We are still maintaining some level of separation between checks by adding the check name to the key. It is not the full check_id but, if the developer creates a CacheKey that is to simple, the impact of key clashing is reduced to the same kind of check.
  • Added 2 implementations of the CacheKey interface that work as an example and to provide funcionality:
    • The FullConfigCacheKey is equivalent to the previous behavior. It relies on the check_id that includes a digest of the full configuration of the check. This default behavior makes this change backward compatible with the usage of send_log and get_log_cursor, both of them using persistent cache.
    • The ConfigSetCacheKey implementation invalidates the cache only when a subset of configuration options have been updated. This provides a similar level of separation between different checks but allowing some configuration options to not invalidate the cache.
  • The CacheKeyManager allows multiple cache keys to coexist in the AgentCheck for differnet purposes. At the moment we only use the CacheKey internally to manage the log cursor. However, if we want to include something else we can register a new type of CacheKeyType enum and maintain the new one in the manager without having to modify how the AgentCheck behaves.

Note: the manager forces to have each cache key represented by an enum to avoid relying on simple string keys per cache key. While the intention is to allow to personalize the cache invalidation mechanism, we do not want potentially arbitrary cache keys being register which could trigger cache misses if the string key is mistakenly generated.

This PR adds support for customizing the persistent cache prefix used when caching any data in the AgentCheck. This is done by allowing the developer to override the persistent_cache_id method. This method returns the ID that identifies the specific check instance in the Agent persistent cache.

When a given check returns the same persistent cache ID, the cache keys generated will be stable. By default, the persistent cache ID is the digest of the full check configuration. Any update in the configuration will generate a different ID, causing the cache to be invalidated. This is the current behavior before introducing this feature, which ensures backward compatibility.

Motivation

Until now, integrations that wanted to emit logs and be consistent between agent restarts would have logs missing or duplicated if the user changed their configuration. This would invalidate the cache and there was limited control over the last cursor.

The intention of this change is to provide a backward compatible solution that allows developers to modify the cache invalidation behavior in an easy way.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@AAraKKe AAraKKe added the qa/skip-qa Automatically skip this PR for the next QA label Sep 10, 2025
Copy link

github-actions bot commented Sep 10, 2025

⚠️ The qa/skip-qa label has been added with shippable changes

The following files, which will be shipped with the agent, were modified in this PR and
the qa/skip-qa label has been added.

You can ignore this if you are sure the changes in this PR do not require QA. Otherwise, consider removing the label.

List of modified files that will be shipped with the agent
datadog_checks_base/datadog_checks/base/utils/persistent_cache.py
datadog_checks_base/changelog.d/21316.added
datadog_checks_base/datadog_checks/base/checks/base.py

@AAraKKe AAraKKe marked this pull request as ready for review September 10, 2025 16:54
@AAraKKe AAraKKe requested a review from a team as a code owner September 10, 2025 16:54
@AAraKKe AAraKKe changed the title Add support for customizable cache keys to store log cursor Add support for customizable cache keys for persistent cache Sep 10, 2025
Copy link

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 98.44961% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.04%. Comparing base (045fe96) to head (91604c3).
⚠️ Report is 28 commits behind head on master.

Additional details and impacted files
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job trying to satisfy customer needs! I left a few comments. Thanks for pinging me on this one 🙏

Overall, I'm not convinced that the concept of cache key types is warranted. Rather, you could take a much simpler approach by defining a method that returns the expected persistent cache key prefix (defaulting to the check ID) which could be overridden. The return value would then be cached via a check initialization scheme or otherwise, and that static prefix would be used ubiquitously.

@AAraKKe AAraKKe changed the title Add support for customizable cache keys for persistent cache [AI-6101] Add support for customizable cache keys for persistent cache Sep 11, 2025
Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more things, I'll take a look again tomorrow!

the value to store
"""
datadog_agent.write_persistent_cache(self._persistent_cache_id(key), value)
datadog_agent.write_persistent_cache(key, value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a breaking change.

Copy link
Contributor

@dkirov-dd dkirov-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change overall! 🚀
I only have a couple of nit picks.

Copy link
Contributor

@dkirov-dd dkirov-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In regards to the key/base_key question, here is a suggestion to enforce the overriding of base_key instead of key.
My understanding is that the key method should never be overridden, therefore the final decorator can be used.

Another suggestion: rename base_key to key_suffix to be more explicit about the role of the method in the cache key construction.
IIUC, key() already implements the cache invalidation between different check IDs. Then the key_suffix allows for more granular control over the invalidation logic.

Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my final recommendation. It will reduce the diff by 80% and incorporates @dkirov-dd's idea about the more semantically correct and intuitive key_suffix naming (although I flip it).

Let's create a new method on the base class which customers may override:

def get_persistent_cache_id(self) -> str:
    """
    This ID is used to form the final prefix of all persistent cache key lookups.
    """
    # The last component of the check ID is the hash of the instance configuration
    return self.check_id.split(':')[-1]

There will be another method that becomes a check initialization which will set a private property:

# __init__
    self.__persistent_cache_key_prefix: str | None = None

def __set_persistent_cache_key_prefix(self) -> None:
    # Remove the instance hash from the check ID
    namespace = ':'.join(self.check_id.split(':')[:-1])
    self.__persistent_cache_key_prefix = f'{namespace}:{self.get_persistent_cache_id()}_'

Finally, change the API calls:

    def read_persistent_cache(self, key: str) -> str:
        return datadog_agent.read_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}')

    def write_persistent_cache(self, key: str, value: str) -> str:
        return datadog_agent.write_persistent_cache(f'{self.__persistent_cache_key_prefix}{key}', value)

The benefits of this approach are that:

  1. There is no longer any overhead for accessing the cache with the proper key prefix. All computation happens once when the integration starts.
  2. Customers retain complete control without having to create a dedicated class just to return a string. They now have 2 options:
    1. We can create a simple utility for them to use within the overridden method.
    2. They can override the method to return an empty string and rely exclusively on the name option of each instance.

@AAraKKe
Copy link
Contributor Author

AAraKKe commented Sep 15, 2025

@ofek , answering to your suggestion. I understand that your point is to remove the the need of a class, the rest is essentially the same except for when the initialization happens (maybe?).

I just realized we have a way of running delayed initialization in the check through the check_initializations deque. I added the initialization there to avoid having to check for initialization every time we submit a log. The only difference is that the entry point for that is the run method. This means that the check needs to be launched for the cache to be initialized. I have checked in all the integrations we have and none of them writes or read from the cache before the run method. I also don't think there is a way for this to happen since the entry point of the check is the run method.

About not having a class, I disagree here. The intention of having a class is to allow reusable cache invalidation mechanisms. Partners often have the same way they want to invalidate log cursors. Having a way of encapsulating that logic in a reusable isolated way would be a great benefit for them. The option of not having a class means that the check itself would be littered with logic for cache invalidation logic. It would force developers to create their own custom version of the check if they want to reuse it when the only specialization they want is the logic for cache invalidation. I really think that encapsulating the logic into a class is a better choice here.

The overhead we are talking about here is:

  • Create the instance of the class, once per check initialization.
  • Call the method in the class to create the string vs just do a string concatenation.

It is true that there is an overhead, but string concatenation is orders of magnitude below the typical latency of check operations to get the logs/metrics which is mostly IO bound.

In any case, even though the overhead here when compared with the operations a check normally does is pretty small I appreciate it still exists. I have reshaped the implementation to have the best of both worlds:

  • Rename the approach from a solution to derive cache keys, to a solution to design cache invalidation strategies.
  • The instance of the class that encapsulates the logic is created only once, on initialization.
  • The invalidation strategy has a invalidation_token method that can be overridden to define when the cache will be invalidated.
  • The key_prefix method generates the key prefix based on the invalidation strategy.
  • The key_prefix is what is cached on initialization and we avoid the overhead of calling a method on the class to generate the cache key every time.

I think this addresses your concern about performance while at the same time allowing encapsulating different cache invalidation strategies on their own context.

Let me know what you think.

Note: I extracted the check_initializations calls to its own method which cleans up the run method a bit and allows calling it from methods that require the cache to be initialized.

@ofek
Copy link
Contributor

ofek commented Sep 15, 2025

I no longer have time to devote to reviewing this but please consider the core benefit of my final recommendation which you didn't seem to mention. The key prefix is purely an implementation detail and shouldn't be exposed to users. The novel approach in my last comment was for the customer to define an ID and the prefix is handled automatically.

In my suggestion there is no need for separate classes, the concept of cache invalidation, nor does the customer even need to spend minutes reading docs and grokking the details. We can just provide them with a method that does the hash like you mentioned and if they have custom logic it can really be a single very simple function that computes an ID, whatever that means within the context of the integration. They also would be able to simplify their approach and just rely on instance's name key.

edit: Put more plainly, I think this will be very confusing for customers because I'm intimately familiar with how to write integrations and even I cannot articulate how to use the current approach without re-reading the code in the files tab 😅

@AAraKKe
Copy link
Contributor Author

AAraKKe commented Sep 15, 2025

I really cannot see how having the logic encapsulated in classes is more complex than the option you propose. I see clear benefits in this approach but I also see appreciate that the entry point for a developer to define custom invalidation mechanisms for the cache is easier with your approach.

I also have the impression that if a developer is considering modifying the default behavior of the consistent cache, they are already exposed to the inner workings of the persistent caching mechanism. However, I see a great benefit on your approach and that is that the API is more flexible. With an API like the one you propose, a developer can decide to encapsulate, or not, any logic as complex as it can be and then just return the ID that identifies the particular check instance in the persistent cache.

I have updated the PR and provide the config set method as part of the utils.perisistent_cache module. If a developer want to create a cache ID based on a subset of configuration options they can still do it.

Thank you both for the time reviewing the PR! I really appreciate it.

Copy link
Contributor

@ofek ofek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now, just final suggestion. Thanks again!

@AAraKKe AAraKKe enabled auto-merge September 16, 2025 07:48
@AAraKKe AAraKKe disabled auto-merge September 16, 2025 07:48
@AAraKKe AAraKKe merged commit fc7560f into master Sep 16, 2025
304 of 307 checks passed
@AAraKKe AAraKKe deleted the aarakke/add-custom-stable-cache-key branch September 16, 2025 07:49
datadog-agent-integrations-bot bot pushed a commit that referenced this pull request Sep 16, 2025
#21316)

* Add support for customizable cache keys to store log cursor

* Add changelog

* Remove PersistentCacheKey lefotver from AgentCheck

* Fix docstring mistakes

* Reduce cache key scope to checks with the same name

* Add test to cover restart behavior

* Address most of the comments on the PR

* Fix imports in tests

* Replace manual sorthing by mutable_hash utility

* Update changelog message

* Remove CacheKeyManager

* Reword docstrings

* Finalize update of ConfigSet

* Reframe implementation towards a invalidation strategy more than the full key derivation

* Fix module renaming and keep tests dry

* Fix typo in prefix method

* Simplify implementation to use persistent_cache_id

* Simplify id namespace calculation

(cherry picked from commit fc7560f)
github-actions bot pushed a commit that referenced this pull request Sep 16, 2025
#21316)

* Add support for customizable cache keys to store log cursor

* Add changelog

* Remove PersistentCacheKey lefotver from AgentCheck

* Fix docstring mistakes

* Reduce cache key scope to checks with the same name

* Add test to cover restart behavior

* Address most of the comments on the PR

* Fix imports in tests

* Replace manual sorthing by mutable_hash utility

* Update changelog message

* Remove CacheKeyManager

* Reword docstrings

* Finalize update of ConfigSet

* Reframe implementation towards a invalidation strategy more than the full key derivation

* Fix module renaming and keep tests dry

* Fix typo in prefix method

* Simplify implementation to use persistent_cache_id

* Simplify id namespace calculation fc7560f
github-actions bot pushed a commit to ConnectionMaster/integrations-core that referenced this pull request Sep 16, 2025
DataDog#21316)

* Add support for customizable cache keys to store log cursor

* Add changelog

* Remove PersistentCacheKey lefotver from AgentCheck

* Fix docstring mistakes

* Reduce cache key scope to checks with the same name

* Add test to cover restart behavior

* Address most of the comments on the PR

* Fix imports in tests

* Replace manual sorthing by mutable_hash utility

* Update changelog message

* Remove CacheKeyManager

* Reword docstrings

* Finalize update of ConfigSet

* Reframe implementation towards a invalidation strategy more than the full key derivation

* Fix module renaming and keep tests dry

* Fix typo in prefix method

* Simplify implementation to use persistent_cache_id

* Simplify id namespace calculation fc7560f
steveny91 added a commit that referenced this pull request Sep 17, 2025
#21316) (#21351)

* Add support for customizable cache keys to store log cursor

* Add changelog

* Remove PersistentCacheKey lefotver from AgentCheck

* Fix docstring mistakes

* Reduce cache key scope to checks with the same name

* Add test to cover restart behavior

* Address most of the comments on the PR

* Fix imports in tests

* Replace manual sorthing by mutable_hash utility

* Update changelog message

* Remove CacheKeyManager

* Reword docstrings

* Finalize update of ConfigSet

* Reframe implementation towards a invalidation strategy more than the full key derivation

* Fix module renaming and keep tests dry

* Fix typo in prefix method

* Simplify implementation to use persistent_cache_id

* Simplify id namespace calculation

(cherry picked from commit fc7560f)

Co-authored-by: Juanpe Araque <[email protected]>
Co-authored-by: Steven Yuen <[email protected]>
@AAraKKe AAraKKe mentioned this pull request Sep 23, 2025
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants