@linus-elastisys linus-elastisys commented Sep 16, 2025

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

This PR makes it possible to configure the rollover size individually for each OpenSearch index. Previously, a 5 GB rollover size was used for all indices, even though the Prometheus alert rules used different thresholds per index. This PR introduces separate settings for the kubernetes, kubeaudit, other, and authlog indices. The new default rollover values are based on the existing alert sizes, set slightly lower.
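As a rough illustration of the idea (not the actual templates or config keys changed in this PR), per-index rollover policies could be generated along the following lines. The helper name `build_rollover_policy` and the sizes in `EXAMPLE_SIZES` are hypothetical placeholders; only the four index names come from this PR, and the policy body assumes the OpenSearch ISM `rollover` action with its `min_size` condition.

```python
import json


def build_rollover_policy(index_name: str, min_size: str) -> dict:
    """Return an ISM policy body that rolls an index over once it reaches min_size.

    Hypothetical helper for illustration; the PR itself configures this
    through the repository's config file and schema.
    """
    return {
        "policy": {
            "description": f"Rollover policy for {index_name} (illustrative)",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    # One rollover condition per index instead of a shared 5 GB value.
                    "actions": [{"rollover": {"min_size": min_size}}],
                    "transitions": [],
                }
            ],
        }
    }


# Placeholder sizes for illustration only, NOT the defaults introduced by this PR.
EXAMPLE_SIZES = {
    "kubernetes": "4gb",
    "kubeaudit": "4gb",
    "other": "1gb",
    "authlog": "2mb",
}

policies = {
    name: build_rollover_policy(name, size) for name, size in EXAMPLE_SIZES.items()
}

if __name__ == "__main__":
    print(json.dumps(policies["authlog"], indent=2))
```

Keeping each rollover size slightly below the matching alert threshold means a healthy index should roll over before its size alert would fire.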

Information to reviewers

The authlog alert size was increased slightly: setting the rollover size to 1 MB felt too small, and leaving both the alert and the rollover at 2 MB could cause false alerts.

One caveat is that the new ISM policy (with the smaller size) is only applied after the next rollover, so the old size can remain in effect for up to a day.

Question to reviewers: Should this be a kind/admin-change and/or have a "Platform Administrator notice" since it changes the configuration file and schema? I'm not sure if my changes here will require any action to be taken by platform admins. Also not sure if this is something that requires a migration.

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@linus-elastisys linus-elastisys marked this pull request as ready for review September 17, 2025 11:10
@linus-elastisys linus-elastisys requested review from a team as code owners September 17, 2025 11:10
@linus-elastisys linus-elastisys marked this pull request as draft September 17, 2025 11:12
@linus-elastisys linus-elastisys marked this pull request as ready for review September 17, 2025 11:13
@lunkan93 lunkan93 requested a review from viktor-f September 17, 2025 11:53
@linus-elastisys (Contributor, Author) commented

I tested this by manually submitting a large number of documents to the authlog index to fill it above the default 2 MB limit.
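For anyone wanting to repeat a similar test, here is a minimal sketch of how a filler payload for the OpenSearch `_bulk` API could be built. The target name `authlog`, the document fields, and the helper `build_bulk_payload` are assumptions for illustration, not the exact requests used above. The block only constructs the NDJSON body; sending it (a POST to `_bulk` with `Content-Type: application/x-ndjson`) is left out.

```python
import json


def build_bulk_payload(index: str, doc_count: int, message_bytes: int = 1024) -> str:
    """Build an NDJSON body for the _bulk API with doc_count filler documents.

    Hypothetical helper: each document is preceded by an "index" action line,
    and the body ends with a newline as the _bulk API requires.
    """
    filler = "x" * message_bytes
    lines = []
    for seq in range(doc_count):
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({"seq": seq, "message": filler}))
    return "\n".join(lines) + "\n"


# "authlog" as the write target is an assumption; use the real write alias.
payload = build_bulk_payload("authlog", doc_count=3000)

if __name__ == "__main__":
    print(f"payload size: {len(payload.encode('utf-8'))} bytes")
```

Note that the raw payload size is not the same as the resulting on-disk index size, since stored data is compressed and indexed, so more than 2 MB of input may be needed to cross the threshold.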

Before: we can see that the 2 MB limit is there.
Screenshot from 2025-09-18 11-06-30

After posting a lot of example documents using the API, the -10 index rolls over successfully and a new -11 index is created once the size is detected (ISM policies seem to be checked every ~5 minutes, so some overrun of the limit is expected):
Screenshot from 2025-09-18 11-10-09
Screenshot from 2025-09-18 11-10-29
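To put a number on the expected overrun, here is a small back-of-the-envelope sketch. It assumes the ~5 minute check interval observed above and a constant ingest rate; the function name and the example rate are hypothetical.

```python
def worst_case_overrun_bytes(
    ingest_bytes_per_second: float,
    check_interval_seconds: float = 300.0,  # ~5 minutes, as observed in this test
) -> float:
    """Upper bound on how far an index can grow past its rollover size.

    Assumes the threshold is crossed right after a policy check, so the
    index keeps growing for one full interval before the next check.
    """
    return ingest_bytes_per_second * check_interval_seconds


# Example: a steady 10 KiB/s of stored data (hypothetical rate).
overrun = worst_case_overrun_bytes(10 * 1024)

if __name__ == "__main__":
    print(f"worst-case overrun: {overrun / (1024 * 1024):.2f} MiB")
```

For a small threshold like the 2 MB authlog limit, even a modest ingest rate can overshoot by more than the limit itself, which is one reason to keep the alert threshold above the rollover size.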

@elastisys-staffan elastisys-staffan left a comment

Tested installation on my dev cluster, works as expected. Nice work! 🎉

@aarnq aarnq left a comment

LGTM, I have a suggestion though.

@lunkan93 lunkan93 left a comment

Nice work 👍

@linus-elastisys linus-elastisys merged commit ce9adad into main Sep 23, 2025
12 checks passed
@linus-elastisys linus-elastisys deleted the linus/rework-index-size-alerts branch September 23, 2025 08:26
Successfully merging this pull request may close these issues.

Rework index size alerts and rollover thresholds in OpenSearch