NETOBSERV-2326: new alerting API in FlowCollector #1790

jotak · 2025-07-25T11:57:31Z

Description

- new API in FlowCollector spec.processor.metrics.alertGroups
  - validation hook: validate alerts with respect to enabled features, enabled metrics, etc.
  - in order to leverage FlowCollector helper functions from the validation hooks, I had to move many functions from internal/pkg/helper to api/flowcollector (otherwise there would be cyclic dependencies)
  - some code is also moved from internal/pkg/metrics (default include lists...) to api/flowcollector, so that the validation webhook can leverage it
- move existing PrometheusRule creation funcs to internal/pkg/metrics
- create conversion functions to convert the new API into PrometheusRule resources
- initial alertGroup: TooManyDrops, which works with the PacketDrops feature. Three alerts are enabled by default: From+To Namespace, From Node and To Node.

NOTE: in the current state, when you enable the PacketDrops feature, you will get some warnings because the new default alerts, related to drops, require metrics that are not enabled by default. Here is the metrics includeList that I use to have a valid config:

  processor:
    metrics:
      includeList:
      - "node_ingress_bytes_total"
      - "node_ingress_packets_total"
      - "workload_ingress_bytes_total"
      - "workload_ingress_packets_total"
      - "namespace_flows_total"
      - "namespace_drop_packets_total"
      - "node_drop_packets_total"
      - "namespace_rtt_seconds"

The added ones are: workload_ingress_packets_total, node_drop_packets_total, node_ingress_packets_total
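
To illustrate where such warnings could come from, here is a minimal Go sketch of a check that compares the metrics an alert needs against the configured includeList. The `requiredMetrics` table, the alert keys and the `missingMetrics` helper are hypothetical names made up for this example; the actual validation webhook may be organized quite differently.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredMetrics maps a hypothetical alert variant key to the metrics it needs.
// This mapping is an assumption for illustration, not the operator's real table.
var requiredMetrics = map[string][]string{
	"TooManyDrops/PerNamespace": {"namespace_drop_packets_total", "workload_ingress_packets_total"},
	"TooManyDrops/PerNode":      {"node_drop_packets_total", "node_ingress_packets_total"},
}

// missingMetrics returns, in sorted order and without repeats, the metrics
// required by the given alerts that are absent from includeList.
func missingMetrics(includeList []string, alerts []string) []string {
	enabled := make(map[string]bool, len(includeList))
	for _, m := range includeList {
		enabled[m] = true
	}
	seen := map[string]bool{}
	var missing []string
	for _, a := range alerts {
		for _, m := range requiredMetrics[a] {
			if !enabled[m] && !seen[m] {
				seen[m] = true
				missing = append(missing, m)
			}
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// The include list above, minus the three metrics listed as "added".
	includeList := []string{
		"node_ingress_bytes_total",
		"workload_ingress_bytes_total",
		"namespace_flows_total",
		"namespace_drop_packets_total",
		"namespace_rtt_seconds",
	}
	alerts := []string{"TooManyDrops/PerNamespace", "TooManyDrops/PerNode"}
	for _, m := range missingMetrics(includeList, alerts) {
		fmt.Printf("warning: an enabled alert requires metric %q, which is not in includeList\n", m)
	}
}
```

With the full include list shown above, the same check would report nothing missing.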

We need to make a decision:

- Either we add those metrics by default
- Or we do not enable the drops alerts by default
- Or we only enable those metrics by default when the drops feature is enabled
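
As a rough sketch of what the third option could look like, the default include list can be computed from whether the drops feature is enabled. The names `baseIncludeList`, `dropsAlertMetrics` and `defaultIncludeList` are hypothetical, and the base list is only inferred from the include list above minus the three added metrics; the operator's real defaults may differ.

```go
package main

import "fmt"

// baseIncludeList is inferred from the include list in this description,
// minus the three metrics reported as added. It is an assumption.
var baseIncludeList = []string{
	"node_ingress_bytes_total",
	"workload_ingress_bytes_total",
	"namespace_flows_total",
	"namespace_drop_packets_total",
	"namespace_rtt_seconds",
}

// dropsAlertMetrics are the three metrics the description lists as added
// to get a valid config for the default drops alerts.
var dropsAlertMetrics = []string{
	"workload_ingress_packets_total",
	"node_drop_packets_total",
	"node_ingress_packets_total",
}

// defaultIncludeList returns a fresh slice so callers cannot mutate the
// package-level defaults.
func defaultIncludeList(packetDropsEnabled bool) []string {
	list := append([]string(nil), baseIncludeList...)
	if packetDropsEnabled {
		list = append(list, dropsAlertMetrics...)
	}
	return list
}

func main() {
	fmt.Println("drops disabled:", defaultIncludeList(false))
	fmt.Println("drops enabled: ", defaultIncludeList(true))
}
```

The trade-off of this option is that the default metric set is no longer static: it depends on another part of the spec, which would need to be documented.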

TODO

Decide how to deal with the warnings in the default config

Dependencies

NETOBSERV-2328: new Network Health page network-observability-console-plugin#983 (not required for having the alerts, but it provides the new health page that is built on top of that)

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

openshift-ci-robot · 2025-07-25T11:57:35Z

openshift-ci · 2025-07-25T11:57:39Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign msherif1234 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-08-14T10:05:55Z

openshift-ci-robot · 2025-08-14T10:12:48Z

@jotak: This pull request references NETOBSERV-2326 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

In response to this:

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).

Does this PR require product documentation?

If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.

Does this PR require a product release notes entry?

If so, fill in "Release Note Text" in the JIRA.

Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.

If so, make sure it is described in the JIRA ticket.

QE requirements (check 1 from the list):

Standard QE validation, with pre-merge tests unless stated otherwise.

Regression tests only (e.g. refactoring with no user-facing change).

No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

github-actions · 2025-08-14T14:44:11Z

New images:

quay.io/netobserv/network-observability-operator:d9a83e9
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-d9a83e9
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-d9a83e9

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:d9a83e9 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-d9a83e9

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-d9a83e9
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

jpinsonneau · 2025-08-19T08:25:20Z

config/samples/flows_v1beta2_flowcollector.yaml

+      # alertGroups:
+      # - name: TooManyDrops
+      #   alerts:
+      #   - thresholds:
+      #       critical: "15"
+      #       warning: "10"
+      #       info: "5"
+      #   - thresholds:
+      #       critical: "15"
+      #       warning: "10"
+      #       info: "5"
+      #     grouping: PerNode
+      #     groupingDirection: ByDestination
+      #   - thresholds:
+      #       critical: "15"
+      #       warning: "10"
+      #       info: "5"
+      #     grouping: PerNamespace
+      #     groupingDirection: BySource


I guess we should provide some defaults here.

@stleerh maybe you have some ideas?

At first glance I didn't get that those are custom alerts 🙃
Maybe we should rename alertGroups to make it explicit.

I don't know, it's "semi-custom alerts": it's tweaking some parameters of the alerts that netobserv can generate, but at the end of the day it's still quite opinionated (netobserv controls the underlying promql). Users who would like truly custom alerts can create their own from scratch.

Also I called it "alertGroups" because inside that, there's again a list of alerts. I don't want to have alerts.alerts.something in the spec path.

I can see this, as an alternative:

  alerts:
  - template: TooManyDrops
    groupings:
    - group: global
      thresholds:
        critical: "15"
        warning: "10"
        info: "5"
    - group: PerNode
      direction: ByDestination
      thresholds:
        critical: "15"
        warning: "10"
        info: "5"

would that sound better?

IMO the alternative seems better than the original one, especially using template as the field name instead of a generic name.
A few other suggestions:

- set default thresholds for each template and let the user override them.

- have groupBy instead of group, defaulting to global if not specified.

Having direction in this config feels like mixing alerting and flow concepts together. Does this field, in combination with group: PerNode, translate to a DstK8S_Name = Node condition in PromQL?

Also about:

We need to make a decision:

Either we add those metrics by default
Or we do not enable the drops alerts by default
Or we only enable those metrics by default when the drops feature is enabled

I vote for #3: enable those metrics and alerts by default when the drops feature is enabled.

I'm aligned with @memodi on this 🚀

Everything could be optional apart from the template field, which lists the alerts you want to use. The rest is about overriding defaults.

This already works the way you are suggesting: default alerts are provided that users can override, and global is already the default grouping :-)
The sample yaml here is an override.
On field naming, I'm also thinking about renaming "groupings" to "variants" (these are variations of the same template).
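
The "defaults that users can override" behaviour described here can be sketched in Go as follows. The `Variant` type, the `defaultVariants` table and the `effectiveVariants` helper are hypothetical; the grouping and threshold values are simply borrowed from the sample in this thread and are not claimed to be the operator's real defaults.

```go
package main

import "fmt"

// Variant is a hypothetical, simplified view of one alert variant.
type Variant struct {
	Grouping  string // "global", "PerNode" or "PerNamespace"; empty is treated as "global"
	Direction string // "BySource" or "ByDestination"; unused for "global"
	Critical  string
	Warning   string
	Info      string
}

// defaultVariants holds illustrative defaults per template (values taken
// from the sample YAML in this thread, as an assumption).
var defaultVariants = map[string][]Variant{
	"TooManyDrops": {
		{Grouping: "global", Critical: "15", Warning: "10", Info: "5"},
		{Grouping: "PerNode", Direction: "ByDestination", Critical: "15", Warning: "10", Info: "5"},
		{Grouping: "PerNamespace", Direction: "BySource", Critical: "15", Warning: "10", Info: "5"},
	},
}

// effectiveVariants returns the user's variants for a template when an
// override exists, otherwise the defaults. An empty grouping becomes "global".
func effectiveVariants(template string, overrides map[string][]Variant) []Variant {
	src, ok := overrides[template]
	if !ok {
		src = defaultVariants[template]
	}
	out := make([]Variant, len(src))
	for i, v := range src {
		if v.Grouping == "" {
			v.Grouping = "global"
		}
		out[i] = v
	}
	return out
}

func main() {
	// No override: the defaults apply.
	fmt.Println(effectiveVariants("TooManyDrops", nil))

	// Override with a single variant and no explicit grouping.
	overrides := map[string][]Variant{
		"TooManyDrops": {{Critical: "20", Warning: "12"}},
	}
	fmt.Println(effectiveVariants("TooManyDrops", overrides))
}
```

In this sketch an override replaces the whole list of variants for a template rather than merging field by field; whether the real API merges or replaces is not stated in this thread.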

Having direction in this config feels like mixing alerting and flow concepts together. Does this field, in combination with group: PerNode, translate to a DstK8S_Name = Node condition in PromQL?

Yes, metrics/alerts are still based on flows and the concept of direction doesn't disappear. Yes also to your second point; to be precise, direction=ByDestination,group=PerNode translates into by (DstK8S_HostName) in PromQL.

Actually, maybe I misunderstood your remark. Direction here is indeed not the same as the flow direction that we have in flows: flow direction is whether the flow was observed from the source or from the destination, which is not what it means here. So direction may be a poor naming choice, because the term is getting overloaded...
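
The mapping discussed above can be sketched as a small Go helper. Only the PerNode + ByDestination case (DstK8S_HostName) is confirmed in this thread; the other label names are assumed by symmetry, and `groupLabel`, `sumRateBy` and the 2m rate window are hypothetical choices for this example rather than what the operator generates.

```go
package main

import "fmt"

// groupLabel resolves the PromQL label to aggregate on.
// An empty or "global" grouping means no "by" clause at all.
func groupLabel(grouping, direction string) (string, error) {
	if grouping == "" || grouping == "global" {
		return "", nil
	}
	var prefix string
	switch direction {
	case "BySource":
		prefix = "Src"
	case "ByDestination":
		prefix = "Dst"
	default:
		return "", fmt.Errorf("unsupported direction %q", direction)
	}
	switch grouping {
	case "PerNode":
		return prefix + "K8S_HostName", nil
	case "PerNamespace":
		// Assumed by analogy with the node case.
		return prefix + "K8S_Namespace", nil
	}
	return "", fmt.Errorf("unsupported grouping %q", grouping)
}

// sumRateBy builds an illustrative aggregation over a metric, adding a
// "by" clause only when a label is given.
func sumRateBy(metric, label string) string {
	expr := fmt.Sprintf("sum(rate(%s[2m]))", metric)
	if label != "" {
		expr += " by (" + label + ")"
	}
	return expr
}

func main() {
	label, err := groupLabel("PerNode", "ByDestination")
	if err != nil {
		panic(err)
	}
	fmt.Println(sumRateBy("node_drop_packets_total", label))
}
```

Seen this way, the field selects which side of the flow is used as the aggregation key, which is a different notion from the direction in which a flow was observed.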

jotak · 2025-09-04T07:22:38Z

/retest

As a side matter, the dashboards have become more and more overloaded as we've added metrics, and have started to look like a catch-all. The metrics added here make this worse, so I'm cleaning that up:

- Remove app/infra traffic in single stats panels (keep them in line charts)
- Better sort the single stats panels
- Reorganize the "Traffic rates" section by splitting it per node/namespace/workload into new sections
- When there are "twin charts" bps/pps, keep only the bps one

memodi · 2025-09-04T19:54:04Z

/retest e2e-operator

openshift-ci · 2025-09-04T19:54:07Z

@memodi: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test ci-bundle-noo-bundle

/test images

The following commands are available to trigger optional jobs:

/test e2e-operator

Use /test all to run all jobs.

In response to this:

/retest e2e-operator

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

memodi · 2025-09-04T19:54:18Z

/ok-to-test

memodi · 2025-09-04T19:58:08Z

/test e2e-operator

github-actions · 2025-09-04T19:58:29Z

New images:

quay.io/netobserv/network-observability-operator:272e2c3
quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-272e2c3
quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-272e2c3

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:272e2c3 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-272e2c3

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-272e2c3
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

codecov · 2025-09-04T21:22:02Z

Codecov Report

❌ Patch coverage is 81.00775% with 147 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.90%. Comparing base (5ae8890) to head (c181334).
⚠️ Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
api/flowcollector/v1beta2/zz_generated.deepcopy.go	2.32%	41 Missing and 1 partial ⚠️
...lector/v1beta2/flowcollector_validation_webhook.go	86.75%	16 Missing and 4 partials ⚠️
api/flowcollector/v1beta2/helper.go	84.25%	14 Missing and 3 partials ⚠️
internal/controller/ebpf/agent_controller.go	31.57%	0 Missing and 13 partials ⚠️
internal/pkg/metrics/alerts/builder.go	90.84%	7 Missing and 6 partials ⚠️
.../controller/consoleplugin/consoleplugin_objects.go	29.41%	0 Missing and 12 partials ⚠️
internal/controller/ebpf/bpfmanager-controller.go	0.00%	11 Missing ⚠️
internal/pkg/metrics/alerts/alerts.go	89.89%	8 Missing and 2 partials ⚠️
...flowcollector/v1beta2/flowcollector_alert_types.go	95.83%	2 Missing and 1 partial ⚠️
internal/controller/flp/flp_pipeline_builder.go	72.72%	0 Missing and 3 partials ⚠️
... and 3 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1790      +/-   ##
==========================================
+ Coverage   70.77%   70.90%   +0.13%     
==========================================
  Files          75       79       +4     
  Lines        9837    10182     +345     
==========================================
+ Hits         6962     7220     +258     
- Misses       2488     2561      +73     
- Partials      387      401      +14

Flag	Coverage Δ
unittests	`70.90% <81.00%> (+0.13%)`	⬆️

Files with missing lines	Coverage Δ
api/flowcollector/v1beta2/flowcollector_types.go	`100.00% <ø> (ø)`
api/flowmetrics/v1alpha1/flowmetric_webhook.go	`52.74% <100.00%> (+3.33%)`	⬆️
...ntroller/consoleplugin/consoleplugin_reconciler.go	`68.96% <100.00%> (ø)`
internal/controller/ebpf/agent_metrics.go	`86.46% <100.00%> (ø)`
internal/controller/flowcollector_controller.go	`74.59% <100.00%> (ø)`
internal/controller/flp/flp_common_objects.go	`90.23% <100.00%> (-1.01%)`	⬇️
internal/controller/flp/flp_controller.go	`66.81% <100.00%> (ø)`
internal/controller/flp/flp_monolith_objects.go	`84.07% <100.00%> (ø)`
internal/controller/flp/flp_monolith_reconciler.go	`56.66% <100.00%> (+0.29%)`	⬆️
internal/controller/flp/flp_transfo_objects.go	`86.36% <100.00%> (ø)`
... and 21 more

... and 2 files with indirect coverage changes

openshift-ci-robot added the jira/valid-reference label Jul 25, 2025

jotak force-pushed the alert-api branch 2 times, most recently from e7100fe to f3778a5 Compare August 8, 2025 09:59

jotak requested review from OlivierCazade and jpinsonneau August 11, 2025 10:34

jotak added the needs-review Tells that the PR needs a review label Aug 11, 2025

jotak mentioned this pull request Aug 11, 2025

NETOBSERV-2328: new Network Health page netobserv/network-observability-console-plugin#983

jotak force-pushed the alert-api branch 2 times, most recently from 26deda2 to 7a3ef29 Compare August 12, 2025 17:03

jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Aug 14, 2025

jpinsonneau reviewed Aug 19, 2025

api/flowcollector/v1beta2/flowcollector_types.go

jpinsonneau reviewed Aug 19, 2025

api/flowcollector/v1beta2/flowcollector_validation_webhook.go

jpinsonneau reviewed Aug 19, 2025

jotak added 11 commits September 2, 2025 14:15

Alerts documentation

290b8dc

doc cleaning

2faa650

fix lint

e8fb50a

Add health annotation

b623c26

add metrics includelist sample

cb859d2

Add information in health annotation, add sample

12ff924

- add threshold info
- add labels info

trivial: text update

4a3f9ab

Refactor parts of the alerting API:

7d89e1e

- Allow to provide a threshold per severity
- Introduce 'lowVolumeThreshold' to eliminate background noise in alerts
- Better naming of alerts

nolint cyclop

d12da39

refactor validation webhook, fix CRD doc

aeb5871

jotak force-pushed the alert-api branch from 405781b to aeb5871 Compare September 2, 2025 12:17

github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 2, 2025

Address feedback on new alerting API

c7c13f4

jotak added 2 commits September 4, 2025 10:43

Update doc and sample

d0f4ef3

jotak requested review from jpinsonneau and memodi September 4, 2025 13:57

openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Sep 4, 2025
