-
Notifications
You must be signed in to change notification settings - Fork 40
NETOBSERV-2326: new alerting API in FlowCollector #1790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@jotak: This pull request references NETOBSERV-2326 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
e7100fe
to
f3778a5
Compare
26deda2
to
7a3ef29
Compare
@jotak: This pull request references NETOBSERV-2326 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@jotak: This pull request references NETOBSERV-2326 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@jotak: This pull request references NETOBSERV-2326 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
New images:
They will expire after two weeks. To deploy this build: # Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:d9a83e9 make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-d9a83e9 Or as a Catalog Source: apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
name: netobserv-dev
namespace: openshift-marketplace
spec:
sourceType: grpc
image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-d9a83e9
displayName: NetObserv development catalog
publisher: Me
updateStrategy:
registryPoll:
interval: 1m |
# alertGroups: | ||
# - name: TooManyDrops | ||
# alerts: | ||
# - thresholds: | ||
# critical: "15" | ||
# warning: "10" | ||
# info: "5" | ||
# - thresholds: | ||
# critical: "15" | ||
# warning: "10" | ||
# info: "5" | ||
# grouping: PerNode | ||
# groupingDirection: ByDestination | ||
# - thresholds: | ||
# critical: "15" | ||
# warning: "10" | ||
# info: "5" | ||
# grouping: PerNamespace | ||
# groupingDirection: BySource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should provide some defaults here
@stleerh maybe you have some ideas ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't got at first look that those are customs alerts 🙃
Maybe we should rename alertGroups
to make it explicit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, it's "semi-custom alerts": it's tweaking some parameters of the alerts that netobserv can generate, but at the end of the day it's still quite opinionated (netobserv controls the underlying promql). Users who would like truly custom alerts can create their own from scratch.
Also I called it "alertGroups" because inside that, there's again a list of alerts. I don't want to have alerts.alerts.something
in the spec path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see this, as an alternative:
alerts:
- template: TooManyDrops
groupings:
- group: global
thresholds:
critical: "15"
warning: "10"
info: "5"
- group: PerNode
direction: ByDestination
thresholds:
critical: "15"
warning: "10"
info: "5"
would that sound better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO alternate seems better than the original one, especially using template
as name instead of simply generic name
.
Few other suggestions:
- set default thresholds for each template and let user override.
- have
groupBy
instead ofgroup
defaults to global if not specified. - Having
direction
in this config feels like mixing alerting and flows concept together. Does this field in combination with group: PerNode, translate toDstK8S_Name = Node
condition in Promql?
also about:
We need to take a decision:
Either we had those metrics by default
Or we do not enable the drops alerts by default
Or we only enable those metrics by default when the drops feature is enabled
I vote for # 3 to enable those metrics and alerts by default when drop features is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm aligned with @memodi on this 🚀
Everything could be optionnal appart from the template
field to list alerts you want to use. The rest is about overriding defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is already like you are suggesting, there are default alerts provided, that users can override, and global is already the default grouping :-)
The sample yaml here is an override.
On fields naming, I'm also thinking about changing "groupings" for "variants" (these are variations of the same template).
Having direction in this config feels like mixing alerting and flows concept together. Does this field in combination with group: PerNode, translate to DstK8S_Name = Node condition in Promql?
Yes, metrics/alerts are still based on flows and the concept of direction doesn't disappear. Yes also to your second point, or to be very accurate, direction=ByDestination,group=PerNode translates into by (DstK8S_HostName)
in promql.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually maybe I misunderstood your remark, direction
here is indeed not the same as the Flow direction that we have in flows. Flow direction is whether the flow was observed from source or from destination. It's not what it is here. So maybe direction
is a poor naming choice here, indeed, because it's getting overloaded...
- new API in FlowCollector spec.processor.metrics.alertGroups - validation hook: validate alerts wrt enabled features, enabled metrics, etc. - in order to leverage FlowCollector helper functions from the validation hooks, I had to move many functions from internal/pkg/helper to api/flowcollector (without it, there would be cyclic dependencies) - some code is also moved from internal/pkg/metrics (default include lists...) to api/flowcollector, so that validation webhook can leverage them - move existing PrometheusRule creation funcs to internal/pkg/metrics - create conversion functions to convert the new API into PrometheusRules resources - initial alertGroup: TooManyDrops, that works with PacketDrops feature. 3 alerts are enabled by default: From+To Namespace, From Node and To Node.
- add threshold info - add labels info
- Allow to provide a threshold per severity - Introduce 'lowVolumeThreshold' to eliminate background noise in alerts - Better naming of alerts
/retest |
As a side matter, the dashboards have been getting more and more overloaded as we've added metrics, and started to look as a catch-all. This is worsening with the added metrics. So I'm cleaning that up: - Remove app/infra traffic in single stats panels (keep them in line charts) - Better sort the single stats panels - Reorganize "Traffic rates" section by splitting per node/namespace/workload as new sections - When there are "twin charts" bps/pps, keep only the bps one
/retest e2e-operator |
@memodi: The
The following commands are available to trigger optional jobs:
Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
/test e2e-operator |
New images:
They will expire after two weeks. To deploy this build: # Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:272e2c3 make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-272e2c3 Or as a Catalog Source: apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
name: netobserv-dev
namespace: openshift-marketplace
spec:
sourceType: grpc
image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-272e2c3
displayName: NetObserv development catalog
publisher: Me
updateStrategy:
registryPoll:
interval: 1m |
Description
NOTE: in the current state, when you enable the PacketDrops feature, you will get some warnings because the new default alerts, related to drops, require metrics that are not enabled by default. Here is the metrics includeList that I use to have a valid config:
The added ones are: workload_ingress_packets_total, node_drop_packets_total, node_ingress_packets_total
We need to take a decision:
TODO
Dependencies
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.