NETOBSERV-2184 improve cache & reconcile events #1517
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
New images:
They will expire after two weeks. To deploy this build:
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:f99e992 make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-f99e992

Or as a Catalog Source:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-f99e992
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
@jpinsonneau: The following test failed:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Moving back to draft as it requires some changes after testing on a bigger cluster.
New images:
They will expire after two weeks. To deploy this build:
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:1646b12 make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1646b12

Or as a Catalog Source:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1646b12
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
@jotak any idea on how to make the linter happy?
@@ -1,13 +1,14 @@
 module github.com/netobserv/network-observability-operator

-go 1.23.0
+go 1.24.0
We should hold the bump to go 1.24 until our downstream go builder is ready; currently it's still lagging behind, afaik.
Is the upgrade needed here?
The k8s.io/client-go v0.33.1 update forced me to do so, but we can do a two-step update:
- first to v0.32.5, though I didn't see performance improvements there: kubernetes/client-go@v0.32.3...v0.32.5
- then to v0.33.1 as soon as we can: kubernetes/client-go@v0.32.5...v0.33.1
Yes, I know; I delayed the k8s 0.33 bump for that reason too (dependabot and konflux keep opening PRs about that).
@@ -1,7 +1,7 @@
 ARG BUILDVERSION

 # Build the manager binary
-FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:v1.23 as builder
+FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:v1.24 as builder
I don't think that exists at the moment (?)
How do you check that? 🤔
pkg/narrowcache/client.go
if err != nil {
    return err
}
copier.Copy(out, obj)
I wanted to check how this copier performed versus the old hack, and it seems, pretty badly :(
Creating those benchmarks:
// Assumed package and imports for this snippet; copyInto is the old
// reflect-based helper that copier replaces in this PR.
package narrowcache

import (
    "fmt"
    "testing"

    "github.com/jinzhu/copier"
    corev1 "k8s.io/api/core/v1"
    v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setup() corev1.ConfigMap {
    m := corev1.ConfigMap{
        ObjectMeta: v1.ObjectMeta{
            Labels: map[string]string{},
        },
        Data: map[string]string{},
    }
    for i := 0; i < 20; i++ {
        m.Labels[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
        m.Data[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
    }
    return m
}

func Benchmark_Copier(b *testing.B) {
    in := setup()
    for i := 0; i < b.N; i++ {
        out := corev1.ConfigMap{}
        _ = copier.Copy(&out, &in)
    }
}

func Benchmark_OldHack(b *testing.B) {
    in := setup()
    for i := 0; i < b.N; i++ {
        out := corev1.ConfigMap{}
        _ = copyInto(&in, &out)
    }
}
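(For reference, these were presumably run with the standard Go benchmark tooling, something like go test -bench . -benchmem in the package directory; the -benchmem flag is what adds the B/op and allocs/op columns shown below.)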
Then, running them:
Benchmark_Copier-8 23798 50697 ns/op 19888 B/op 529 allocs/op
Benchmark_OldHack-8 9953776 130.8 ns/op 288 B/op 1 allocs/op
... which makes the hack almost 400 times faster!
Which makes sense, I guess, as copier does a deep copy whereas the hack does some sort of reassignment trick, if I understand correctly.
fwiw the hack is still there in controller-runtime https://github.com/kubernetes-sigs/controller-runtime/blob/main/pkg/cache/internal/cache_reader.go#L93
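For context, here is a minimal sketch of that reassignment trick, modeled on the controller-runtime code linked above; it illustrates the approach and is not necessarily the exact copyInto helper from this repo:

package sketch

import (
    "fmt"
    "reflect"

    "k8s.io/apimachinery/pkg/runtime"
)

// copyInto does a shallow reassignment instead of a deep copy: the value behind out
// is set to the cached value, so maps and slices end up shared with the cache. That
// sharing is the trade-off that makes it so much cheaper than copier in the benchmark above.
func copyInto(obj, out runtime.Object) error {
    outVal := reflect.ValueOf(out)
    objVal := reflect.ValueOf(obj)
    if !objVal.Type().AssignableTo(outVal.Type()) {
        return fmt.Errorf("cache had type %s, but %s was asked for", objVal.Type(), outVal.Type())
    }
    reflect.Indirect(outVal).Set(reflect.Indirect(objVal))
    return nil
}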
Yes, the hack will always be faster, but I suspect it involves a memory leak in the end.
I just ran some pprof alongside a comparison between copier and the old hack, and couldn't find a memory leak. With no evidence of a leak, I think we should keep the hack, as it's more efficient. I'm currently running an NDH test with your PR ( https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/65121/rehearse-65121-periodic-ci-openshift-eng-ocp-qe-perfscale-ci-netobserv-perf-tests-netobserv-aws-4.19-nightly-x86-node-density-heavy-25nodes/1925115161513824256 )
What we can do is, after that, roll back this single change (the one with copier) and run an NDH again.
I must say I don't understand the error, but it seems related to bumping to go 1.24, as I don't get it with 1.23.
New images:
They will expire after two weeks. To deploy this build:
# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:b7bf8de make deploy
# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b7bf8de

Or as a Catalog Source:
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b7bf8de
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
pkg/narrowcache/client.go
@@ -138,10 +126,54 @@ func (c *Client) updateCache(ctx context.Context, key string, watcher watch.Inte
}

func (c *Client) setToCache(key string, obj runtime.Object) error {
	// cleanup unnecessary fields
Could you move this part into the GVKInfo struct, e.g. as a new Cleanup callback? Because the narrowcache client is designed to allow callers to extend which types are managed, by adding their own GVKInfo if needed, this big switch kind of breaks the pattern.
Also, I think emptying SetManagedFields could be called regardless of the type, as it's in ObjectMeta.
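A minimal sketch of what that could look like; the Cleanup field and the surrounding function are assumptions for illustration, not the actual narrowcache types:

package sketch

import (
    "k8s.io/apimachinery/pkg/api/meta"
    "k8s.io/apimachinery/pkg/runtime"
)

// GVKInfo sketch: callers registering their own types could also register how to
// strip them, rather than relying on one central type switch.
type GVKInfo struct {
    // Existing fields elided.
    // Cleanup, if set, removes fields the operator never reads before caching.
    Cleanup func(obj runtime.Object)
}

// cleanForCache drops managed fields for any type (they live in ObjectMeta), then
// applies the per-GVK callback if one was registered.
func cleanForCache(info *GVKInfo, obj runtime.Object) {
    if m, err := meta.Accessor(obj); err == nil {
        m.SetManagedFields(nil)
    }
    if info != nil && info.Cleanup != nil {
        info.Cleanup(obj)
    }
}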
I kept the types to have autocompletion in vscode for future additions.
Just realized we don't track the operator metrics... and unfortunately the per-pod metrics that we track are flawed when there are pod restarts (which seems to have happened here, e.g. it shows several operator pods). In the meantime, I've compared with another run that I found without pod restarts: so that's a -9% RSS.
Description
annotations
Gaining around 20% on a 10-node cluster.
On the left, this PR taking 42.06 MB; on the right, the main branch taking 51.66 MB.

Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.