
NETOBSERV-2184 improve cache & reconcile events #1517


Draft: jpinsonneau wants to merge 7 commits into main

Conversation

@jpinsonneau (Contributor) commented May 15, 2025

Description

Gaining around 20% of memory on a 10-node cluster: this PR uses 42.06 MB versus 51.66 MB on the main branch (about an 18.6% reduction).
(screenshot: side-by-side memory comparison, this PR on the left, main on the right)

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user-facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g.: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).


openshift-ci bot commented May 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign oliviercazade for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jpinsonneau jpinsonneau requested review from OlivierCazade and jotak and removed request for OlivierCazade May 15, 2025 15:07
@jpinsonneau jpinsonneau added the ok-to-test and needs-review labels May 15, 2025

New images:

  • quay.io/netobserv/network-observability-operator:f99e992
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-f99e992
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-f99e992

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:f99e992 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-f99e992

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-f99e992
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
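
Once saved to a file, the CatalogSource can be applied with, for example, oc apply -f netobserv-catalogsource.yaml (the file name here is just an example).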


openshift-ci bot commented May 15, 2025

@jpinsonneau: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-operator
Commit: 9c03e03 (details: link)
Required: false
Rerun command: /test e2e-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jpinsonneau jpinsonneau marked this pull request as draft May 16, 2025 10:11
@jpinsonneau (Contributor, Author) commented:

Moving back to draft as it requires some changes after testing on a bigger cluster

@github-actions github-actions bot removed the ok-to-test label May 19, 2025
@jpinsonneau jpinsonneau added the ok-to-test label May 19, 2025

New images:

  • quay.io/netobserv/network-observability-operator:1646b12
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1646b12
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1646b12

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:1646b12 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1646b12

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1646b12
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test label May 19, 2025

@@ -1,13 +1,14 @@
 module github.com/netobserv/network-observability-operator

-go 1.23.0
+go 1.24.0
Member commented:

We should hold the bump to Go 1.24 until our downstream Go builder is ready; as far as I know it's still lagging behind.
Is the upgrade needed here?

@jpinsonneau (Contributor, Author) replied:

The k8s.io/client-go v0.33.1 update forced me to do so, but we can do a two-step update.
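
For illustration, a possible intermediate go.mod state for such a two-step update (versions here are assumptions and would need to be checked against the dependency graph): stay on Go 1.23 with the last client-go line that still supports it, then bump both together once the downstream builder handles 1.24.

module github.com/netobserv/network-observability-operator

go 1.23.0

require k8s.io/client-go v0.32.3 // hypothetical intermediate step; move to v0.33.x together with go 1.24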

Member replied:

Yes, I know; I delayed the k8s 0.33 bump for that reason too (dependabot and Konflux keep opening PRs about that).

@@ -1,7 +1,7 @@
 ARG BUILDVERSION

 # Build the manager binary
-FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:v1.23 as builder
+FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:v1.24 as builder
Member commented:

I don't think that exists at the moment (?)

@jpinsonneau (Contributor, Author) replied:

How do you check that? 🤔
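
For what it's worth, one way to check which builder tags are published (assuming pull access to brew.registry.redhat.io) is to list them with skopeo:

skopeo list-tags docker://brew.registry.redhat.io/rh-osbs/openshift-golang-builder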

if err != nil {
	return err
}
copier.Copy(out, obj)
Member commented:

I wanted to check how this copier performs versus the old hack, and it seems it does pretty badly :(

Creating these benchmarks:

// Package and imports assumed for context: "copier" is presumed to be
// github.com/jinzhu/copier, and copyInto is the existing "old hack" helper.
package narrowcache

import (
	"fmt"
	"testing"

	"github.com/jinzhu/copier"
	corev1 "k8s.io/api/core/v1"
	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setup() corev1.ConfigMap {
	m := corev1.ConfigMap{
		ObjectMeta: v1.ObjectMeta{
			Labels: map[string]string{},
		},
		Data: map[string]string{},
	}
	for i := 0; i < 20; i++ {
		m.Labels[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
		m.Data[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
	}
	return m
}

func Benchmark_Copier(b *testing.B) {
	in := setup()
	for i := 0; i < b.N; i++ {
		out := corev1.ConfigMap{}
		_ = copier.Copy(&out, &in)
	}
}

func Benchmark_OldHack(b *testing.B) {
	in := setup()
	for i := 0; i < b.N; i++ {
		out := corev1.ConfigMap{}
		_ = copyInto(&in, &out)
	}
}

Then, running them:

Benchmark_Copier-8    	   23798	     50697 ns/op	   19888 B/op	     529 allocs/op
Benchmark_OldHack-8   	 9953776	       130.8 ns/op	     288 B/op	       1 allocs/op

... which makes the hack almost 400 times faster!
That makes sense, I guess, since copier does a deep copy whereas the hack does some sort of reassignment trick, if I understand correctly.
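
For context, a minimal sketch of what such a reassignment-style copy could look like (an illustration only, not the actual copyInto from the repo):

package narrowcache

import (
	corev1 "k8s.io/api/core/v1"
)

// copyInto here is a hypothetical reassignment-style copy: the struct value is
// assigned as a whole, so maps and slices (Labels, Data, ManagedFields, ...)
// end up shared with the source rather than duplicated, which is why this
// approach is so much cheaper than a deep copy. The trade-off is that the
// result must be treated as read-only.
func copyInto(in, out *corev1.ConfigMap) error {
	*out = *in
	return nil
}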

@jpinsonneau (Contributor, Author) commented May 20, 2025:

Yes, the hack will always be faster, but I suspect it involves a memory leak in the end.

Member replied:

I just ran some pprof alongside a comparison between copier and the old hack, and couldn't find a memory leak. With no evidence of a leak, I think we should keep the hack, as it's more efficient. I'm currently running an NDH (node-density-heavy) test with your PR ( https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/65121/rehearse-65121-periodic-ci-openshift-eng-ocp-qe-perfscale-ci-netobserv-perf-tests-netobserv-aws-4.19-nightly-x86-node-density-heavy-25nodes/1925115161513824256 )

What we can do after that is roll back this single change (the copier) and run another NDH.
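
For reproducibility, a minimal sketch of exposing a heap profile endpoint through controller-runtime (assuming the operator does not already enable it; PprofBindAddress is a standard controller-runtime manager option, not something specific to this repo):

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Expose /debug/pprof on :8082 so a heap profile can be captured with:
	//   go tool pprof http://localhost:8082/debug/pprof/heap
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		PprofBindAddress: ":8082",
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}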

@jotak (Member) commented May 20, 2025

> @jotak any idea on how to make the linter happy? https://github.com/netobserv/network-observability-operator/actions/runs/15115971335/job/42486600090?pr=1517

I must say I don't understand the error, but it seems related to bumping to Go 1.24, as I don't get it with 1.23.
So if we stick to 1.23, that should pass... until next time.

@jotak jotak added the ok-to-test label May 21, 2025

New images:

  • quay.io/netobserv/network-observability-operator:b7bf8de
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b7bf8de
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b7bf8de

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:b7bf8de make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b7bf8de

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b7bf8de
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@@ -138,10 +126,54 @@ func (c *Client) updateCache(ctx context.Context, key string, watcher watch.Inte
}

func (c *Client) setToCache(key string, obj runtime.Object) error {
	// clean up unnecessary fields
Member commented:

Could you move this part into the GVKInfo struct, e.g. as a new Cleanup callback? The narrowcache client is designed to let callers extend which types are managed by adding their own GVKInfo if needed, and this big switch kind of breaks that pattern.

Also, I think emptying the managed fields (SetManagedFields) could be done regardless of the type, since they live in ObjectMeta.
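
For illustration, a minimal sketch of that shape (type and field names are assumptions made for the example, not the actual narrowcache API):

package narrowcache

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// GVKInfo describes how one GroupVersionKind is managed by the narrow cache.
// Cleanup is an optional hook invoked before the object is stored, so each
// registered type can strip fields it does not need to keep in memory.
type GVKInfo struct {
	GVK     schema.GroupVersionKind
	Cleanup func(obj runtime.Object)
}

func stripForCache(obj runtime.Object, info *GVKInfo) {
	// Managed fields live in ObjectMeta, so they can be cleared for any type.
	if acc, err := meta.Accessor(obj); err == nil {
		acc.SetManagedFields(nil)
	}
	if info != nil && info.Cleanup != nil {
		info.Cleanup(obj)
	}
}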

@jpinsonneau (Contributor, Author) replied:

Done in 57a1cab.

I kept the types to have autocompletion in VS Code for future additions.

@jotak (Member) commented May 21, 2025

Just realized we don't track the operator metrics... and unfortunately the per-pod metrics that we do track are flawed when there are pod restarts (which seems to have happened here, e.g., as it shows several operator pods).
I've opened a PR to add the operator metrics to our tests: openshift-eng/ocp-qe-perfscale-ci#742

In the meantime, I've compared with another run that I found without pod restarts:
Previous run: RSS = 115979810.1
New run (this PR): RSS = 105546001.1

So that's a -9% RSS.
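
For reference, the relative change works out to (115979810.1 - 105546001.1) / 115979810.1 ≈ 0.090, i.e. roughly a 9% reduction in RSS.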

@jotak jotak added the needs-changes label and removed the needs-review label May 27, 2025
@github-actions github-actions bot removed the ok-to-test label Jun 2, 2025
@jpinsonneau jpinsonneau removed the needs-changes label Jun 2, 2025
@jpinsonneau jpinsonneau requested a review from jotak June 2, 2025 09:44