
NETOBSERV-2184 improve cache & reconcile events #1517


Draft: jpinsonneau wants to merge 7 commits into main

Conversation

@jpinsonneau (Contributor) commented May 15, 2025

Description

Gaining around 20% of memory on a 10-node cluster: this PR uses 42.06 MB versus 51.66 MB on the main branch (about an 18.6% reduction).
(screenshot: side-by-side memory comparison, this PR on the left, main on the right)

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user-facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g.: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).


openshift-ci bot commented May 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign oliviercazade for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jpinsonneau jpinsonneau requested review from OlivierCazade and jotak and removed request for OlivierCazade May 15, 2025 15:07
@jpinsonneau jpinsonneau added the ok-to-test and needs-review labels May 15, 2025

New images:

  • quay.io/netobserv/network-observability-operator:f99e992
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-f99e992
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-f99e992

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:f99e992 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-f99e992

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-f99e992
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m
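
Once saved to a file, the CatalogSource can be applied with, for example, oc apply -f netobserv-catalogsource.yaml (the file name here is just an example).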


openshift-ci bot commented May 15, 2025

@jpinsonneau: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/e2e-operator
Commit: 9c03e03 (details: link)
Required: false
Rerun command: /test e2e-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jpinsonneau jpinsonneau marked this pull request as draft May 16, 2025 10:11
@jpinsonneau (Contributor, Author) commented:

Moving back to draft as it requires some changes after testing on a bigger cluster

@github-actions github-actions bot removed the ok-to-test label May 19, 2025
@jpinsonneau jpinsonneau added the ok-to-test label May 19, 2025

New images:

  • quay.io/netobserv/network-observability-operator:1646b12
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1646b12
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1646b12

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:1646b12 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-1646b12

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-1646b12
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test label May 19, 2025

@@ -1,13 +1,14 @@
 module github.com/netobserv/network-observability-operator

-go 1.23.0
+go 1.24.0
Member commented:

We should hold the bump to Go 1.24 until our downstream Go builder is ready; as far as I know it's still lagging behind.
Is the upgrade needed here?

@jpinsonneau (Contributor, Author) replied:

The k8s.io/client-go v0.33.1 update forced me to do so, but we can do a two-step update.
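
For illustration, a possible intermediate go.mod state for such a two-step update (versions here are assumptions and would need to be checked against the dependency graph): stay on Go 1.23 with the last client-go line that still supports it, then bump both together once the downstream builder handles 1.24.

module github.com/netobserv/network-observability-operator

go 1.23.0

require k8s.io/client-go v0.32.3 // hypothetical intermediate step; move to v0.33.x together with go 1.24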

Member replied:

Yes, I know; I delayed the k8s 0.33 bump for that reason too (dependabot and Konflux keep opening PRs about that).

@@ -1,7 +1,7 @@
 ARG BUILDVERSION

 # Build the manager binary
-FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:v1.23 as builder
+FROM brew.registry.redhat.io/rh-osbs/openshift-golang-builder:v1.24 as builder
Member commented:

I don't think that exists at the moment (?)

@jpinsonneau (Contributor, Author) replied:

How do you check that? 🤔
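
For what it's worth, one way to check which builder tags are published (assuming pull access to brew.registry.redhat.io) is to list them with skopeo:

skopeo list-tags docker://brew.registry.redhat.io/rh-osbs/openshift-golang-builder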

if err != nil {
	return err
}
copier.Copy(out, obj)
Member commented:

I wanted to check how this copier performs versus the old hack, and it seems it does pretty badly :(

Creating these benchmarks:

// Package and imports assumed for context: "copier" is presumed to be
// github.com/jinzhu/copier, and copyInto is the existing "old hack" helper.
package narrowcache

import (
	"fmt"
	"testing"

	"github.com/jinzhu/copier"
	corev1 "k8s.io/api/core/v1"
	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setup() corev1.ConfigMap {
	m := corev1.ConfigMap{
		ObjectMeta: v1.ObjectMeta{
			Labels: map[string]string{},
		},
		Data: map[string]string{},
	}
	for i := 0; i < 20; i++ {
		m.Labels[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
		m.Data[fmt.Sprintf("key-%d", i)] = fmt.Sprintf("value-%d", i)
	}
	return m
}

func Benchmark_Copier(b *testing.B) {
	in := setup()
	for i := 0; i < b.N; i++ {
		out := corev1.ConfigMap{}
		_ = copier.Copy(&out, &in)
	}
}

func Benchmark_OldHack(b *testing.B) {
	in := setup()
	for i := 0; i < b.N; i++ {
		out := corev1.ConfigMap{}
		_ = copyInto(&in, &out)
	}
}

Then, running them:

Benchmark_Copier-8    	   23798	     50697 ns/op	   19888 B/op	     529 allocs/op
Benchmark_OldHack-8   	 9953776	       130.8 ns/op	     288 B/op	       1 allocs/op

... which makes the hack almost 400 times faster!
That makes sense, I guess, since copier does a deep copy whereas the hack does some sort of reassignment trick, if I understand correctly.
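
For context, a minimal sketch of what such a reassignment-style copy could look like (an illustration only, not the actual copyInto from the repo):

package narrowcache

import (
	corev1 "k8s.io/api/core/v1"
)

// copyInto here is a hypothetical reassignment-style copy: the struct value is
// assigned as a whole, so maps and slices (Labels, Data, ManagedFields, ...)
// end up shared with the source rather than duplicated, which is why this
// approach is so much cheaper than a deep copy. The trade-off is that the
// result must be treated as read-only.
func copyInto(in, out *corev1.ConfigMap) error {
	*out = *in
	return nil
}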

@jpinsonneau (Contributor, Author) commented May 20, 2025:

Yes, the hack will always be faster, but I suspect it involves a memory leak in the end.

Member replied:

I just ran some pprof alongside a comparison between copier and the old hack, and couldn't find a memory leak. With no evidence of a leak, I think we should keep the hack, as it's more efficient. I'm currently running an NDH (node-density-heavy) test with your PR ( https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/65121/rehearse-65121-periodic-ci-openshift-eng-ocp-qe-perfscale-ci-netobserv-perf-tests-netobserv-aws-4.19-nightly-x86-node-density-heavy-25nodes/1925115161513824256 )

What we can do after that is roll back this single change (the copier) and run another NDH.
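
For reproducibility, a minimal sketch of exposing a heap profile endpoint through controller-runtime (assuming the operator does not already enable it; PprofBindAddress is a standard controller-runtime manager option, not something specific to this repo):

package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Expose /debug/pprof on :8082 so a heap profile can be captured with:
	//   go tool pprof http://localhost:8082/debug/pprof/heap
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		PprofBindAddress: ":8082",
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}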

@jotak (Member) commented May 20, 2025

> @jotak any idea on how to make the linter happy? https://github.com/netobserv/network-observability-operator/actions/runs/15115971335/job/42486600090?pr=1517

I must say I don't understand the error, but it seems related to bumping to Go 1.24, as I don't get it with 1.23.
So if we stick to 1.23, that should pass... until next time.

@jotak jotak added the ok-to-test label May 21, 2025

New images:

  • quay.io/netobserv/network-observability-operator:b7bf8de
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b7bf8de
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b7bf8de

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:b7bf8de make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-b7bf8de

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-b7bf8de
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@@ -138,10 +126,54 @@ func (c *Client) updateCache(ctx context.Context, key string, watcher watch.Inte
}

func (c *Client) setToCache(key string, obj runtime.Object) error {
	// clean up unnecessary fields
Member commented:

Could you move this part into the GVKInfo struct, e.g. as a new Cleanup callback? The narrowcache client is designed to let callers extend which types are managed by adding their own GVKInfo if needed, and this big switch kind of breaks that pattern.

Also, I think emptying the managed fields (SetManagedFields) could be done regardless of the type, since they live in ObjectMeta.
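
For illustration, a minimal sketch of that shape (type and field names are assumptions made for the example, not the actual narrowcache API):

package narrowcache

import (
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// GVKInfo describes how one GroupVersionKind is managed by the narrow cache.
// Cleanup is an optional hook invoked before the object is stored, so each
// registered type can strip fields it does not need to keep in memory.
type GVKInfo struct {
	GVK     schema.GroupVersionKind
	Cleanup func(obj runtime.Object)
}

func stripForCache(obj runtime.Object, info *GVKInfo) {
	// Managed fields live in ObjectMeta, so they can be cleared for any type.
	if acc, err := meta.Accessor(obj); err == nil {
		acc.SetManagedFields(nil)
	}
	if info != nil && info.Cleanup != nil {
		info.Cleanup(obj)
	}
}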

@jpinsonneau (Contributor, Author) replied:

Done in 57a1cab.

I kept the types to have autocompletion in VS Code for future additions.

@jotak (Member) commented May 21, 2025

Just realized we don't track the operator metrics... and unfortunately the per-pod metrics that we do track are flawed when there are pod restarts (which seems to have happened here, e.g., as it shows several operator pods).
I've opened a PR to add the operator metrics to our tests: openshift-eng/ocp-qe-perfscale-ci#742

In the meantime, I've compared with another run that I found without pod restarts:
Previous run: RSS = 115979810.1
New run (this PR): RSS = 105546001.1

So that's a -9% RSS.
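
For reference, the relative change works out to (115979810.1 - 105546001.1) / 115979810.1 ≈ 0.090, i.e. roughly a 9% reduction in RSS.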

@jotak jotak added the needs-changes label and removed the needs-review label May 27, 2025
@github-actions github-actions bot removed the ok-to-test label Jun 2, 2025
@jpinsonneau jpinsonneau removed the needs-changes label Jun 2, 2025
@jpinsonneau jpinsonneau requested a review from jotak June 2, 2025 09:44