@gysel
Contributor

@gysel gysel commented Sep 29, 2025

Description

  • Add basic Prometheus metrics to track the state of the managed clusters.
  • Add flag to automatically rotate generated certificates

This PR could be combined with #1086 to rotate certificates automatically without any downtime.

Issues Resolved

Check List

  • Commits are signed per the DCO using --signoff
  • Unittest added for the new/changed functionality and all unit tests are successful
  • Customer-visible features documented
  • No linter warnings (make lint)

If CRDs are changed:

  • CRD YAMLs updated (make manifests) and also copied into the helm chart
  • Changes to CRDs documented

Please refer to the PR guidelines before submitting this pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Michael Gysel <[email protected]>
@gysel
Contributor Author

gysel commented Sep 29, 2025

While I'm planning to deploy the current version of the code into production for our use case, a bit of polishing is likely needed before this can be merged.

Some of my open questions are:

  • Should I also expose the CA cert in opensearch_tls_certificate_remaining_days?
  • Is rotateDaysBeforeExpiry a good name for the flag? Should I enable it by default?
  • I'm a bit lost with unit tests. Do you think it makes sense to add tests for the certificate renewal?

What else did I miss that I should look into?

@gysel
Contributor Author

gysel commented Sep 29, 2025

Note to myself: I need to review the "TLS" section in the user guide.

@synhershko
Collaborator

Possibly related: #1086

@gysel
Contributor Author

gysel commented Sep 30, 2025

Possibly related: #1086

It is definitely related! I'm looking forward to using the hot reloading together with these enhancements.

}

// Generate node cert, sign it and put it into secret
nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace)
Contributor

@josedev-union josedev-union Oct 5, 2025

Suggested change
- nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace)
+ nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace)
+ if !k8serrors.IsNotFound(err) {
+ 	r.logger.Error(err, "Failed to get secret for transport certificate")
+ 	return err
+ }

Contributor Author

Fixed, see e86a879. But I had to add an err != nil && as otherwise the happy-path wouldn't work anymore.

Comment on lines 301 to 308
if certRenewal {
	nodeSecret.Data = nodeCert.SecretData(ca)
} else {
	nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)}
	if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil {
		return err
	}
}
Contributor

We can construct the same corev1.Secret regardless of certRenewal.

Suggested change
- if certRenewal {
- 	nodeSecret.Data = nodeCert.SecretData(ca)
- } else {
- 	nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)}
- 	if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil {
- 		return err
- 	}
- }
+ nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)}
+ if !certRenewal {
+ 	if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil {
+ 		return err
+ 	}
+ }

Contributor Author

Your version is a lot simpler but I need to test whether this correctly retains the ownerReferences on the secret set by .SetControllerReference().

Contributor Author

@gysel gysel Oct 13, 2025

Your code does not retain the owner references in the metadata in a renewal. I now only create a new secret when the certificate is created initially. See 47f02fd
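The difference between the two approaches can be illustrated with a simplified model. This is only a sketch: `secret` is a hypothetical stand-in for `corev1.Secret`, and `Owners` stands in for `metadata.ownerReferences`. Whether lost references actually reach the cluster depends on how the client applies the object, which this sketch does not model.

```go
package main

import "fmt"

// secret is a hypothetical, simplified stand-in for corev1.Secret.
type secret struct {
	Name   string
	Owners []string // stands in for metadata.ownerReferences
	Data   map[string][]byte
}

// renewInPlace replaces only the data of the fetched secret,
// so its existing metadata (including owners) is kept.
func renewInPlace(existing secret, data map[string][]byte) secret {
	existing.Data = data
	return existing
}

// rebuild constructs a fresh secret from scratch. Unless the owners are set
// again afterwards, the new object carries none.
func rebuild(existing secret, data map[string][]byte) secret {
	return secret{Name: existing.Name, Data: data}
}

func main() {
	existing := secret{Name: "node-cert", Owners: []string{"opensearch-cluster"}}
	newData := map[string][]byte{"tls.crt": []byte("renewed")}

	fmt.Println(len(renewInPlace(existing, newData).Owners)) // owners kept
	fmt.Println(len(rebuild(existing, newData).Owners))      // owners dropped
}
```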

}
}

_, err = r.client.CreateSecret(&nodeSecret)
Contributor

UpdateSecret is needed for the renewal case, isn't it?

Contributor Author

@gysel gysel Oct 10, 2025

No, .CreateSecret() internally uses .ReconcileResource() which handles both cases.

Contributor

Ok, it uses reconciler.StatePresent which handles both cases.
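The "handles both cases" behaviour discussed above amounts to create-or-update semantics. The sketch below shows the idea with an in-memory map. It is not the operator's or the reconciler library's implementation: `store` and `ensurePresent` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// store is a hypothetical in-memory stand-in for the cluster's secret storage.
type store map[string]map[string][]byte

// ensurePresent creates the entry when it is missing and overwrites its data
// when it already exists, reporting which of the two happened.
func (s store) ensurePresent(name string, data map[string][]byte) string {
	_, exists := s[name]
	s[name] = data
	if exists {
		return "updated"
	}
	return "created"
}

func main() {
	s := store{}
	// Initial issuance: the secret does not exist yet.
	fmt.Println(s.ensurePresent("node-cert", map[string][]byte{"tls.crt": []byte("v1")}))
	// Renewal: the same call updates the existing secret.
	fmt.Println(s.ensurePresent("node-cert", map[string][]byte{"tls.crt": []byte("v2")}))
}
```

With this shape, the caller does not need a separate update path for renewals.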

}

// Generate node cert, sign it and put it into secret
nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace)
Contributor

Only a Not Found error is acceptable for continuing the reconciliation. I'd check the error type earlier and return err for any error other than Not Found, as in the change I suggested above.

Contributor Author

Fixed, see e86a879.

Comment on lines +6324 to +6328
rotateDaysBeforeExpiry:
  default: -1
  description: Automatically rotate certificates before
    they expire, set to -1 to disable
  type: integer
Contributor

I'd prefer enabling this by default. Are there any concerns?

Contributor Author

Good point - I don't have a strong opinion here. I will definitely enable this in all our deployments. Maybe we could leave it disabled by default for now and enable it by default with the v3.0.0 update?

But that should probably be a decision made by the maintainers.

Contributor

I assume this change will be merged to main and released as v3.0.
@synhershko wdyt?

Comment on lines 558 to 566
if certRenewal {
	nodeSecret.Data = nodeCert.SecretData(ca)
} else {
	nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)}
	if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil {
		return err
	}
}
_, err = r.client.CreateSecret(&nodeSecret)
Contributor

Same here.
I'd use UpdateSecret for the renewal case.
We can also reuse the nodeSecret construction from the change I suggested above.

Contributor Author

CreateSecret should be fine. See my comment above.

@gysel gysel requested a review from josedev-union October 13, 2025 09:35
Comment on lines 27 to 58
ClusterHealth = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "opensearch_cluster_health",
		Help: "Health status of the cluster. 0=red, 1=yellow, 2=green, -1=unknown",
	}, []string{
		"namespace", "opensearch_cluster",
	})
ActiveShards = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "opensearch_cluster_shards_active",
		Help: "The number of active primary and replica shards.",
	}, []string{
		"namespace", "opensearch_cluster",
	})
RelocatingShards = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "opensearch_cluster_shards_relocating",
		Help: "The number of shards that are currently relocating.",
	}, []string{
		"namespace", "opensearch_cluster",
	})
InitializingShards = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "opensearch_cluster_shards_initializing",
		Help: "The number of shards that are currently initializing.",
	}, []string{
		"namespace", "opensearch_cluster",
	})
UnassignedShards = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "opensearch_cluster_shards_unassigned",
		Help: "The number of shards that are currently unassigned.",

These metrics

  • opensearch_cluster_health
  • opensearch_cluster_shards_active
  • opensearch_cluster_shards_relocating
  • opensearch_cluster_shards_initializing
  • opensearch_cluster_shards_unassigned

are already available in the opensearch-prometheus-exporter plugin.
(opensearch_cluster_shards_number{cluster="$cluster",type="$shard_type"}
and opensearch_cluster_status{cluster="$cluster"})

opensearch_tls_certificate_remaining_days and opensearch_cluster_info custom metrics will be useful.

To help verify this, I exported the metrics from my test cluster, which runs OpenSearch 3.2.0 with metrics enabled:

$ curl -k https://localhost:9200/_prometheus/metrics |grep opensearch_cluster_shards_number
# TYPE opensearch_cluster_shards_number gauge
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="unassigned",} 0.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="active",} 159.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="relocating",} 0.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="initializing",} 0.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="active_primary",} 69.0

and

$ curl -k https://localhost:9200/_prometheus/metrics |grep cluster_status
opensearch_cluster_status{cluster="opensearch-cluster",} 0.0

# Note: 0 - Green, 1 - Yellow, 2 - Red

Contributor Author

Good points! Thanks for the feedback.

These metrics [...] are already available in opensearch-prometheus-exporter plugin.

I tried getting the Prometheus exporter plugin up and running (without the Prometheus operator) and failed to do so as I did not want to use metrics scraping with username/password authentication.
I think that a small set of basic cluster metrics in the Operator makes sense. But the full suite of metrics should definitely be exposed by the exporter.

I'll rename the operator metrics to be clearly distinct from the metrics exposed by the exporter.

Contributor Author

I refactored the metrics in 49987b4 and c83d04a.

gysel added 2 commits October 15, 2025 09:05
…'t have name conflicts with the opensearch-prometheus-exporter plugin

Switch health codes to be aligned with the exporter codes (0=green, 1=yellow, 2=red, -1=unknown)

Signed-off-by: Michael Gysel <[email protected]>
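The health codes named in the commit message above (0=green, 1=yellow, 2=red, -1=unknown) can be expressed as a small mapping. This is a sketch only: `healthCode` is a hypothetical helper, and the case-insensitive matching of the status string is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// healthCode maps a cluster health status to the gauge value used by the
// exporter-aligned convention: 0=green, 1=yellow, 2=red, -1=unknown.
func healthCode(status string) float64 {
	switch strings.ToLower(status) {
	case "green":
		return 0
	case "yellow":
		return 1
	case "red":
		return 2
	default:
		// Anything else, including an empty status, counts as unknown.
		return -1
	}
}

func main() {
	for _, s := range []string{"green", "yellow", "red", ""} {
		fmt.Printf("%q -> %v\n", s, healthCode(s))
	}
}
```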
Contributor

@josedev-union josedev-union left a comment

Please resolve the conflicts
