-
Couldn't load subscription status.
- Fork 293
Add Prometheus metrics and certificate rotation #1112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Michael Gysel <[email protected]>
Signed-off-by: Michael Gysel <[email protected]>
Signed-off-by: Michael Gysel <[email protected]>
Signed-off-by: Michael Gysel <[email protected]>
|
While I'm planning to deploy the current version of the code into pro for our use-case, a bit of polishing is likely needed before this can be merged. Some of my open questions are:
What else did I miss and should look into? |
Signed-off-by: Michael Gysel <[email protected]>
Signed-off-by: Michael Gysel <[email protected]>
|
Note to myself: I need to review the "TLS" section in the user guide. |
|
Possibly related: #1086 |
It is definitely related! I'm looking forward to use the hot reloading together with these enhancements. |
| } | ||
|
|
||
| // Generate node cert, sign it and put it into secret | ||
| nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace) | |
| nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace) | |
| if !k8serrors.IsNotFound(err) { | |
| r.logger.Error(err, "Failed to get secret for transport certificate") | |
| return err | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, see e86a879. But I had to add an err != nil && as otherwise the happy-path wouldn't work anymore.
| if certRenewal { | ||
| nodeSecret.Data = nodeCert.SecretData(ca) | ||
| } else { | ||
| nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)} | ||
| if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil { | ||
| return err | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use the same set corev1.Secret regardless of certRenewal
| if certRenewal { | |
| nodeSecret.Data = nodeCert.SecretData(ca) | |
| } else { | |
| nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)} | |
| if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil { | |
| return err | |
| } | |
| } | |
| nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)} | |
| if !certRenewal { | |
| if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil { | |
| return err | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your version is a lot simpler but I need to test whether this correctly retains the ownerReferences on the secret set by .SetControllerReference().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your code does not retain the owner references in the metadata in a renewal. I now only create a new secret when the certificate is created initially. See 47f02fd
| } | ||
| } | ||
|
|
||
| _, err = r.client.CreateSecret(&nodeSecret) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UpdateSecret is needed for renewal case, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, .CreateSecret() internally uses .ReconcileResource() which handles both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it uses reconciler.StatePresent which handles both cases.
| } | ||
|
|
||
| // Generate node cert, sign it and put it into secret | ||
| nodeSecret, err := r.client.GetSecret(nodeSecretName, namespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only Not Found error is acceptable for going foward the reconciling. I'd check error type earlier and return err if it is another error type than Not Found as I suggested the change above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, see e86a879.
| rotateDaysBeforeExpiry: | ||
| default: -1 | ||
| description: Automatically rotate certificates before | ||
| they expire, set to -1 to disable | ||
| type: integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer enabling this in default. Is there any concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - I don't have a strong opinion here. I will definitely enable this in all our deployment. Maybe we could leave it disabled by default for now and enable it by default with the v3.0.0 update?
But that should probably be a decision made by the maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this change will be merged to main and released as v3.0.
@synhershko wdyt?
| if certRenewal { | ||
| nodeSecret.Data = nodeCert.SecretData(ca) | ||
| } else { | ||
| nodeSecret = corev1.Secret{ObjectMeta: metav1.ObjectMeta{Name: nodeSecretName, Namespace: namespace}, Type: corev1.SecretTypeTLS, Data: nodeCert.SecretData(ca)} | ||
| if err := ctrl.SetControllerReference(r.instance, &nodeSecret, r.client.Scheme()); err != nil { | ||
| return err | ||
| } | ||
| } | ||
| _, err = r.client.CreateSecret(&nodeSecret) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
I'd use UpdateSecret for renewal case.
Also we can use the same code snippet for setting nodeSecret as I suggested changes above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateSecret should be fine. See my comment above.
Signed-off-by: Michael Gysel <[email protected]>
Signed-off-by: Michael Gysel <[email protected]>
Signed-off-by: Michael Gysel <[email protected]>
| ClusterHealth = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: "opensearch_cluster_health", | ||
| Help: "Health status of the cluster. 0=red, 1=yellow, 2=green, -1=unknown", | ||
| }, []string{ | ||
| "namespace", "opensearch_cluster", | ||
| }) | ||
| ActiveShards = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: "opensearch_cluster_shards_active", | ||
| Help: "The number of active primary and replica shards.", | ||
| }, []string{ | ||
| "namespace", "opensearch_cluster", | ||
| }) | ||
| RelocatingShards = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: "opensearch_cluster_shards_relocating", | ||
| Help: "The number of shards that are currently relocating.", | ||
| }, []string{ | ||
| "namespace", "opensearch_cluster", | ||
| }) | ||
| InitializingShards = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: "opensearch_cluster_shards_initializing", | ||
| Help: "The number of shards that are currently initializing.", | ||
| }, []string{ | ||
| "namespace", "opensearch_cluster", | ||
| }) | ||
| UnassignedShards = prometheus.NewGaugeVec( | ||
| prometheus.GaugeOpts{ | ||
| Name: "opensearch_cluster_shards_unassigned", | ||
| Help: "The number of shards that are currently unassigned.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These metrics
opensearch_cluster_healthopensearch_cluster_shards_activeopensearch_cluster_shards_relocatingopensearch_cluster_shards_initializingopensearch_cluster_shards_unassigned
are already available in opensearch-prometheus-exporter plugin.
(opensearch_cluster_shards_number{cluster="$cluster",type="$shard_type"}
and opensearch_cluster_status{cluster="$cluster"})
opensearch_tls_certificate_remaining_days and opensearch_cluster_info custom metrics will be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To help verifying I just exported the metrics of my current test cluster running OpenSearch 3.2.0 having metrics enabled:
$ curl -k https://localhost:9200/_prometheus/metrics |grep opensearch_cluster_shards_number
# TYPE opensearch_cluster_shards_number gauge
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="unassigned",} 0.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="active",} 159.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="relocating",} 0.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="initializing",} 0.0
opensearch_cluster_shards_number{cluster="opensearch-cluster",type="active_primary",} 69.0and
$ curl -k https://localhost:9200/_prometheus/metrics |grep cluster_status
opensearch_cluster_status{cluster="opensearch-cluster",} 0.0
# Note: 0 - Green, 1 - Yellow, 2 - RedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points! Thanks for the feedback.
These metrics [...] are already available in opensearch-prometheus-exporter plugin.
I tried getting the Prometheus exporter plugin up and running (without the Prometheus operator) and failed to do so as I did not want to use metrics scraping with username/password authentication.
I think that a small set of basic cluster metrics in the Operator makes sense. But the full suite of metrics should definitely be exposed by the exporter.
I'll rename the operator metrics to be clearly distinct from the metrics exposed by the exporter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…'t have name conflicts with the opensearch-prometheus-exporter plugin Switch health codes to be aligned with the exporter codes (0=green, 1=yellow, 2=red, -1=unknown) Signed-off-by: Michael Gysel <[email protected]>
Signed-off-by: Michael Gysel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please resolve the conflicts
Description
This MR could be combined with #1086 to automatically rotate certs without any downtime.
Issues Resolved
Check List
make lint)If CRDs are changed:
make manifests) and also copied into the helm chartPlease refer to the PR guidelines before submitting this pull request.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.