Merged
@@ -5708,7 +5708,7 @@ spec:
x-kubernetes-int-or-string: true
type: object
persistence:
description: PersistencConfig defines options for data persistence
description: PersistenceConfig defines options for data persistence
properties:
emptyDir:
description: |-
@@ -6160,8 +6160,8 @@ spec:
type: object
x-kubernetes-map-type: atomic
securityConfigSecret:
description: Secret that contains the differnt yml files of
the opensearch-security config (config.yml, internal_users.yml,
description: Secret that contains the different yml files
of the opensearch-security config (config.yml, internal_users.yml,
...)
properties:
name:
@@ -6269,6 +6269,11 @@ spec:
a CA and certificates for the cluster to use, if false
secrets with existing certificates must be supplied
type: boolean
rotateDaysBeforeExpiry:
default: -1
description: Automatically rotate certificates before
they expire, set to -1 to disable
type: integer
secret:
description: Optional, name of a TLS secret that contains
ca.crt, tls.key and tls.crt data. If ca.crt is in a
@@ -6326,6 +6331,11 @@ spec:
perNode:
description: Configure transport node certificate
type: boolean
rotateDaysBeforeExpiry:
default: -1
description: Automatically rotate certificates before
they expire, set to -1 to disable
type: integer
secret:
description: Optional, name of a TLS secret that contains
ca.crt, tls.key and tls.crt data. If ca.crt is in a
@@ -48,7 +48,7 @@ spec:
{{- end}}
- args:
- --health-probe-bind-address=:8081
- --metrics-bind-address=127.0.0.1:8080
- --metrics-bind-address={{ .Values.manager.metricsBindAddress }}
- --leader-elect
{{- if .Values.manager.watchNamespace }}
- --watch-namespace={{ .Values.manager.watchNamespace }}
2 changes: 2 additions & 0 deletions charts/opensearch-operator/values.yaml
@@ -65,6 +65,8 @@ manager:
# watch objects in the desired namespace. Defaults is to watch all namespaces.
watchNamespace:

metricsBindAddress: 127.0.0.1:8080

# Install the Custom Resource Definitions with Helm
installCRDs: true
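The chart now renders `manager.metricsBindAddress` into the manager's `--metrics-bind-address` flag, keeping `127.0.0.1:8080` as the default. A loopback address is only reachable from inside the Pod's own network namespace (for example from a sidecar proxy), so a Prometheus instance running elsewhere cannot scrape it unless the value is changed to something like `:8080`. The Go sketch below is a hypothetical helper (`loopbackOnly` is not part of the operator) that classifies a bind address using only the standard library; treating the literal host name `localhost` as loopback without a DNS lookup is an assumption.

```go
package main

import (
	"fmt"
	"net"
)

// loopbackOnly is a HYPOTHETICAL helper: it reports whether a bind address,
// such as the value passed to --metrics-bind-address, only accepts
// connections from the local network namespace.
func loopbackOnly(addr string) (bool, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false, err // e.g. "8080" without a colon
	}
	if host == "" {
		return false, nil // ":8080" listens on all interfaces
	}
	if host == "localhost" {
		return true, nil // assumption: no DNS lookup is performed
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false, fmt.Errorf("cannot classify host %q without a DNS lookup", host)
	}
	return ip.IsLoopback(), nil
}

func main() {
	for _, addr := range []string{"127.0.0.1:8080", ":8080", "0.0.0.0:8080"} {
		ok, err := loopbackOnly(addr)
		fmt.Println(addr, "loopback-only:", ok, "error:", err)
	}
}
```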

57 changes: 37 additions & 20 deletions docs/designs/monitoring.md
@@ -11,23 +11,23 @@
- [Task list](#task-list)

## Goals
This component has 2 goals:
This component has two goals:
* Monitor OpenSearch clusters managed by the operator and the Operator itself.
* Supplying a getting started Monitoring and Alerting solution, allowing the user to immediately view the metrics from the first goal and have built in alerts from them
* Supplying a getting-started Monitoring and Alerting solution, allowing the user to immediately view the metrics from the first goal and have built-in alerts from them

## Overview

### OpenSearch Monitoring
The most popular open source monitoring format today is Prometheus. The latter works by periodically pulling a http endpoint containing the metrics and their values in a format called [Prometheus Text Based Exposition format](https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md).
OpenSearch doesn't contain a native Prometheus text format endpoint (yet), there for the OpenSearch Operator will install an OpenSearch Plugin called [opensearch-prometheus-exporter](https://github.com/aparo/opensearch-prometheus-exporter) which exposes per node its Node Metrics and on a specific node it will also exposes Cluster and Indices metrics.
The most popular open source monitoring format today is Prometheus. The latter works by periodically pulling an HTTP endpoint containing the metrics and their values in a format called [Prometheus Text Based Exposition format](https://github.com/prometheus/docs/blob/main/content/docs/instrumenting/exposition_formats.md).
OpenSearch doesn't contain a native Prometheus text format endpoint (yet), so the OpenSearch Operator will install an OpenSearch Plugin called [opensearch-prometheus-exporter](https://github.com/aparo/opensearch-prometheus-exporter), which exposes Node metrics on every node and, on one specific node, also Cluster and Indices metrics.

### Operator Monitoring
The operator is a Go-based process, which will expose its metrics (per component) using the Prometheus Text Based Exposition format on default endpoint.

### Getting Started Monitoring and Alerting solution
The most popular open source solution for monitoring and alerting today is Prometheus - serving as the time-series database the metrics will be written to and read from (queried) and as the alerting engine executing alerts. The most popular solution for viewing metrics on dashboards is Grafana. The OpenSearch operator will install them both and more:
The most popular open source solution for monitoring and alerting today is Prometheus, serving both as the time-series database that the metrics are written to and read (queried) from, and as the alerting engine that evaluates alerts. The most popular solution for viewing metrics on dashboards is Grafana. The OpenSearch operator will install them both and more:
* Prometheus: Will pull metrics from OpenSearch and OpenSearch Operator and write them into its own time series database.
* Alert Manager: A component from Prometheus project which handle de-dup and notifications.
* Alert Manager: A component from the Prometheus project which handles deduplication and notifications.
* Grafana: the UI for viewing the metrics, will come with pre-installed dashboards for those metrics.

Prometheus will be managed by Prometheus Operator, which will be installed by the helm charts of OpenSearch Operator.
@@ -86,7 +86,9 @@ Prometheus will be managed by Prometheus Operator, which will be installed by th

### OpenSearch Controller Metrics
#### Default Go Metrics

Default prometheus metrics sent by the go app. Here is a list of these metrics:

```
go_gc_duration_seconds
go_gc_duration_seconds_sum
@@ -123,33 +125,48 @@ promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
```

The simple way to enable them is described here: [Instrumenting a go application for prometheus](https://prometheus.io/docs/guides/go-application/).

These metrics will be collected from the controller.

#### Custom Metrics

The third group of metrics, which could be collected from the controller. Suggested metrics:
| Metric | Description |
| ------------- | ------------- |
| os_restart_total | Number of times a node has restarted |
|os_cluster_management_state_info | Management state used by the cluster |
| os_storage_info | Number of nodes using emphimeral or persistent storage |
| os_redundancy_policy_info | Redundancy policy used by the cluster |
| os_index_retention_seconds | Number of seconds that documents are | retained per policy operation |
| os_defined_delete_namespaces_total | Number of defined namespaces deleted per index policy |
| os_misconfigured_memory_resources_info | Number of nodes with misconfigured memory resources |
The third group of metrics is collected from the controller.

Implemented metrics:

Note: for `opensearch_operator_cluster_health`, 0 = green, 1 = yellow, 2 = red, and -1 = unknown.

| Metric                                                     | Description                                                                                                                       |
|------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| opensearch_operator_cluster_info | An info metric containing the cluster name, namespace, and version. |
| opensearch_operator_cluster_tls_certificate_remaining_days | Days until the TLS certificate expires. |
| opensearch_operator_cluster_health | Health status of the cluster. The value can be 0=green, 1=yellow, 2=red, or -1=unknown. |
| opensearch_operator_cluster_shards | The number of shards in the cluster by status. The `status` label can be `active`, `relocating`, `initializing`, or `unassigned`. |

Suggested future metrics:

| Metric | Description |
|------------------------------------------------|--------------------------------------------------------------------|
| opensearch_restart_total | Number of times a node has restarted |
| opensearch_cluster_management_state_info | Management state used by the cluster |
| opensearch_storage_info | Number of nodes using ephemeral or persistent storage |
| opensearch_redundancy_policy_info | Redundancy policy used by the cluster |
| opensearch_index_retention_seconds | Number of seconds that documents are retained per policy operation |
| opensearch_defined_delete_namespaces_total | Number of defined namespaces deleted per index policy |
| opensearch_misconfigured_memory_resources_info | Number of nodes with misconfigured memory resources |

### OpenSearch Node Metrics

The [opensearch-prometheus-exporter](https://github.com/aparo/opensearch-prometheus-exporter) plugin includes metrics for each Node, and also Cluster level metrics and Index level metrics.

## Task list

- [ ] Default metrics publication
- [ ] Prometheus-plugin integration
- [ ] Opensearch controller metrics publication
- [ ] Deployment of the service-monitor for every clusters
- [x] Default metrics publication
- [x] Prometheus-plugin integration
- [x] Opensearch controller metrics publication
- [ ] Deployment of the service-monitor for every cluster
- [ ] Deployment of the Prometheus-operator
- [ ] Prometheus-operator integration
- [ ] Grafana dashboards development
2 changes: 1 addition & 1 deletion opensearch-operator/Dockerfile
@@ -1,5 +1,5 @@
# Build the manager binary
FROM --platform=$BUILDPLATFORM golang:1.24.4 as builder
FROM --platform=$BUILDPLATFORM golang:1.24.4 AS builder

WORKDIR /workspace
# Copy the Go Modules manifests
16 changes: 12 additions & 4 deletions opensearch-operator/api/v1/opensearch_types.go
@@ -134,7 +134,7 @@ type NodePool struct {
InitContainers []corev1.Container `json:"initContainers,omitempty"`
}

// PersistencConfig defines options for data persistence
// PersistenceConfig defines options for data persistence
type PersistenceConfig struct {
PersistenceSource `json:","`
}
@@ -248,7 +248,11 @@ type TlsConfigTransport struct {
// If set to true the operator will generate a CA and certificates for the cluster to use, if false secrets with existing certificates must be supplied
Generate bool `json:"generate,omitempty"`
// Configure transport node certificate
PerNode bool `json:"perNode,omitempty"`
PerNode bool `json:"perNode,omitempty"`
// Automatically rotate certificates before they expire, set to -1 to disable
//+kubebuilder:default=-1
RotateDaysBeforeExpiry int `json:"rotateDaysBeforeExpiry,omitempty"`
//
TlsCertificateConfig `json:",omitempty"`
// Allowed Certificate DNs for nodes, only used when existing certificates are provided
NodesDn []string `json:"nodesDn,omitempty"`
@@ -258,7 +262,11 @@ type TlsConfigTransport struct {

type TlsConfigHttp struct {
// If set to true the operator will generate a CA and certificates for the cluster to use, if false secrets with existing certificates must be supplied
Generate bool `json:"generate,omitempty"`
Generate bool `json:"generate,omitempty"`
// Automatically rotate certificates before they expire, set to -1 to disable
//+kubebuilder:default=-1
RotateDaysBeforeExpiry int `json:"rotateDaysBeforeExpiry,omitempty"`
//
TlsCertificateConfig `json:",omitempty"`
}

@@ -276,7 +284,7 @@ type TlsSecret struct {
}

type SecurityConfig struct {
// Secret that contains the differnt yml files of the opensearch-security config (config.yml, internal_users.yml, ...)
// Secret that contains the different yml files of the opensearch-security config (config.yml, internal_users.yml, ...)
SecurityconfigSecret corev1.LocalObjectReference `json:"securityConfigSecret,omitempty"`
// TLS Secret that contains a client certificate (tls.key, tls.crt, ca.crt) with admin rights in the opensearch cluster. Must be set if transport certificates are provided by user and not generated
AdminSecret corev1.LocalObjectReference `json:"adminSecret,omitempty"`
@@ -5708,7 +5708,7 @@ spec:
x-kubernetes-int-or-string: true
type: object
persistence:
description: PersistencConfig defines options for data persistence
description: PersistenceConfig defines options for data persistence
properties:
emptyDir:
description: |-
@@ -6160,8 +6160,8 @@ spec:
type: object
x-kubernetes-map-type: atomic
securityConfigSecret:
description: Secret that contains the differnt yml files of
the opensearch-security config (config.yml, internal_users.yml,
description: Secret that contains the different yml files
of the opensearch-security config (config.yml, internal_users.yml,
...)
properties:
name:
@@ -6269,6 +6269,11 @@ spec:
a CA and certificates for the cluster to use, if false
secrets with existing certificates must be supplied
type: boolean
rotateDaysBeforeExpiry:
default: -1
description: Automatically rotate certificates before
they expire, set to -1 to disable
type: integer
secret:
description: Optional, name of a TLS secret that contains
ca.crt, tls.key and tls.crt data. If ca.crt is in a
@@ -6326,6 +6331,11 @@ spec:
perNode:
description: Configure transport node certificate
type: boolean
rotateDaysBeforeExpiry:
default: -1
description: Automatically rotate certificates before
they expire, set to -1 to disable
type: integer
secret:
description: Optional, name of a TLS secret that contains
ca.crt, tls.key and tls.crt data. If ca.crt is in a
2 changes: 2 additions & 0 deletions opensearch-operator/controllers/opensearchController.go
@@ -127,6 +127,8 @@ func (r *OpenSearchClusterReconciler) Reconcile(ctx context.Context, req ctrl.Re
if err != nil {
return ctrl.Result{}, err
}

helpers.DeleteClusterMetrics(req.Namespace, r.Instance.Name)
}
return ctrl.Result{}, nil
}
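The new `helpers.DeleteClusterMetrics(req.Namespace, r.Instance.Name)` call removes a cluster's metrics, presumably once the cluster itself is gone; without such cleanup, its labelled series would keep being exported with stale values. The standard-library sketch below imitates that cleanup with a toy, map-backed gauge vector. `seriesKey`, `gaugeVec` and `deleteCluster` are hypothetical; a real implementation would more likely delete series from `client_golang` metric vectors (for example with `DeleteLabelValues`, or `DeletePartialMatch` in newer releases).

```go
package main

import (
	"fmt"
	"sort"
)

// seriesKey identifies one exported series of a hypothetical per-cluster gauge.
type seriesKey struct {
	Namespace, Cluster, Status string
}

// gaugeVec is a toy stand-in for a Prometheus gauge vector labelled by
// namespace, cluster and status.
type gaugeVec map[seriesKey]float64

func (g gaugeVec) set(ns, cluster, status string, v float64) {
	g[seriesKey{ns, cluster, status}] = v
}

// deleteCluster drops every series of one cluster, whatever its other labels,
// and returns how many series were removed.
func (g gaugeVec) deleteCluster(ns, cluster string) int {
	removed := 0
	for k := range g {
		if k.Namespace == ns && k.Cluster == cluster {
			delete(g, k) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	shards := gaugeVec{}
	shards.set("prod", "logs", "active", 10)
	shards.set("prod", "logs", "unassigned", 2)
	shards.set("prod", "metrics", "active", 4)

	fmt.Println("removed:", shards.deleteCluster("prod", "logs"))

	left := make([]string, 0, len(shards))
	for k := range shards {
		left = append(left, k.Namespace+"/"+k.Cluster+"/"+k.Status)
	}
	sort.Strings(left)
	fmt.Println("remaining:", left)
}
```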
3 changes: 3 additions & 0 deletions opensearch-operator/main.go
@@ -22,6 +22,7 @@ import (
"os"
"strconv"

"github.com/Opster/opensearch-k8s-operator/opensearch-operator/pkg/helpers"
"sigs.k8s.io/controller-runtime/pkg/cache"
metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"
@@ -135,6 +136,8 @@ func main() {
os.Exit(1)
}

helpers.RegisterMetrics()

if err = (&controllers.OpenSearchClusterReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),