Metric timer stops publishing and doesn't recover #148

**What happened**: When kubelet returns a 401 or a 500, the scraper timer does not recover from the resulting exception and stops reporting Kubernetes metrics.

**What you expected to happen**: Metrics should keep publishing on schedule after an error occurs.

**How to reproduce it (as minimally and precisely as possible)**:

1. Stop kubelet on the Kubernetes node for a few seconds, then start it again: `systemctl stop kubelet && sleep 30 && systemctl start kubelet`
2. Check the splunk-metrics-splunk-kubernetes-metrics logs for errors where the exception is thrown and the timer is detached:

```
2024-09-10 03:09:27 +0000 [error]: #0 Unexpected error raised. Stopping the timer. title=:cadvisor_metric_scraper error_class=RestClient::Unauthorized error="401 Unauthorized"
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/abstract_response.rb:249:in `exception_with_response'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/abstract_response.rb:129:in `return!'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:836:in `process_result'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:743:in `block in transmit'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/ruby/net/http.rb:966:in `start'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:727:in `transmit'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:163:in `execute'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/rest-client-2.1.0/lib/restclient/request.rb:63:in `execute'
  2024-09-10 03:09:27 +0000 [error]: #0 /opt/app-root/src/gem/fluent-plugin-kubernetes-metrics-1.2.3/lib/fluent/plugin/in_kubernetes_metrics.rb:728:in `scrape_cadvisor_metrics'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/fluentd-1.15.3/lib/fluent/plugin_helper/timer.rb:80:in `on_timer'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run_once'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/fluentd-1.15.3/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start'
  2024-09-10 03:09:27 +0000 [error]: #0 /usr/share/gems/gems/fluentd-1.15.3/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2024-09-10 03:09:27 +0000 [error]: #0 Timer detached. title=:cadvisor_metric_scraper
```

**Anything else we need to know?**:
This issue is similar to https://github.com/splunk/splunk-connect-for-kubernetes/issues/493.
That issue was addressed by adding a health check on the pod, but the proper fix would be for the Fluentd plugin to recover from the exception raised by the HTTP client.
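
A minimal sketch of the kind of recovery suggested here, in plain Ruby. The log above shows the Fluentd timer helper stopping and detaching the timer once the callback raises, so the idea is to rescue inside the callback and let the next tick retry. `safe_scrape` and `FakeHttpError` are hypothetical names used only for illustration (the latter stands in for `RestClient::Unauthorized`); neither is part of fluent-plugin-kubernetes-metrics, and the loop merely simulates three timer ticks.

```ruby
# Hypothetical stand-in for an HTTP client error such as RestClient::Unauthorized.
class FakeHttpError < StandardError; end

# Hypothetical helper: runs one scrape and swallows StandardError so the
# caller (the timer callback) never sees an exception.
# Returns true on success, false when the scrape failed.
def safe_scrape(title, logger:)
  yield
  true
rescue StandardError => e
  logger.call("#{title}: scrape failed, retrying on next tick (#{e.class}: #{e.message})")
  false
end

# Simulate three timer ticks; kubelet rejects the second request.
logs = []
calls = 0
results = Array.new(3) do
  safe_scrape(:cadvisor_metric_scraper, logger: ->(msg) { logs << msg }) do
    calls += 1
    raise FakeHttpError, '401 Unauthorized' if calls == 2
  end
end
```

In the plugin itself, an equivalent rescue would presumably wrap the body of `scrape_cadvisor_metrics` (the plugin frame in the stack trace) and log through the plugin's logger. Which error classes to rescue, and whether to back off after repeated failures, is left open here.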

**Environment**:
- Kubernetes version (use `kubectl version`): v1.27.13
- Ruby version (use `ruby --version`): 2.6.10p210
- OS (e.g: `cat /etc/os-release`): RHEL 9.2
- Splunk version: splunk-connect-for-kubernetes 1.5.4 and fluent-plugin-kubernetes-metrics 1.2.3
- Others: 

