[GDB-13346] Enhance CloudWatch monitoring: add new alarms and refactor existing ones #129

simonzhekoff · 2025-11-07T05:09:35Z

Description

This PR enhances the CloudWatch monitoring configuration and improves alarm coverage and consistency across all GraphDB nodes.

Key Changes
Refactored existing alarms:
• Added treat_missing_data = "notBreaching" to prevent false positives during deployment or downtime.
• Unified comparison_operator to GreaterThanOrEqualToThreshold for consistency.
Introduced new per-node alarms:
• disk_used_percent: triggers when root disk usage reaches or exceeds the threshold.
• mem_used_percent: monitors memory utilization per instance.
• cpu_utilization: tracks CPU usage individually for each node.
Updated CloudWatch Agent configuration:
• Added InstanceId as an appended dimension to support per-node metric granularity and accurate alarm mapping.
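
As a rough illustration of how the points above fit together, here is a minimal sketch of one per-node memory alarm. The variable names, alarm name, period, and default threshold are hypothetical and not taken from this PR. CWAgent is the CloudWatch agent's default namespace. The dimensions map must match exactly what the agent publishes, and this sketch assumes InstanceId is the only appended dimension.

```hcl
# Hypothetical inputs: these variable names are illustrative, not from this PR.
variable "instance_ids" {
  description = "IDs of the GraphDB EC2 instances to monitor"
  type        = set(string)
}

variable "memory_threshold_percent" {
  description = "Memory utilization (percent) at or above which the alarm fires"
  type        = number
  default     = 90
}

variable "sns_topic_arn" {
  description = "SNS topic that receives ALARM and OK notifications"
  type        = string
}

# One alarm per node, keyed by instance ID.
resource "aws_cloudwatch_metric_alarm" "memory_per_node" {
  for_each = var.instance_ids

  alarm_name          = "graphdb-memory-${each.key}"
  namespace           = "CWAgent"
  metric_name         = "mem_used_percent"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = var.memory_threshold_percent
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.sns_topic_arn]
  ok_actions          = [var.sns_topic_arn]

  # Assumes the agent appends only InstanceId; add any other appended
  # dimensions here, otherwise the alarm will not find the metric.
  dimensions = {
    InstanceId = each.key
  }
}
```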

Related Issues

[GDB-13446]

Changes

Added support for monitoring the mem_used_percent metric via a per-node CloudWatch alarm and dashboard.
Added support for monitoring disk_used_percent for the root EBS volume via a per-node CloudWatch alarm.
Added ok_actions so that a notification is also sent when an alarm transitions from the ALARM state to the OK state.
Changed the comparison operator to GreaterThanOrEqualToThreshold for all alarms.
Added the treat_missing_data option to all alarms and set it to notBreaching.

Checklist

I have tested these changes thoroughly.
My code follows the project's coding style.
I have added appropriate comments to my code, especially in complex areas.
All new and existing tests passed locally.

viktor-ribchev · 2025-11-07T13:13:08Z

modules/monitoring/alarms.tf

  threshold           = "0"
  alarm_actions       = [aws_sns_topic.graphdb_sns_topic.arn]
+  ok_actions          = [aws_sns_topic.graphdb_sns_topic.arn]
+  treat_missing_data  = "notBreaching"


Why notBreaching? If the agent is not sending data, for example because the VM has exhausted its available memory, then the alarm should trigger.
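
For context, CloudWatch accepts four values for treat_missing_data: missing, ignore, breaching, and notBreaching. A minimal sketch of the behavior this comment asks for could look like the following. The resource name, alarm name, threshold, periods, and instance ID are placeholders, not values from this PR.

```hcl
# Hypothetical alarm, not taken from this PR: all names and values are placeholders.
resource "aws_cloudwatch_metric_alarm" "memory_or_silent_agent" {
  alarm_name          = "graphdb-memory-or-silent-agent"
  namespace           = "CWAgent"
  metric_name         = "mem_used_percent"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 90

  # Missing datapoints are treated as threshold violations, so an agent
  # that stops reporting eventually moves the alarm to ALARM.
  treat_missing_data = "breaching"

  dimensions = {
    InstanceId = "i-0123456789abcdef0" # placeholder instance ID
  }
}
```

The trade-off is that "breaching" also fires while a node is intentionally down, which is the false positive during deployment or downtime that the PR description sets out to avoid.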

viktor-ribchev · 2025-11-07T13:14:16Z

modules/monitoring/alarms.tf

+    ImageId              = data.aws_instance.by_id[each.key].ami
+    InstanceType         = data.aws_instance.by_id[each.key].instance_type
+    path                 = "/"
+    device               = "nvme0n1p1"


Is the device always going to be nvme0n1p1?

simonzhekoff · 2025-11-10

For Nitro-based instances, yes. I will look for a way to detect the right device automatically.
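
One possible direction, which is not something this PR implements: the CloudWatch agent's disk section supports a drop_device option that omits the device dimension from disk metrics, so the alarm would no longer need a device name. The sketch below assumes the agent configuration is rendered from Terraform with jsonencode. The local name and the overall layout are hypothetical, and only the relevant keys are shown.

```hcl
locals {
  # Hypothetical agent configuration fragment; not the configuration from this PR.
  cloudwatch_agent_config = jsonencode({
    metrics = {
      append_dimensions = {
        # "$${" escapes Terraform interpolation, so the agent receives
        # the literal placeholder "${aws:InstanceId}".
        InstanceId = "$${aws:InstanceId}"
      }
      metrics_collected = {
        disk = {
          measurement = ["used_percent"]
          resources   = ["/"]
          # Do not publish the device name as a dimension of disk metrics.
          drop_device = true
        }
      }
    }
  })
}
```

With the device dimension dropped, the alarm's dimensions would need to list only the remaining ones (such as path and fstype, plus any appended dimensions).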

simonzhekoff self-assigned this Nov 7, 2025

simonzhekoff force-pushed the GDB-13346 branch 2 times, most recently from 140fa3a to 6719d12, November 7, 2025 13:00

Added support for alerts per node for cpu utilization, mem_used_percent, and root ebs volume disk_used_percent, Optimized alarms

2dcce29

simonzhekoff force-pushed the GDB-13346 branch from 6719d12 to 2dcce29, November 7, 2025 13:00

simonzhekoff changed the title ~~[GDB-13346]~~ [GDB-13346] Enhance CloudWatch monitoring: add new alarms and refactor existing ones Nov 7, 2025

simonzhekoff requested a review from viktor-ribchev November 7, 2025 13:11

viktor-ribchev reviewed Nov 7, 2025

Fixed all the alarms except heap usage

e25100a
