@simonzhekoff simonzhekoff commented Nov 7, 2025

Description

This PR enhances the CloudWatch monitoring configuration and improves alarm coverage and consistency across all GraphDB nodes.

Key Changes
• Refactored existing alarms
  • Added treat_missing_data = "notBreaching" to prevent false positives during deployment or downtime.
  • Unified comparison_operator to GreaterThanOrEqualToThreshold for consistency.
• Introduced new per-node alarms
  • disk_used_percent: triggers when root disk usage exceeds the threshold.
  • mem_used_percent: monitors memory utilization per instance.
  • cpu_utilization: tracks CPU usage individually for each node.
• Updated CloudWatch Agent configuration
  • Added InstanceId as an appended dimension to support per-node metric granularity and accurate alarm mapping.
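
The agent-side change described above could look roughly like the following. This is a minimal sketch, not the configuration from this PR: the local name `cloudwatch_agent_config` and the choice of collected metrics are assumptions made for illustration.

```hcl
# Sketch only: the local name and the metric selection are assumptions.
locals {
  cloudwatch_agent_config = jsonencode({
    metrics = {
      # Appends the EC2 instance ID as a dimension on the collected metrics.
      # The doubled dollar sign escapes Terraform interpolation, so the agent
      # receives its own literal placeholder for the instance ID.
      append_dimensions = {
        InstanceId = "$${aws:InstanceId}"
      }
      metrics_collected = {
        mem = {
          measurement = ["mem_used_percent"]
        }
        disk = {
          measurement = ["used_percent"] # published as disk_used_percent
          resources   = ["/"]            # root filesystem only
        }
      }
    }
  })
}
```

With InstanceId appended, each node publishes its own metric series, so an alarm can target one instance through its dimensions.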

Related Issues

[GDB-13346]

Changes

  • Added support for monitoring the mem_used_percent metric per node via a CloudWatch alarm and dashboard.
  • Added support for monitoring disk_used_percent for the root EBS volume per node via a CloudWatch alarm.
  • Added notifications (ok_actions) for when an alarm transitions from the ALARM state back to the OK state.
  • Changed the comparison operator to GreaterThanOrEqualToThreshold for all alarms.
  • Added the treat_missing_data option to all alarms and set it to notBreaching.
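
Taken together, the changes listed above might produce a per-node alarm along these lines. This is a hedged sketch rather than the code in this PR: the variables `instance_ids` and `memory_threshold`, the resource name `memory_per_node`, and the period and evaluation settings are hypothetical. The SNS topic reference is taken from the diff and is assumed to be defined elsewhere in the module.

```hcl
variable "instance_ids" {
  description = "Hypothetical input: IDs of the GraphDB EC2 instances"
  type        = set(string)
}

variable "memory_threshold" {
  description = "Hypothetical input: memory usage percentage that raises the alarm"
  type        = number
  default     = 90
}

# One alarm per node. aws_sns_topic.graphdb_sns_topic is assumed to exist
# elsewhere in the module, as referenced in the diff.
resource "aws_cloudwatch_metric_alarm" "memory_per_node" {
  for_each = var.instance_ids

  alarm_name          = "graphdb-memory-${each.key}"
  alarm_description   = "Memory utilization on ${each.key}"
  namespace           = "CWAgent" # default namespace of the CloudWatch agent
  metric_name         = "mem_used_percent"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = var.memory_threshold
  treat_missing_data  = "notBreaching"

  # Must match exactly the dimension set the agent publishes for this metric.
  dimensions = {
    InstanceId = each.key
  }

  alarm_actions = [aws_sns_topic.graphdb_sns_topic.arn] # sent on OK -> ALARM
  ok_actions    = [aws_sns_topic.graphdb_sns_topic.arn] # sent on ALARM -> OK
}
```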

Checklist

  • I have tested these changes thoroughly.
  • My code follows the project's coding style.
  • I have added appropriate comments to my code, especially in complex areas.
  • All new and existing tests passed locally.

@simonzhekoff simonzhekoff self-assigned this Nov 7, 2025
@simonzhekoff simonzhekoff force-pushed the GDB-13346 branch 2 times, most recently from 140fa3a to 6719d12 on November 7, 2025 13:00
…nt, and root ebs volume disk_used_percent, Optimized alarms
@simonzhekoff simonzhekoff changed the title from "[GDB-13346]" to "[GDB-13346] Enhance CloudWatch monitoring: add new alarms and refactor existing ones" Nov 7, 2025
threshold = "0"
alarm_actions = [aws_sns_topic.graphdb_sns_topic.arn]
ok_actions = [aws_sns_topic.graphdb_sns_topic.arn]
treat_missing_data = "notBreaching"
Contributor

Why notBreaching? If the agent is not sending data, for example because the VM has exhausted its available memory, then the alarm should trigger.
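
For comparison, the behavior asked for in this comment could be sketched as below, assuming the alarm should fire when the agent stops reporting. This is only an illustration: the variable `instance_id`, the resource name, and the threshold are hypothetical and are not part of the diff.

```hcl
variable "instance_id" {
  description = "Hypothetical input: ID of a single GraphDB EC2 instance"
  type        = string
}

# Sketch of the alternative: missing datapoints count as breaching, so an
# agent that goes silent moves the alarm into the ALARM state.
resource "aws_cloudwatch_metric_alarm" "memory_missing_data_breaching" {
  alarm_name          = "graphdb-memory-${var.instance_id}"
  namespace           = "CWAgent"
  metric_name         = "mem_used_percent"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3 # assumption: tolerate short reporting gaps
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 90 # assumption, not taken from the PR
  treat_missing_data  = "breaching"

  dimensions = {
    InstanceId = var.instance_id
  }
}
```

The trade-off is that this variant can also fire during planned restarts or deployments, which is the false-positive case the PR description wants to avoid.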

ImageId = data.aws_instance.by_id[each.key].ami
InstanceType = data.aws_instance.by_id[each.key].instance_type
path = "/"
device = "nvme0n1p1"
Contributor

Is the device always going to be nvme0n1p1?

Contributor Author

For Nitro-based instances, yes. I will look for a way to find the right device automatically.
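
One possible way to avoid hardcoding the device name, offered as an assumption rather than the approach the author settled on, is to have the CloudWatch agent drop the device dimension from disk metrics using its `drop_device` option. The alarm would then no longer need a `device` entry. The local name `cloudwatch_agent_disk_config` is hypothetical.

```hcl
# Sketch only: a swapped-in alternative to device auto-detection.
locals {
  cloudwatch_agent_disk_config = jsonencode({
    metrics = {
      append_dimensions = {
        # Doubled dollar sign escapes Terraform interpolation.
        InstanceId = "$${aws:InstanceId}"
      }
      metrics_collected = {
        disk = {
          measurement = ["used_percent"]
          resources   = ["/"]
          # Omits the device dimension from disk metrics, so the alarm does
          # not depend on names such as nvme0n1p1.
          drop_device = true
        }
      }
    }
  })
}
```

The remaining disk dimensions, such as path and fstype, would still have to match the alarm's dimensions exactly.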

3 participants