Contributor

@theodorehreuter theodorehreuter commented Apr 28, 2025

Proposed Changes

This PR fills gaps in our alarm coverage that were exposed by infra failures.

  • Adding a CRITICAL alarm and WARNING alarm to stac-server for the dead letter queue
  • Adding a CRITICAL alarm and WARNING alarm for stac-server OpenSearch cluster health
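For illustration only, a minimal sketch of what a CRITICAL cluster health alarm could look like. The PR description does not say which metric the alarms watch, so the use of the `ClusterStatus.red` metric in the `AWS/ES` namespace is an assumption, as are the period, threshold, and the `opensearch_domain_name` variable (hypothetical). A WARNING counterpart would presumably watch `ClusterStatus.yellow` and notify the warning topic instead.

```hcl
# Hypothetical inputs; only critical_sns_topic_arn echoes a name used in this PR.
variable "name_prefix" { type = string }
variable "opensearch_domain_name" { type = string }
variable "critical_sns_topic_arn" { type = string }

# OpenSearch domain metrics are dimensioned by domain name and AWS account ID.
data "aws_caller_identity" "current" {}

resource "aws_cloudwatch_metric_alarm" "critical_opensearch_cluster_health" {
  alarm_name          = "CRITICAL: ${var.name_prefix}-stac-server OpenSearch cluster status red"
  alarm_description   = "CRITICAL: the stac-server OpenSearch cluster reported red status"
  namespace           = "AWS/ES"
  metric_name         = "ClusterStatus.red" # assumed metric, not confirmed by the PR
  statistic           = "Maximum"
  period              = 60 # seconds, assumed
  evaluation_periods  = 1  # assumed
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1

  dimensions = {
    DomainName = var.opensearch_domain_name
    ClientId   = data.aws_caller_identity.current.account_id
  }

  alarm_actions = [var.critical_sns_topic_arn]
}
```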

One outstanding question: should deployment of the alarms be optional? I've chosen to deploy them automatically, but I'm open to arguments for making deployment optional.

Checklist

  • I have deployed and validated this change
  • Changelog
    • I have added my changes to the changelog
    • No changelog entry is necessary
  • README migration
    • I have added any migration steps to the Readme
    • No migration is necessary

@theodorehreuter theodorehreuter marked this pull request as draft April 28, 2025 20:04
resource "aws_cloudwatch_metric_alarm" "critical_stac_server_dlq_alarm" {
alarm_name = "CRITICAL: ${local.name_prefix}-stac-server-dead-letter SQS DLQ Critical Alarm"
alarm_description = "CRITICAL: 10 or more messages are persisting in the ${local.name_prefix}-stac-server SQS dead letter queue"
evaluation_periods = 5
Contributor Author

@theodorehreuter theodorehreuter Apr 30, 2025

This might be high. The thinking is that if a batch of messages has been sitting in the queue for over 5 minutes, it's properly stuck there, hence a CRITICAL alarm.
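To make the "over 5 minutes" reasoning concrete, here is a minimal sketch of how the full alarm might be configured. Only the resource name, alarm name, description, and `evaluation_periods = 5` come from the diff above. The 60-second period, the `Maximum` statistic, the `ApproximateNumberOfMessagesVisible` metric, and the `dead_letter_queue_name` variable are assumptions, and `var.name_prefix` stands in for `local.name_prefix` so the snippet is self-contained.

```hcl
# Hypothetical inputs for a self-contained example.
variable "name_prefix" { type = string }
variable "dead_letter_queue_name" { type = string }
variable "critical_sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "critical_stac_server_dlq_alarm" {
  alarm_name        = "CRITICAL: ${var.name_prefix}-stac-server-dead-letter SQS DLQ Critical Alarm"
  alarm_description = "CRITICAL: 10 or more messages are persisting in the ${var.name_prefix}-stac-server SQS dead letter queue"

  namespace   = "AWS/SQS"
  metric_name = "ApproximateNumberOfMessagesVisible" # assumed metric
  statistic   = "Maximum"                            # assumed statistic

  # 5 consecutive 60-second periods at or above the threshold,
  # i.e. roughly 5 minutes before the alarm fires.
  period              = 60 # assumed; not shown in the diff
  evaluation_periods  = 5
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 10 # matches "10 or more messages" in the description

  # An empty queue may report no data; treat that as healthy (assumed choice).
  treat_missing_data = "notBreaching"

  dimensions = {
    QueueName = var.dead_letter_queue_name
  }

  alarm_actions = [var.critical_sns_topic_arn]
}
```

With these assumed values, lowering `evaluation_periods` shortens the time to alarm proportionally, so a value of 3 would fire after roughly 3 minutes.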

@theodorehreuter theodorehreuter marked this pull request as ready for review April 30, 2025 15:56
deploy_stac_server_outside_vpc = var.deploy_stac_server_outside_vpc
fd_web_acl_id = var.deploy_waf_rule ? module.base_infra.web_acl_id : var.ext_web_acl_id
warning_sns_topic_arn = module.base_infra.warning_sns_topic_arn
critical_sns_topic_arn = module.base_infra.critical_sns_topic_arn
Member

We should let users toggle whether these stac-server alarms are deployed via a boolean flag. Deploying alarms automatically can be handy, but some users may prefer to manage all tracked dimensions themselves.

Suggest using the same approach used in the cirrus module for consistency:

Contributor Author

Added a deploy_alarms flag, similar to the one in the cirrus module.
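A minimal sketch of how such a flag is commonly wired up in Terraform, using `count` to create the alarm only when the flag is set. The cirrus module's actual implementation is not shown in this thread, so this illustrates the general pattern rather than that code. The default value, the warning threshold, and the `dead_letter_queue_name` variable are hypothetical.

```hcl
variable "deploy_alarms" {
  description = "Whether to deploy the stac-server CloudWatch alarms"
  type        = bool
  default     = true # assumed default
}

# Hypothetical inputs for a self-contained example.
variable "name_prefix" { type = string }
variable "dead_letter_queue_name" { type = string }
variable "warning_sns_topic_arn" { type = string }

resource "aws_cloudwatch_metric_alarm" "warning_stac_server_dlq_alarm" {
  # Zero instances when the flag is false, so no alarm is created.
  count = var.deploy_alarms ? 1 : 0

  alarm_name          = "WARNING: ${var.name_prefix}-stac-server-dead-letter SQS DLQ Warning Alarm"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible" # assumed metric
  statistic           = "Maximum"
  period              = 60 # assumed
  evaluation_periods  = 1  # assumed
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1 # hypothetical warning threshold

  dimensions = {
    QueueName = var.dead_letter_queue_name
  }

  alarm_actions = [var.warning_sns_topic_arn]
}
```

Because the resource uses `count`, any reference to it elsewhere needs an index, for example `aws_cloudwatch_metric_alarm.warning_stac_server_dlq_alarm[0].arn`, and that reference is only valid when the flag is true.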
