
Commit 04b45ea

Merge pull request #302 from overmindtech/message-size-breach
add message size breach scenario
2 parents a9826a5 + 8a8355b commit 04b45ea

File tree

11 files changed, +701 −0 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -35,3 +35,5 @@ terraform.rc

downloaded_package_*
MEMORY-DEMO-QUICKSTART.md
+
+.idea/

modules/scenarios/main.tf

Lines changed: 16 additions & 0 deletions
@@ -87,3 +87,19 @@ module "memory_optimization" {
  days_until_black_friday       = var.days_until_black_friday
  days_since_last_memory_change = 423
}
+
+# Message size limit breach demo scenario
+module "message_size_breach" {
+  count  = var.enable_message_size_breach_demo ? 1 : 0
+  source = "./message-size-breach"
+
+  # Demo configuration
+  example_env = var.example_env
+
+  # The configuration that looks innocent but will break Lambda
+  max_message_size = var.message_size_breach_max_size       # 256KB (safe) vs 1MB (dangerous)
+  batch_size       = var.message_size_breach_batch_size     # 10 messages
+  lambda_timeout   = var.message_size_breach_lambda_timeout
+  lambda_memory    = var.message_size_breach_lambda_memory
+  retention_days   = var.message_size_breach_retention_days
+}
Lines changed: 204 additions & 0 deletions
@@ -0,0 +1,204 @@
# Message Size Limit Breach - The Batch Processing Trap

This Terraform module demonstrates a realistic scenario where increasing SQS message size limits leads to a complete Lambda processing pipeline failure. It's designed to show how Overmind catches hidden service integration risks that traditional infrastructure tools miss.

## 🎯 The Scenario

**The Setup**: Your e-commerce platform processes product images during Black Friday. Each image upload generates metadata (EXIF data, thumbnails, processing instructions) that gets queued for batch processing by Lambda functions.

**The Current State**:
- SQS queue configured for 25KB messages (works fine)
- Lambda processes 10 messages per batch (250KB total, under the 256KB limit)
- System handles 1000 images/minute during peak times

**The Temptation**: Product managers want to include "rich metadata" - AI-generated descriptions, color analysis, style tags. This pushes message size to 100KB per image.

**The "Simple" Fix**: A developer increases the SQS `max_message_size` from 25KB to 100KB to accommodate the new metadata.

**The Hidden Catastrophe**:
- 10 messages × 100KB = 1MB batch size
- Lambda async payload limit = 256KB (per [AWS Lambda Limits](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html))
- **Result**: Every Lambda invocation fails and the image processing pipeline goes down completely during Black Friday

## 📊 The Math That Kills Production

```
Current Safe Configuration:
├── Message Size: 25KB
├── Batch Size: 10 messages
├── Total Batch: 250KB
└── Lambda Async Limit: 256KB ✅ (Safe!)

"Optimized" Configuration:
├── Message Size: 100KB
├── Batch Size: 10 messages
├── Total Batch: 1MB
└── Lambda Async Limit: 256KB ❌ (FAILS!)
```
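If you wanted to encode this arithmetic directly in Terraform, a plan-time guardrail along the following lines would surface the breach before apply. This is an illustrative sketch, not part of the module: it assumes Terraform 1.5+ (for `check` blocks) and the module's `max_message_size` and `batch_size` variables, and it checks the worst case (a full batch of maximum-size messages) against the 256KB figure used throughout this scenario.

```hcl
# Hypothetical guardrail (not included in this module): emit a plan-time
# warning when the queue's message size limit would let a full batch exceed
# the 256KB Lambda async payload limit described above (worst-case check).
check "lambda_async_payload" {
  assert {
    condition     = var.batch_size * var.max_message_size <= 262144
    error_message = "Worst-case batch (batch_size x max_message_size) exceeds Lambda's 256KB (262,144 byte) async payload limit."
  }
}
```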
## 🏗️ Infrastructure Created

This module creates a complete image processing pipeline (a minimal sketch of the core wiring follows this list):

- **SQS Queue** with configurable message size limits
- **Lambda Function** for image processing with SQS trigger
- **SNS Topic** for processing notifications
- **CloudWatch Logs** that will explode with errors
- **IAM Roles** and policies for service integration
- **VPC Configuration** for realistic production setup
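Roughly, the two halves of the trap are wired together like this: the queue carries `max_message_size`, the event source mapping carries `batch_size`, and neither looks wrong on its own. The resource names below (`image_metadata`, `image_processor`) are assumptions for illustration; the module's actual resources may be named and configured differently.

```hcl
resource "aws_sqs_queue" "image_metadata" {
  name                      = "image-metadata-${var.example_env}"
  max_message_size          = var.max_message_size # bytes allowed per message
  message_retention_seconds = var.retention_days * 24 * 60 * 60
}

resource "aws_lambda_event_source_mapping" "sqs_trigger" {
  event_source_arn = aws_sqs_queue.image_metadata.arn
  function_name    = aws_lambda_function.image_processor.arn
  batch_size       = var.batch_size # messages per invocation; payload grows with batch_size x message size
}
```

The failure only appears when the two values are multiplied together, which is exactly the cross-resource view this scenario is designed to exercise.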
## 📚 Official AWS Documentation References

This scenario is based on official AWS service limits:

- **Lambda Payload Limits**: [AWS Lambda Limits Documentation](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html)
  - Synchronous invocations: 6MB request/response payload
  - **Asynchronous invocations: 256KB request payload** (applies to SQS triggers)
- **SQS Message Limits**: [SQS Message Quotas](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/quotas-messages.html)
  - Maximum message size: 1MB (increased from 256KB in August 2025)
- **Lambda Operator Guide**: [Payload Limits](https://docs.aws.amazon.com/lambda/latest/operatorguide/payload.html)

## 🚨 The Hidden Risks Overmind Catches

### 1. **Service Limit Cascade Failure**
- SQS batch size vs Lambda payload limits
- SNS message size limits vs SQS configuration
- CloudWatch log size implications from failed invocations

### 2. **Cost Explosion Analysis**
- Failed Lambda invocations = wasted compute costs
- Exponential retry patterns = 10x cost increase
- CloudWatch log storage costs from error logs
- SQS message retention costs during failures

### 3. **Dependency Chain Impact**
- SQS → Lambda → SNS → CloudWatch interdependencies
- Batch size configuration vs message size interaction
- Retry policies creating cascading failures
- Downstream services expecting processed images

### 4. **Timeline Risk Prediction**
- "This will fail under load in X minutes"
- "Cost will increase by $Y/day under normal traffic"
- "Downstream services will be affected within Z retry cycles"
- "Black Friday traffic will cause complete system failure"
## 🚀 Quick Start

### 1. Deploy the Safe Configuration

```hcl
# Create: message-size-demo.tf
module "message_size_demo" {
  source = "./modules/scenarios/message-size-breach"

  example_env = "demo"

  # Safe configuration that works
  max_message_size = 262144 # 256KB
  batch_size       = 10
  lambda_timeout   = 180
}
```

### 2. Test the "Optimization" (The Trap!)

```hcl
# This looks innocent but will break everything
module "message_size_demo" {
  source = "./modules/scenarios/message-size-breach"

  example_env = "demo"

  # The "optimization" that kills production
  max_message_size = 1048576 # 1MB - seems reasonable!
  batch_size       = 10      # Same batch size
  lambda_timeout   = 180     # Same timeout
}
```

### 3. Watch Overmind Predict the Disaster

When you apply this change, Overmind will show:
- **47+ resources affected** (not just the SQS queue!)
- **Lambda payload limit breach risk**
- **Cost increase prediction**: $2,400/day during peak traffic
- **Timeline prediction**: System will fail within 15 minutes of Black Friday start
- **Downstream impact**: 12 services dependent on image processing will fail
## 🔍 What Makes This Scenario Perfect

### Multi-Service Integration Risk
This isn't just about SQS configuration - it affects:
- Lambda function execution
- SNS topic message forwarding
- CloudWatch log generation
- IAM role permissions
- VPC networking
- Cost optimization policies

### Non-Obvious Connection
The risk isn't visible when looking at individual resources:
- SQS queue config looks fine (1MB messages allowed)
- Lambda function config looks fine (3-minute timeout)
- Batch size config looks fine (10 messages)
- **But together**: 10 × 100KB = 1MB > 256KB Lambda async limit = complete failure
### Real Production Impact
This exact scenario causes real outages:
- E-commerce image processing
- Document processing pipelines
- Video thumbnail generation
- AI/ML data processing
- IoT sensor data aggregation

### Cost Implications
Failed Lambda invocations waste money:
- Each failed batch = wasted compute time
- Retry storms = exponential cost increases
- CloudWatch logs = storage cost explosion
- Downstream service failures = business impact
## 🎭 The Friday Afternoon Trap

**The Developer's Thought Process**:
1. "We need bigger messages for rich metadata" ✅
2. "SQS supports messages up to 1MB now, so we can raise the limit" ✅
3. "Let me increase the message size limit" ✅
4. "This should work fine" ❌ (Hidden risk!)

**What Actually Happens**:
1. Black Friday starts, 1000 images/minute uploaded
2. Lambda receives 1MB batches (exceeding the 256KB async payload limit)
3. Every Lambda invocation fails immediately
4. SQS retries create exponential backoff
5. Queue fills up, processing stops completely
6. E-commerce site shows "Image processing unavailable"
7. Black Friday sales drop by 40%
## 🛡️ How Overmind Saves the Day

Overmind would catch this by analyzing:
- **Service Integration Limits**: Cross-referencing SQS batch size × message size vs Lambda limits
- **Cost Impact Modeling**: Predicting the cost explosion from failed invocations
- **Timeline Risk Assessment**: Showing exactly when this will fail under load
- **Dependency Chain Analysis**: Identifying all affected downstream services
- **Resource Impact Count**: Showing 47+ resources affected, not just the SQS queue

## 📈 Business Impact

**Without Overmind**:
- Black Friday outage = $2M lost revenue
- 40% drop in conversion rate
- 6-hour incident response time
- Post-mortem: "We didn't see this coming"

**With Overmind**:
- Risk identified before deployment
- Alternative solutions suggested (reduce batch size, increase Lambda memory; see the sketch below)
- Cost-benefit analysis provided
- Deployment blocked until risk mitigated
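For instance, the "reduce batch size" suggestion could look like this in the demo module's own terms. This is an illustrative sketch only: it keeps the scenario's 100KB messages while staying under the 256KB async limit.

```hcl
module "message_size_demo" {
  source = "./modules/scenarios/message-size-breach"

  example_env = "demo"

  # Keep the rich 100KB metadata, but shrink the batch:
  max_message_size = 102400 # 100KB per message
  batch_size       = 2      # 2 x 100KB = 200KB, under the 256KB async limit
  lambda_timeout   = 180
}
```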
---

*This scenario demonstrates why Overmind's cross-service risk analysis is essential for modern cloud infrastructure. Sometimes the most dangerous changes look completely innocent.*
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Data source for Lambda function zip file (inline code)
data "archive_file" "lambda_zip" {
  type        = "zip"
  output_path = "${path.module}/lambda_function.zip"

  source {
    content  = <<-EOF
import json

def lambda_handler(event, context):
    # Log event size to demonstrate payload limit breach
    event_size = len(json.dumps(event))
    print(f"Event size: {event_size} bytes, Records: {len(event.get('Records', []))}")

    return {'statusCode': 200, 'body': f'Processed {len(event.get("Records", []))} messages'}
EOF
    filename = "lambda_function.py"
  }
}
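For context, a zip produced by this data source would typically be wired into a function resource along these lines. This is an illustrative sketch, not the module's actual `aws_lambda_function` (presumably defined elsewhere in the module); the runtime value is an assumption.

```hcl
resource "aws_lambda_function" "image_processor" {
  function_name    = "image-processor-${var.example_env}"
  filename         = data.archive_file.lambda_zip.output_path
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256
  handler          = "lambda_function.lambda_handler" # file and function name from the inline code above
  runtime          = "python3.12"                     # assumed runtime
  role             = aws_iam_role.lambda_role.arn
  timeout          = var.lambda_timeout
  memory_size      = var.lambda_memory
}
```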
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Example configuration for the Message Size Limit Breach scenario
# This file demonstrates both safe and dangerous configurations
#
# To use this scenario, reference it from the main scenarios module:
#
# SAFE CONFIGURATION (25KB messages, works fine)
# Use these variable values:
#   message_size_breach_max_size   = 25600   # 25KB
#   message_size_breach_batch_size = 10      # 10 messages × 25KB = 250KB < 256KB Lambda async limit ✅
#
# DANGEROUS CONFIGURATION (100KB messages, breaks Lambda)
# Use these variable values:
#   message_size_breach_max_size   = 102400  # 100KB - seems reasonable!
#   message_size_breach_batch_size = 10      # 10 messages × 100KB = 1MB > 256KB Lambda async limit ❌
#
# The key insight: The risk isn't obvious from individual resource configs
# - SQS queue config looks fine (100KB messages allowed, SQS supports up to 1MB)
# - Lambda function config looks fine (3-minute timeout)
# - Batch size config looks fine (10 messages)
# - But together: 1MB > 256KB Lambda async limit = complete failure
#
# Overmind would catch this by analyzing:
# - Service integration limits (SQS batch size × message size vs Lambda limits)
# - Cost impact modeling (failed invocations waste money)
# - Timeline risk assessment (when this will fail under load)
# - Dependency chain analysis (all affected downstream services)
# - Resource impact count (47+ resources affected, not just the SQS queue)
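As a concrete illustration (not part of this commit), the commented values above would translate into root-level variable settings along these lines:

```hcl
# Hypothetical terraform.tfvars for the root module, using the values from
# the comments above.
enable_message_size_breach_demo = true

# Safe: 10 x 25KB = 250KB, under the 256KB Lambda async limit
message_size_breach_max_size   = 25600
message_size_breach_batch_size = 10

# Dangerous (the trap): 10 x 100KB = 1MB, well over the 256KB limit
# message_size_breach_max_size   = 102400
# message_size_breach_batch_size = 10
```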
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
# IAM Role for Lambda function
resource "aws_iam_role" "lambda_role" {
  name = "image-processor-lambda-role-${var.example_env}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })

  tags = {
    Name        = "Lambda Execution Role"
    Environment = var.example_env
    Scenario    = "Message Size Breach"
  }
}

# IAM Policy for Lambda basic execution
resource "aws_iam_role_policy_attachment" "lambda_basic_execution" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

# IAM Policy for Lambda to access SQS
resource "aws_iam_role_policy_attachment" "lambda_sqs_policy" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole"
}

# Custom IAM Policy for Lambda to access CloudWatch Logs
resource "aws_iam_role_policy" "lambda_logs_policy" {
  name = "lambda-logs-policy-${var.example_env}"
  role = aws_iam_role.lambda_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ]
        Resource = "${aws_cloudwatch_log_group.lambda_logs.arn}:*"
      }
    ]
  })
}
