Commit a016685

Merge pull request #28 from sio2project/refactor
Various changes and refactors
2 parents 363887a + e66e8be commit a016685

22 files changed: 1618 additions and 854 deletions

DEVELOPMENT.md

Lines changed: 444 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 83 additions & 110 deletions
@@ -14,7 +14,7 @@ S3 deduplication proxy server with Filetracker protocol compatibility.
 - **Distributed Locking**: PostgreSQL advisory locks for distributed, high-availability deployments
 - **Migration Support**: Offline and live migration from old Filetracker instances
 - **Auto Cleanup**: Background cleaner removes unreferenced S3 objects
-- **Multi-bucket**: Run multiple independent buckets on different ports
+- **Single-instance per bucket**: Each instance handles exactly one bucket; scale horizontally with multiple instances
 
 ## Quick Start with Docker
 
@@ -93,7 +93,7 @@ POSTGRES_MAX_CONNECTIONS=10
 
 ### Distributed Locking (PostgreSQL Advisory Locks)
 
-For high-availability deployments with multiple s3dedup instances, enable PostgreSQL-based distributed locks:
+For distributed locking across multiple instances in high-availability setups, enable PostgreSQL-based advisory locks:
 
 ```
 LOCKS_TYPE=postgres
@@ -105,109 +105,22 @@ POSTGRES_DB=s3dedup
 POSTGRES_MAX_CONNECTIONS=10
 ```
 
-**Benefits of PostgreSQL Locks**:
-- **Distributed Locking**: Multiple s3dedup instances can safely coordinate file operations
-- **High Availability**: If one instance fails, others can continue with the same locks
-- **Load Balancing**: Multiple instances can share the same database for coordinated access
-- **Atomic Operations**: Prevents race conditions in concurrent file operations
+**When to Use**:
+- **Single-instance deployments**: Use default memory-based locking (LOCKS_TYPE=memory)
+- **Multi-instance HA deployments**: Use PostgreSQL-based locking for coordinated access
 
-**How It Works**:
-- Uses PostgreSQL's built-in advisory locks (`pg_advisory_lock`, `pg_advisory_lock_shared`)
-- Lock keys are hashed to 64-bit integers for PostgreSQL's lock API
-- Shared locks allow concurrent reads; exclusive locks ensure serialized writes
-- Automatic lock release when guard is dropped (via background cleanup tasks)
-
-**Note**: PostgreSQL locks require the same PostgreSQL instance used for KV storage. Connection pool is shared between both uses.
+**Note**: PostgreSQL locks share the connection pool with KV storage. Ensure sufficient pool size for concurrent operations. See [DEVELOPMENT.md](DEVELOPMENT.md) for implementation details.
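Reviewer sketch (not part of the commit): the removed "How It Works" bullets say that lock keys are hashed to 64-bit integers for PostgreSQL's advisory-lock API. The README does not say which hash is used, so the following is only an illustration. It assumes SHA-256 truncated to 8 bytes and read as a signed integer, which fits the `bigint` argument that `pg_advisory_lock` and `pg_advisory_lock_shared` accept. The function name `advisory_lock_key` and the sample path are hypothetical.

```python
import hashlib


def advisory_lock_key(path: str) -> int:
    """Map a lock name to a signed 64-bit integer (PostgreSQL bigint range).

    Assumption: SHA-256 truncated to 8 bytes. The hash that s3dedup
    actually uses is not stated in the README.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Read the first 8 bytes as a big-endian signed integer so the value
    # always fits the bigint parameter of pg_advisory_lock().
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


key = advisory_lock_key("files/example.txt")  # hypothetical path
# Per the README text: shared locks for reads, exclusive locks for writes.
print("SELECT pg_advisory_lock_shared(%d);" % key)
print("SELECT pg_advisory_lock(%d);" % key)
```

Because the key is derived only from the path, every instance that hashes the same path asks PostgreSQL for the same lock, which is what lets several instances coordinate through one database.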
 
 ### Connection Pool Sizing
 
-The `POSTGRES_MAX_CONNECTIONS` setting controls the maximum number of concurrent database connections from a single s3dedup instance. This **single pool** is shared between KV storage operations and lock management.
-
-**How to Choose Pool Size:**
-
-```
-Pool Size = (Concurrent Requests × 1.5) + Lock Overhead
-```
-
-**General Guidelines:**
-
-| Deployment | Concurrency | Recommended Pool Size | Notes |
-|------------|-------------|----------------------|-------|
-| **Low** | 1-5 concurrent requests | 10 | Default, suitable for development/testing |
-| **Medium** | 5-20 concurrent requests | 20-30 | Small production deployments |
-| **High** | 20-100 concurrent requests | 50-100 | Large production deployments |
-| **Very High** | 100+ concurrent requests | 100-200 | Use multiple instances with load balancing |
-
-**Factors to Consider:**
-
-1. **Number of s3dedup Instances**
-   - If you have N instances, each needs its own pool
-   - Total connections = N instances × pool_size
-   - PostgreSQL must have enough capacity for all instances
-   - Example: 3 instances × 30 pool_size = 90 connections needed
-
-2. **Lock Contention**
-   - File operations acquire locks (1 connection per lock)
-   - Concurrent uploads/downloads increase lock pressure
-   - Add 20% overhead for lock operations
-   - Example: 20 concurrent requests → pool_size = (20 × 1.5) + overhead ≈ 35
-
-3. **Database Configuration**
-   - Check PostgreSQL `max_connections` setting
-   - Reserve connections for maintenance, monitoring, backups
-   - Example: PostgreSQL with 200 max_connections:
-     - Reserve 10 for maintenance
-     - If 3 s3dedup instances: (200 - 10) / 3 ≈ 63 per instance
-
-4. **Memory Usage Per Connection**
-   - Each connection uses ~5-10 MB of memory
-   - Pool size 50 = ~250-500 MB per instance
-   - Monitor actual usage and adjust accordingly
-
-**Example Configurations:**
-
-**Development (1 instance, low throughput):**
-```json
-"postgres": {
-  "pool_size": 10
-}
-```
-
-**Production (3 instances, medium throughput):**
-```json
-"postgres": {
-  "pool_size": 30
-}
-```
-With PostgreSQL `max_connections = 100`:
-- 3 × 30 = 90 connections (10 reserved)
-
-**High-Availability (5 instances, high throughput with PostgreSQL max_connections = 200):**
-```json
-"postgres": {
-  "pool_size": 35
-}
-```
-- 5 × 35 = 175 connections (25 reserved for other operations)
+The `POSTGRES_MAX_CONNECTIONS` setting controls the maximum number of concurrent database connections. This pool is shared between KV storage operations and lock management.
 
-**Monitoring and Tuning:**
+**Quick Start Recommendations:**
+- **Development**: `POSTGRES_MAX_CONNECTIONS=10`
+- **Small Production (1-3 instances)**: `POSTGRES_MAX_CONNECTIONS=20-30`
+- **Large Production (5+ instances)**: `POSTGRES_MAX_CONNECTIONS=50-100`
 
-Monitor these metrics to optimize pool size:
-
-1. **Connection Utilization**: Check if connections are frequently exhausted
-   ```sql
-   SELECT count(*) FROM pg_stat_activity WHERE datname = 's3dedup';
-   ```
-
-2. **Lock Wait Times**: Monitor if operations wait for available connections
-3. **Memory Usage**: Watch instance memory as pool size increases
-
-**Scaling Strategy:**
-
-- **Start Conservative**: Begin with pool_size = 10-20
-- **Monitor Usage**: Track connection utilization over 1-2 weeks
-- **Increase Gradually**: Increment by 10-20 when you see high utilization
-- **Scale Horizontally**: Instead of very large pools (>100), use more instances with moderate pools
+For detailed pool sizing guidance, monitoring strategies, and tuning considerations, see [DEVELOPMENT.md](DEVELOPMENT.md#connection-pool-sizing).
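Reviewer sketch (not part of the commit): the sizing arithmetic that this hunk removes from the README can be checked with a few lines of Python. The formula and the figures (20 requests, 200 `max_connections`, 10 reserved, 3 and 5 instances) come from the removed text. The function names are hypothetical, and the lock overhead of 5 is an assumption chosen to reproduce the "≈ 35" example, since the README gives no exact value.

```python
import math


def pool_size(concurrent_requests: int, lock_overhead: int) -> int:
    """Pool Size = (Concurrent Requests x 1.5) + Lock Overhead."""
    return math.ceil(concurrent_requests * 1.5) + lock_overhead


def max_pool_per_instance(pg_max_connections: int, reserved: int, instances: int) -> int:
    """Largest per-instance pool that still leaves `reserved` connections free."""
    return (pg_max_connections - reserved) // instances


def total_connections(instances: int, pool: int) -> int:
    """Each instance has its own pool, so the totals add up."""
    return instances * pool


# 20 concurrent requests with an assumed lock overhead of 5 connections.
print(pool_size(20, 5))
# PostgreSQL with max_connections = 200, 10 reserved, 3 instances.
print(max_pool_per_instance(200, 10, 3))
# 3 instances x 30 and 5 instances x 35, as in the removed examples.
print(total_connections(3, 30), total_connections(5, 35))
```

The useful check is the last one: the product of instance count and `POSTGRES_MAX_CONNECTIONS` has to stay below the server's `max_connections` minus whatever is reserved for maintenance.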
 
 ### Config File
 
@@ -224,6 +137,47 @@ docker run -d \
 
 Environment variables override config file values.
 
+## Deployment and Scaling
+
+### Single-Instance per Bucket Architecture
+
+s3dedup follows a **single-bucket-per-instance** design pattern, consistent with 12-factor application principles:
+
+- **One Instance = One Bucket**: Each s3dedup instance manages exactly one S3 bucket and serves one Filetracker endpoint
+- **Horizontal Scaling**: For multiple buckets, run multiple s3dedup instances (one per bucket)
+- **Simplified Configuration**: Cleaner config files, easier to reason about, better for container orchestration
+
+### High-Availability Deployments
+
+For a single bucket with high availability, run multiple instances with PostgreSQL locks and shared database:
+
+```bash
+# All instances share the same PostgreSQL database and use PostgreSQL locks
+docker run -d \
+  --name s3dedup-ha-1 \
+  -p 8001:8080 \
+  -e BUCKET_NAME=files \
+  -e LISTEN_PORT=8080 \
+  -e KVSTORAGE_TYPE=postgres \
+  -e LOCKS_TYPE=postgres \
+  -e POSTGRES_HOST=postgres-db \
+  -e POSTGRES_USER=postgres \
+  -e POSTGRES_PASSWORD=password \
+  -e POSTGRES_DB=s3dedup \
+  -e S3_ENDPOINT=http://minio:9000 \
+  -e S3_ACCESS_KEY=minioadmin \
+  -e S3_SECRET_KEY=minioadmin \
+  ghcr.io/sio2project/s3dedup:latest server --env
+
+# Repeat for instances 2, 3, etc., on different ports
+```
+
+**Benefits of HA Setup**:
+- **Load Balancing**: Requests can be distributed across multiple instances
+- **Fault Tolerance**: If one instance fails, others continue serving requests
+- **Coordinated Access**: PostgreSQL locks ensure safe concurrent file operations
+- **Shared Metadata**: Single database prevents data inconsistency
+
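Reviewer sketch (not part of the commit): the "Repeat for instances 2, 3, etc." comment in the added HA example can be spelled out by generating the `docker run` command for each instance. The environment variables and image name are copied from the README text. The instance names `s3dedup-ha-2` and `s3dedup-ha-3` and the host ports 8002 and 8003 are assumptions that follow the pattern of the first instance, and `docker_run_args` is a hypothetical helper.

```python
import shlex
from typing import Dict, List

# Environment shared by every HA instance, as listed in the README example.
BASE_ENV: Dict[str, str] = {
    "BUCKET_NAME": "files",
    "LISTEN_PORT": "8080",
    "KVSTORAGE_TYPE": "postgres",
    "LOCKS_TYPE": "postgres",
    "POSTGRES_HOST": "postgres-db",
    "POSTGRES_USER": "postgres",
    "POSTGRES_PASSWORD": "password",
    "POSTGRES_DB": "s3dedup",
    "S3_ENDPOINT": "http://minio:9000",
    "S3_ACCESS_KEY": "minioadmin",
    "S3_SECRET_KEY": "minioadmin",
}
IMAGE = "ghcr.io/sio2project/s3dedup:latest"


def docker_run_args(index: int, host_port: int) -> List[str]:
    """Build the docker run argument list for one HA instance."""
    args = ["docker", "run", "-d",
            "--name", f"s3dedup-ha-{index}",
            "-p", f"{host_port}:8080"]
    for name, value in BASE_ENV.items():
        args += ["-e", f"{name}={value}"]
    return args + [IMAGE, "server", "--env"]


# Print the commands instead of running them; only name and host port differ.
for i in range(1, 4):
    print(shlex.join(docker_run_args(i, 8000 + i)))
```

Only the container name and the published host port change between instances; the bucket, database, and lock settings stay identical, which is what makes the instances interchangeable behind a load balancer.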
 ## Migration
 
 > **📖 Complete Migration Guide**: See [docs/migration.md](docs/migration.md) for comprehensive migration instructions
@@ -344,27 +298,43 @@ Compatible with Filetracker protocol v2:
 - `PUT /ft/files/{path}` - Upload file
 - `DELETE /ft/files/{path}` - Delete file
 
-## Building from Source
 
-```bash
-# Build binary
-cargo build --release
+## Testing
+
+For comprehensive testing guide, see **[DEVELOPMENT.md](DEVELOPMENT.md)**.
 
-# Build Docker image
-docker build -t s3dedup:1.0.0-dev .
+Quick start:
+
+```bash
+# Run unit tests (no external dependencies)
+cargo test --lib
 
-# Run tests
+# Run all tests (requires PostgreSQL + MinIO)
+docker-compose up -d
+export DATABASE_URL="postgres://postgres:postgres@localhost:5432/s3dedup_test"
 cargo test
+docker-compose down
 ```
 
 ## Development
 
+See **[DEVELOPMENT.md](DEVELOPMENT.md)** for detailed development instructions including:
+
+- Building from source
+- Running tests with different configurations
+- PostgreSQL advisory lock implementation details
+- Contributing guidelines
+- Performance considerations
+
+Quick start:
+
 ```bash
-# Run with Docker Compose (includes MinIO)
+# Run with Docker Compose (includes PostgreSQL + MinIO)
 docker-compose up
 
-# Run locally
-cargo run -- server --config config.json
+# In another terminal, run tests
+export DATABASE_URL="postgres://postgres:postgres@localhost:5432/s3dedup_test"
+cargo test
 ```
 
 ## Architecture
@@ -378,10 +348,13 @@ cargo run -- server --config config.json
 - PostgreSQL locks: Distributed coordination, suitable for multi-instance HA setups
 - **Cleaner**: Background worker that removes unreferenced S3 objects
 
-For detailed architecture documentation, see [docs/deduplication.md](docs/deduplication.md).
+For detailed architecture documentation, see:
+- [docs/deduplication.md](docs/deduplication.md) - Deduplication architecture and performance
+- [DEVELOPMENT.md](DEVELOPMENT.md) - Lock implementation details and code architecture
 
 ## Documentation
 
+- **[Development Guide](DEVELOPMENT.md)** - Building, testing, lock implementation details, and contributing
 - **[Migration Guide](docs/migration.md)** - Migrating from Filetracker v2.1+ (offline and live migration strategies)
 - **[Deduplication Architecture](docs/deduplication.md)** - How content-based deduplication works, data flows, and performance characteristics
 
config.json

Lines changed: 23 additions & 25 deletions
@@ -3,30 +3,28 @@
     "level": "debug",
     "json": false
   },
-  "buckets": [
-    {
-      "name": "bucket1",
-      "address": "0.0.0.0",
-      "port": 3000,
-      "kvstorage_type": "sqlite",
-      "sqlite": {
-        "path": "db/kv.db",
-        "pool_size": 10
-      },
-      "locks_type": "memory",
-      "s3storage_type": "minio",
-      "minio": {
-        "endpoint": "http://localhost:9000",
-        "access_key": "minioadmin",
-        "secret_key": "minioadmin",
-        "force_path_style": true
-      },
-      "cleaner": {
-        "enabled": false,
-        "interval_seconds": 3600,
-        "batch_size": 1000,
-        "max_deletes_per_run": 10000
-      }
+  "kvstorage_type": "sqlite",
+  "sqlite": {
+    "path": "db/kv.db",
+    "pool_size": 10
+  },
+  "locks_type": "memory",
+  "bucket": {
+    "name": "default",
+    "address": "0.0.0.0",
+    "port": 8080,
+    "s3storage_type": "minio",
+    "minio": {
+      "endpoint": "http://localhost:9000",
+      "access_key": "minioadmin",
+      "secret_key": "minioadmin",
+      "force_path_style": true
+    },
+    "cleaner": {
+      "enabled": false,
+      "interval_seconds": 3600,
+      "batch_size": 1000,
+      "max_deletes_per_run": 10000
     }
-  ]
+  }
 }
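Reviewer sketch (not part of the commit): the config.json change replaces the `buckets` array with a single `bucket` object and moves `kvstorage_type`, `sqlite`, and `locks_type` to the top level. For anyone holding an old multi-bucket file, the reshaping could look roughly like the following. The helper `split_legacy_config` is hypothetical, it moves only the three keys visible in this diff, and it copies any other top-level keys unchanged; whether other per-bucket keys also moved is not shown here.

```python
import copy
import json

# Keys that this diff shows moving from each bucket entry to the top level.
HOISTED_KEYS = ("kvstorage_type", "sqlite", "locks_type")


def split_legacy_config(legacy: dict) -> list:
    """Return one new-style config per entry of the old "buckets" array."""
    shared = {k: v for k, v in legacy.items() if k != "buckets"}
    configs = []
    for entry in legacy.get("buckets", []):
        bucket = copy.deepcopy(entry)  # leave the input untouched
        config = copy.deepcopy(shared)
        for key in HOISTED_KEYS:
            if key in bucket:
                config[key] = bucket.pop(key)
        config["bucket"] = bucket
        configs.append(config)
    return configs


legacy = {
    "buckets": [
        {
            "name": "bucket1",
            "address": "0.0.0.0",
            "port": 3000,
            "kvstorage_type": "sqlite",
            "sqlite": {"path": "db/kv.db", "pool_size": 10},
            "locks_type": "memory",
            "s3storage_type": "minio",
        }
    ]
}
for config in split_legacy_config(legacy):
    print(json.dumps(config, indent=2))
```

Each resulting config would then be run as its own s3dedup instance. Entries that previously shared one SQLite path would need separate storage after the split, which this sketch does not handle.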
