This repository implements an academic research prototype for federated reinforcement learning (FRL) across multi‑cloud regions/providers to jointly learn load balancing and task scheduling policies under latency, SLO, and carbon/energy constraints.
Stack: Python · PyTorch · YAML configs · Docker · GitHub Actions · Integration hooks for Kubernetes/OpenStack · Prometheus metrics adapters
Focus: Reproducible experiments, ablation studies, and paper-ready figures.
- Federated RL (FedAvg + optional FedProx) coordinating regional PPO agents (see the aggregation sketch after this list).
- Two-tier policy: (1) a load balancer picks a cloud/region; (2) a scheduler assigns the task to a node/pool (see the policy/reward sketch after this list).
- Multi-objective rewards combining latency, queueing delay, SLO violations, cost, and carbon intensity.
- Integration hooks (mock + optional live):
  - Kubernetes: cluster metrics + deployment scaling via the `kubernetes` client (optional, with safe stubs for dry-run).
  - OpenStack: Nova/Neutron stubs for VM placement decisions.
  - Prometheus: metric scraping adapters.
- Academic package: configs, baselines, ablations, seeding, experiment runner, result exports, and paper materials in `docs/`.
- Reproducibility: fixed random seeds, logged configs, deterministic ops where possible.
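For orientation, here is a minimal sketch of what server-side FedAvg aggregation with an optional FedProx proximal term can look like in PyTorch. The names (`fedavg`, `fedprox_penalty`, `mu`) are illustrative and do not reflect the actual API in `src/federated/`.

```python
from collections import OrderedDict
from typing import List

import torch


def fedavg(client_states: List[OrderedDict], weights: List[float]) -> OrderedDict:
    """Weighted average of client policy state_dicts (standard FedAvg)."""
    total = sum(weights)
    avg = OrderedDict()
    for key in client_states[0]:
        avg[key] = sum(w * s[key].float() for s, w in zip(client_states, weights)) / total
    return avg


def fedprox_penalty(local_model: torch.nn.Module,
                    global_state: OrderedDict,
                    mu: float = 0.01) -> torch.Tensor:
    """FedProx adds (mu / 2) * ||w_local - w_global||^2 to each client's loss."""
    penalty = torch.tensor(0.0)
    for name, param in local_model.named_parameters():
        penalty = penalty + (param - global_state[name].detach()).pow(2).sum()
    return 0.5 * mu * penalty
```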
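Likewise, a hedged sketch of the two-tier decision and a weighted multi-objective reward. The observation keys, metric fields, and weights below are assumptions for illustration, not the networks or defaults shipped in `src/models/` and `src/sched/`.

```python
import torch


def select_action(lb_policy, sched_policy, obs):
    """Tier 1 samples a region; tier 2 samples a node/pool within that region.

    Both policies are assumed to return categorical logits; the real
    networks live in src/models/ and may differ.
    """
    region_logits = lb_policy(obs["global"])
    region = torch.distributions.Categorical(logits=region_logits).sample()
    node_logits = sched_policy(obs["regions"][region.item()])
    node = torch.distributions.Categorical(logits=node_logits).sample()
    return region.item(), node.item()


def reward(m, w=dict(latency=1.0, queue=0.5, slo=2.0, cost=0.3, carbon=0.3)):
    """Negative weighted sum of per-step penalties (illustrative weights)."""
    return -(w["latency"] * m["latency_ms"]
             + w["queue"] * m["queue_delay_ms"]
             + w["slo"] * m["slo_violations"]
             + w["cost"] * m["cost_usd"]
             + w["carbon"] * m["carbon_g_co2"])
```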
```bash
# 1) Create env
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2) Smoke test (synthetic multi-cloud simulation)
python -m src.run_experiment --config configs/experiments/small_demo.yaml

# 3) Plot results
python -m src.tools.plot_results --input results/small_demo/metrics.csv --out results/small_demo/plots
```

Note: Kubernetes/OpenStack hooks default to dry-run unless you set `INTEGRATION_MODE=live` and provide credentials.
```
federated-rl-multicloud/
├── configs/                  # YAML configs for env/agents/experiments
├── docs/                     # Paper materials: abstract, outline, figs (Mermaid), checklist
├── notebooks/                # Minimal notebooks to inspect logs and metrics
├── results/                  # (created at runtime) experiment outputs
├── src/
│   ├── envs/                 # Multi-cloud simulator
│   ├── federated/            # Aggregation server & client logic
│   ├── hooks/                # Integration hooks (K8s, OpenStack, Prometheus)
│   ├── models/               # Policy/value networks
│   ├── rl/                   # PPO implementation (minimal)
│   ├── sched/                # Two-tier policy wrapper
│   ├── tools/                # Plotting, seeding, io helpers
│   └── run_experiment.py     # CLI entrypoint
├── tests/                    # Unit tests (smoke-level)
├── .github/workflows/ci.yml  # CI: lint + unit tests
├── Dockerfile                # Container to run experiments
├── docker-compose.yaml       # Optional: launches a Prometheus stub & experiment container
├── Makefile
├── requirements.txt
├── LICENSE
└── CITATION.cff
```
- Choose an experiment YAML under `configs/experiments/` (e.g., `small_demo.yaml`, `ablation_fedprox.yaml`).
- Run the experiment.
- Use `src/tools/plot_results.py` to generate latency CDFs, learning curves, and Pareto plots (a plotting sketch follows this list).
- Insert figures into `docs/paper/` as instructed in `docs/paper/outline.md`.
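For a sense of what the latency-CDF step involves, here is a minimal sketch assuming `metrics.csv` has one row per request with a `latency_ms` column; that column name is an assumption, not the documented schema.

```python
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed layout: one row per completed request with a latency_ms column.
df = pd.read_csv("results/small_demo/metrics.csv")
latency = np.sort(df["latency_ms"].to_numpy())
cdf = np.arange(1, len(latency) + 1) / len(latency)

out_dir = Path("results/small_demo/plots")
out_dir.mkdir(parents=True, exist_ok=True)

plt.plot(latency, cdf)
plt.xlabel("Latency (ms)")
plt.ylabel("CDF")
plt.title("End-to-end latency CDF (small_demo)")
plt.savefig(out_dir / "latency_cdf.png", dpi=200)
```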
- The code ships with safe defaults (dry-run). Live integration requires explicit env vars and kubeconfig/OpenStack credentials.
- Review `configs/integrations/*.yaml` and `src/hooks/*` before enabling live mode in production environments (see the gating sketch below).
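A minimal sketch of how an env-var-gated hook can keep dry-run as the default; `INTEGRATION_MODE` is the variable mentioned in the quick start, while `scale_deployment` and its body are hypothetical, not the repository's hook API.

```python
import os


def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Scale a Kubernetes Deployment, but only when explicitly enabled."""
    if os.environ.get("INTEGRATION_MODE", "dry-run") != "live":
        # Dry-run default: log the intended action and return without side effects.
        print(f"[dry-run] would scale {namespace}/{name} to {replicas} replicas")
        return

    # Live mode: requires a reachable cluster and a valid kubeconfig.
    from kubernetes import client, config  # optional dependency

    config.load_kube_config()
    client.AppsV1Api().patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )
```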
- Licensed under MIT (see `LICENSE`).
- Please cite using `CITATION.cff`.