
Federated Reinforcement Learning for Multi-Cloud Load Balancing and Task Scheduling

This repository implements an academic research prototype for federated reinforcement learning (FRL) across multi-cloud regions and providers, jointly learning load-balancing and task-scheduling policies under latency, SLO, and carbon/energy constraints.

Stack: Python · PyTorch · YAML configs · Docker · GitHub Actions · Integration hooks for Kubernetes/OpenStack · Prometheus metrics adapters
Focus: Reproducible experiments, ablation studies, and paper-ready figures.


Key Features

  • Federated RL (FedAvg + optional FedProx) coordinating regional PPO agents (an aggregation sketch follows this list).
  • Two-tier policy: (1) a load balancer picks a cloud/region; (2) a scheduler assigns the task to a node/pool.
  • Multi-objective rewards combining latency, queueing delay, SLO violations, cost, and carbon intensity (a reward sketch also follows the list).
  • Integration hooks (mock+optional live):
    • Kubernetes: cluster metrics and deployment scaling via the Kubernetes Python client (optional, with safe stubs for dry-run mode).
    • OpenStack: Nova/Neutron stubs for VM placement decisions.
    • Prometheus: metric scraping adapters.
  • Academic package: configs, baselines, ablations, seeding, experiment runner, result exports, and paper materials in docs/.
  • Reproducibility: fixed random seeds, logged configs, deterministic ops where possible.
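
For orientation, here is a minimal sketch of the FedAvg-style server aggregation step, assuming clients return PyTorch state_dicts together with their local sample counts. The function and variable names are illustrative, not this repository's actual API; see src/federated/ for the real server and client logic.

import torch

def fedavg(client_states, num_samples):
    """Sample-weighted average of client model state_dicts (FedAvg)."""
    total = float(sum(num_samples))
    weights = [n / total for n in num_samples]
    avg = {}
    for key in client_states[0]:
        # Weighted sum of this parameter tensor across all clients.
        avg[key] = sum(
            w * state[key].to(torch.float32)
            for state, w in zip(client_states, weights)
        )
    return avg

FedProx keeps this server step unchanged; it instead adds a proximal penalty mu/2 * ||theta - theta_global||^2 to each client's local objective, which damps client drift when regional workloads are heterogeneous.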
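
Likewise, a hedged sketch of how the multi-objective reward can be scalarized from the listed signals. The weights and field names below are placeholders for illustration; the actual coefficients and metric names live in the YAML configs and the simulator.

from dataclasses import dataclass

@dataclass
class StepMetrics:
    latency_ms: float
    queue_delay_ms: float
    slo_violated: bool
    cost_usd: float
    carbon_gco2: float

def reward(m: StepMetrics, w_lat=1.0, w_q=0.5, w_slo=5.0, w_cost=0.1, w_co2=0.1):
    # Negative weighted sum: the agent jointly minimizes latency,
    # queueing delay, SLO violations, cost, and carbon intensity.
    return -(w_lat * m.latency_ms
             + w_q * m.queue_delay_ms
             + w_slo * float(m.slo_violated)
             + w_cost * m.cost_usd
             + w_co2 * m.carbon_gco2)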

Quickstart

# 1) Create env
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2) Smoke test (synthetic multi-cloud simulation)
python -m src.run_experiment --config configs/experiments/small_demo.yaml

# 3) Plot results
python -m src.tools.plot_results --input results/small_demo/metrics.csv --out results/small_demo/plots

Note: Kubernetes/OpenStack hooks default to dry-run unless you set INTEGRATION_MODE=live and provide credentials.
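
For illustration, a dry-run guard in a hook might look like the sketch below. INTEGRATION_MODE is the variable mentioned above, while the function name and behavior are hypothetical stand-ins for the hooks under src/hooks/.

import os

def scale_deployment(name: str, replicas: int) -> None:
    """Hypothetical hook: only touches the cluster when explicitly enabled."""
    if os.environ.get("INTEGRATION_MODE") != "live":
        # Safe default: log the intended action and do nothing.
        print(f"[dry-run] would scale deployment {name!r} to {replicas} replicas")
        return
    # Live mode: a real hook would call the Kubernetes API here, e.g. via
    # kubernetes.client.AppsV1Api().patch_namespaced_deployment_scale(...).
    raise NotImplementedError("wire up the kubernetes client for live mode")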


Repository Layout

federated-rl-multicloud/
├── configs/                 # YAML configs for env/agents/experiments
├── docs/                    # Paper materials: abstract, outline, figs (Mermaid), checklist
├── notebooks/               # Minimal notebooks to inspect logs and metrics
├── results/                 # (created at runtime) experiment outputs
├── src/
│   ├── envs/                # Multi-cloud simulator
│   ├── federated/           # Aggregation server & client logic
│   ├── hooks/               # Integration hooks (K8s, OpenStack, Prometheus)
│   ├── models/              # Policy/value networks
│   ├── rl/                  # PPO implementation (minimal)
│   ├── sched/               # Two-tier policy wrapper
│   ├── tools/               # Plotting, seeding, io helpers
│   └── run_experiment.py    # CLI entrypoint
├── tests/                   # Unit tests (smoke-level)
├── .github/workflows/ci.yml # CI: lint + unit tests
├── Dockerfile               # Container to run experiments
├── docker-compose.yaml      # Optional: launches a Prometheus stub & experiment container
├── Makefile
├── requirements.txt
├── LICENSE
└── CITATION.cff

Reproducing Paper Figures

  1. Choose an experiment YAML under configs/experiments/ (e.g., small_demo.yaml, ablation_fedprox.yaml).
  2. Run the experiment.
  3. Use src/tools/plot_results.py to generate latency CDFs, learning curves, and Pareto plots (a minimal CDF example follows this list).
  4. Insert figures into docs/paper/ as instructed in docs/paper/outline.md.
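
As a rough idea of what step 3 computes, here is a minimal empirical latency-CDF plot from a metrics CSV. The column name latency_ms is an assumption for illustration, not a documented schema; check the actual CSV header produced by your run.

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("results/small_demo/metrics.csv")
lat = np.sort(df["latency_ms"].to_numpy())       # column name is an assumption
cdf = np.arange(1, len(lat) + 1) / len(lat)      # empirical CDF

os.makedirs("results/small_demo/plots", exist_ok=True)
plt.plot(lat, cdf)
plt.xlabel("latency (ms)")
plt.ylabel("fraction of requests")
plt.savefig("results/small_demo/plots/latency_cdf.png")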

Safety & Live Integrations

  • The code ships with safe defaults (dry-run). Live integration requires explicit environment variables plus kubeconfig/OpenStack credentials.
  • Review configs/integrations/*.yaml and src/hooks/* before enabling live mode in production environments.

License & Citation

  • Licensed under MIT (see LICENSE).
  • Please cite using CITATION.cff.
