Fetch and analyze historical data from your W&B experiments and gain insights into your machine learning projects by auditing compute usage and resource utilization.

Weights & Biases Efficiency Audit Tool

The Weights & Biases Efficiency Audit Tool is a Python-based utility designed to fetch and analyze historical data from your experiment tracking platform. This tool helps you gain insights into your machine learning projects by auditing compute usage and resource utilization.

Features

  • Fetch historical metrics, parameters, and metadata for all experiments.
  • Analyze GPU and CPU utilization metrics, including full metric history for GPUs.
  • Export results to a detailed Excel report, including raw data, summary, and image report.

Gain insights into your machine learning projects by auditing compute usage and resource utilization.

Utilization / Cost

  • Are you using the optimal machine sizes for your workloads?
  • How often do GPUs or entire machines sit idle?
  • How much compute is wasted on idle GPU time?
  • What's the financial impact of underutilized resources?

Efficiency

  • What's your overall GPU utilization across all experiments?
  • How many experiments run with 0% GPU utilization?
  • Which runs represent the biggest optimization opportunities?
  • How does your efficiency break down across different runs?

Performance Analysis

  • Track CPU, GPU memory, and disk utilization
  • Analyze network I/O patterns
  • Review system metrics across all experiments
  • Identify efficiency patterns over time

Folder Structure

wandb-efficiency-audit/
│
├── wandb_efficiency_audit.py    # Main script
├── generate_report_image.py     # Helper functions for visual report generation
├── fonts/                       # Font files for report generation
├── README.md                    # Documentation
└── pyproject.toml               # Python project configuration

Installation

Prerequisites

  • Python 3.9 or higher
  • An active W&B account (or access to public W&B projects)
  • W&B system metrics monitoring turned on (usually enabled by default)

Setup

  1. Clone this repository:

    git clone https://github.com/valohai/wandb-efficiency-audit.git
    cd wandb-efficiency-audit
  2. Create a virtualenv and install the required dependencies:

    python -m venv venv
    source venv/bin/activate  # This depends on your shell
    pip install -e .
  3. (Optional) Log in to W&B if accessing private projects:

    wandb login

Note: No login is required for public W&B projects.

Usage

  1. Run the script to generate the audit report:

    wandb-efficiency-audit --project "entity/project"
  2. To analyze only completed runs (excluding failed/crashed runs):

    wandb-efficiency-audit --project "entity/project" --completed-only
  3. The report will be saved as experiment_metrics_summary.xlsx in the current directory.
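Because the report is always written to experiment_metrics_summary.xlsx in the current directory, auditing several projects in a row would overwrite earlier reports. One possible workaround, sketched below, is to run each audit from its own working directory. The helper names and the `audits` folder are hypothetical; only the `--project` and `--completed-only` flags come from the commands above.

```python
import subprocess
from pathlib import Path


def audit_command(project, completed_only=False):
    """Build the CLI invocation for one "entity/project" path."""
    cmd = ["wandb-efficiency-audit", "--project", project]
    if completed_only:
        cmd.append("--completed-only")
    return cmd


def run_audits(projects, out_root="audits", completed_only=False):
    """Run one audit per project, each in its own directory so reports do not collide."""
    for project in projects:
        out_dir = Path(out_root) / project.replace("/", "__")
        out_dir.mkdir(parents=True, exist_ok=True)
        # The tool writes its report to the current directory, so set cwd per project.
        subprocess.run(audit_command(project, completed_only), cwd=out_dir, check=True)
```

For example, `run_audits(["entity/project-a", "entity/project-b"])` would leave one report under `audits/entity__project-a/` and another under `audits/entity__project-b/`.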

Output Files

The tool generates two main outputs:

  1. experiment_metrics_summary.xlsx - A comprehensive Excel workbook containing:

    • Summary sheet with visual report, key metrics, and methodology
    • Cost analysis and efficiency distribution
    • Example runs with biggest optimization opportunities
    • Detailed metrics sheet with all raw data
  2. Visual report PNG (embedded in Excel) showing:

    • Total GPU utilization percentage
    • Total GPU idle time
    • Percentage of runs with 0% GPU utilization
    • Cost of idle compute by GPU type
    • Example runs with low utilization

Understanding the Results

Efficiency Score Categories

  • Excellent (70%+): Optimal GPU utilization
  • Good (50-70%): Acceptable utilization with minor optimization potential
  • Fair (30-50%): Significant room for improvement
  • Poor (10-30%): Major underutilization issues
  • Critical (<10%): Severe waste, immediate action recommended
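The categories above can be read as a simple threshold lookup. The sketch below is one interpretation, not the tool's code: the listed ranges share their boundary values (for example, 70% appears in both Excellent and Good), so it assumes each lower bound is inclusive. The function name is hypothetical.

```python
def efficiency_category(gpu_utilization_pct):
    """Map an average GPU utilization percentage to an efficiency category.

    Assumption: lower bounds are inclusive, so exactly 70% counts as Excellent.
    """
    if not 0 <= gpu_utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    thresholds = [(70, "Excellent"), (50, "Good"), (30, "Fair"), (10, "Poor")]
    for lower_bound, label in thresholds:
        if gpu_utilization_pct >= lower_bound:
            return label
    return "Critical"
```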

Key Metrics Explained

  • Total GPU Utilization: Average GPU core utilization over time across all runs
  • Total GPU Idle Time: Total runtime multiplied by (100% - Average GPU utilization)
  • Runs with 0% GPU Utilization: Share of runs in which at least one GPU core was not utilized at all for the entire run
  • Cost of Idle Compute: Estimated cost of unused GPU time based on AWS on-demand pricing
  • Example runs: Runs with low utilization that are among the longest 25% of runs in the project
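To make the arithmetic behind these metrics concrete, here is a minimal sketch under stated assumptions. The `RunStats` type, the function names, the hourly price argument, and the 30% cutoff for "low utilization" are all hypothetical; the tool's actual thresholds and its AWS price table are not given here.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    name: str
    runtime_hours: float
    avg_gpu_util_pct: float


def idle_hours(run):
    """Idle time = runtime multiplied by (100% - average GPU utilization)."""
    return run.runtime_hours * (100.0 - run.avg_gpu_util_pct) / 100.0


def idle_cost(run, hourly_price_usd):
    """Estimated cost of idle time for a caller-supplied (hypothetical) hourly price."""
    return idle_hours(run) * hourly_price_usd


def example_runs(runs, low_util_pct=30.0):
    """Pick low-utilization runs from the longest 25% of runs (at least one run)."""
    if not runs:
        return []
    by_length = sorted(runs, key=lambda r: r.runtime_hours, reverse=True)
    top_n = max(1, len(by_length) // 4)
    return [r for r in by_length[:top_n] if r.avg_gpu_util_pct < low_util_pct]
```

For instance, a 10-hour run at 20% average utilization counts as 8 idle GPU hours; at an assumed price of 3 USD per hour, that is 24 USD of idle compute.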

Requirements

The project dependencies are:

  • wandb — Interact with W&B tracking server.
  • pandas — Data processing and analysis.
  • openpyxl — Generate Excel reports.
  • pillow — Generate visual report images.
  • requests — HTTP requests for data submission.
