
ashuaibi7

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2022

Detailed analysis revealed a QPS drop when the additional logging is enabled. A head-to-head comparison of the time spent on logging shows a 4x increase in duration (see https://fburl.com/scuba/tbe_stats_runtime/y85ur4k9).

  • Average QPS across 4 runs without the added logging: 246k. Average QPS across 4 runs with the added logging: 243k (~1.2% QPS drop).
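
The quoted drop follows directly from the two averages. A minimal check, assuming only the rounded averages reported above (the per-run QPS values are not listed here, and the variable names are illustrative):

```python
# Rounded averages quoted in the summary; per-run values are not given.
qps_without_logging = 246_000
qps_with_logging = 243_000

# Relative drop = (baseline - new) / baseline.
drop = (qps_without_logging - qps_with_logging) / qps_without_logging
drop_pct = round(drop * 100, 1)

print(f"QPS drop: {drop_pct}%")
```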

Ran the following models without the added logging:

  • aps-icvrbase-tbe-dump-test-old-1-d234a33214
  • aps-icvrbase-tbe-dump-test-old-2-f5d7f5d97a
  • aps-icvrbase-tbe-dump-test-old-3-92aa2d14c3
  • aps-icvrbase-tbe-dump-test-old-timed-9ac1869846

Ran the following models with the added logging:

  • aps-icvrbase-tbe-dump-test-new-1-fcb93df6a6
  • aps-icvrbase-tbe-dump-test-new-2-3f15ec3a29
  • aps-icvrbase-tbe-dump-test-new-3-211d3c3f01
  • aps-icvrbase-tbe-dump-test-new-timed-6e6a932849

Differential Revision: D84727563


netlify bot commented Oct 15, 2025

Deploy Preview for pytorch-fbgemm-docs failed.

| Name | Link |
|---|---|
| 🔨 Latest commit | c5b2a8b |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68eff35377713f00082920da |

@meta-cla meta-cla bot added the cla signed label Oct 15, 2025

meta-codesync bot commented Oct 15, 2025

@ashuaibi7 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84727563.

Ahmed Shuaibi added 4 commits October 15, 2025 12:17
)

Summary:

X-link: facebookresearch/FBGEMM#2021

Add granular sparse static memory breakdown metrics for TBE to enable validation of planner estimates against runtime memory usage. This implementation separates static sparse memory (weights, optimizer states, cache) from ephemeral memory (activations, IO buffers, gradients) and provides per-component HBM/UVM categorization. The existing `tbe.total_hbm_usage` aggregates all memory without distinguishing between persistent storage and ephemeral buffers, making it difficult to identify and validate static sparse parameter estimates.

## Changes

### 1. New Scuba Metrics (`tbe_stats_reporters.py`)
Added 10 granular memory metrics to `SyncBatchODSStatsReporter`:

**HBM metrics:**
- `tbe.hbm.sparse_params` - Embedding weights in HBM
- `tbe.hbm.optimizer_states` - Momentum states in HBM
- `tbe.hbm.cache` - Cache storage in HBM
- `tbe.hbm.total_static_sparse` - Total static memory in HBM
- `tbe.hbm.ephemeral` - Ephemeral memory in HBM (activations, temp buffers, etc.)

**UVM metrics:** the same five metrics, reported for UVM.

### 2. Memory Categorization Logic (`split_table_batched_embeddings_ops_training.py`)
- Added helper methods:
  - `_get_tensor_memory()` - Get tensor memory size
  - `_categorize_memory_by_location()` - Categorize tensors into HBM/UVM
- Refactored `_report_tbe_mem_usage()` with clean list-based tensor grouping
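
The two helpers can be pictured with a small sketch. This is not the FBGEMM implementation: `TensorInfo` is a hypothetical stand-in for a real tensor, the `"hbm"`/`"uvm"` labels are assumed, and the function bodies only illustrate the idea of sizing each tensor and summing by location.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class TensorInfo:
    """Hypothetical stand-in for a tensor (not an FBGEMM or PyTorch type)."""
    numel: int          # number of elements
    element_size: int   # bytes per element
    location: str       # "hbm" or "uvm" (assumed labels)


def get_tensor_memory(t: TensorInfo) -> int:
    """Return the size of one tensor in bytes."""
    return t.numel * t.element_size


def categorize_memory_by_location(tensors: Iterable[TensorInfo]) -> Dict[str, int]:
    """Sum tensor sizes into per-location byte totals."""
    totals = {"hbm": 0, "uvm": 0}
    for t in tensors:
        totals[t.location] += get_tensor_memory(t)
    return totals


# List-based grouping: one list per memory component, categorized separately.
weights = [
    TensorInfo(numel=1_000, element_size=4, location="hbm"),
    TensorInfo(numel=500, element_size=2, location="uvm"),
]
weight_bytes = categorize_memory_by_location(weights)
print(weight_bytes)
```

Grouping tensors into one list per component (weights, optimizer, cache) lets the same categorization routine produce each per-component HBM/UVM figure.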

### 3. Memory Components
**Static Sparse:**
- Weights: `weights_dev`, `weights_host`, `weights_uvm`
- Optimizer: `momentum1_dev/host/uvm`, `momentum2_dev/host/uvm`
- Cache: `lxu_cache_weights`, `lxu_cache_state`, `lxu_state`, cache aux data

**Ephemeral (calculated):**
- `ephemeral = total_mem_usage - static_sparse`
- Includes IO buffers, activations, gradients
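
As a rough illustration of the subtraction above, the sketch below derives the five HBM metrics from byte counts. The function name, its parameters, and the example numbers are assumptions made for illustration; only the metric names and the `ephemeral = total_mem_usage - static_sparse` relation come from this description.

```python
from typing import Dict


def build_memory_metrics(
    total_mem_usage: int,
    sparse_params: int,
    optimizer_states: int,
    cache: int,
    prefix: str = "tbe.hbm",
) -> Dict[str, int]:
    """Hypothetical helper: derive static and ephemeral metrics in bytes."""
    # Static sparse memory is the sum of the persistent components.
    total_static_sparse = sparse_params + optimizer_states + cache
    # Ephemeral memory is whatever remains of the total (IO buffers,
    # activations, gradients).
    ephemeral = total_mem_usage - total_static_sparse
    return {
        f"{prefix}.sparse_params": sparse_params,
        f"{prefix}.optimizer_states": optimizer_states,
        f"{prefix}.cache": cache,
        f"{prefix}.total_static_sparse": total_static_sparse,
        f"{prefix}.ephemeral": ephemeral,
    }


# Made-up byte counts, purely to exercise the arithmetic.
metrics = build_memory_metrics(
    total_mem_usage=10_000,
    sparse_params=6_000,
    optimizer_states=2_000,
    cache=500,
)
for name, value in metrics.items():
    print(name, value)
```

Because ephemeral memory is computed by subtraction rather than measured, any tensor missing from the static lists is counted as ephemeral.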

Differential Revision: D84624978

Summary:

X-link: facebookresearch/FBGEMM#2022

Detailed analysis revealed a QPS drop when the additional logging is enabled. A head-to-head comparison of the time spent on logging shows a 4x increase in duration (see https://fburl.com/scuba/tbe_stats_runtime/y85ur4k9).

- Average QPS across 4 runs without the added logging: 246k. Average QPS across 4 runs with the added logging: 243k (~1.2% QPS drop).

Ran the following models without the added logging:
- aps-icvrbase-tbe-dump-test-old-1-d234a33214
- aps-icvrbase-tbe-dump-test-old-2-f5d7f5d97a
- aps-icvrbase-tbe-dump-test-old-3-92aa2d14c3
- aps-icvrbase-tbe-dump-test-old-timed-9ac1869846

Ran the following models with the added logging:
- aps-icvrbase-tbe-dump-test-new-1-fcb93df6a6
- aps-icvrbase-tbe-dump-test-new-2-3f15ec3a29
- aps-icvrbase-tbe-dump-test-new-3-211d3c3f01
- aps-icvrbase-tbe-dump-test-new-timed-6e6a932849

Differential Revision: D84727563