# CONTRIBUTING.md (29 additions, 0 deletions)
- [Scope (optional)](#scope-optional)
- [Examples](#examples)
- [Commit Message Convention](#commit-message-convention)
- [Running Tests Locally](#running-tests-locally)

---

```
docs(readme-tracediff): add docs for jax tracediff
```

This format helps us automatically generate changelogs and makes versioning clearer.

## Running Tests Locally

Once you have made your changes and are ready to open a PR, it is good practice to run the tests locally first. This surfaces any errors that need fixing and ensures the code is safe to push. The tests live in the `/tests` folder, and you will need pytest to run them.

### Installing pytest

If you don't have pytest installed, you can add it using pip:

```sh
pip install pytest
```

### Running Tests

To run all tests in the `/tests` directory:

```sh
pytest tests/
```

To run a specific test file:

```sh
pytest tests/test_compare_perf_report.py
```
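
pytest's standard selection options also work here if you want to narrow a run; `test_name` below is a placeholder, not a real test in this repo:

```sh
# Run only tests whose names match a keyword expression
pytest tests/ -k "compare_perf"

# Run a single test function (replace test_name with an actual test)
pytest tests/test_compare_perf_report.py::test_name
```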

For more options and usage, see the [pytest documentation](https://docs.pytest.org/en/stable/).
# docs/TreePerf.md (109 additions, 15 deletions)

See LICENSE for license information.

TreePerf is a Python SDK that works in conjunction with the Trace2Tree project. PyTorch generates a trace JSON file during profiling, which Trace2Tree parses into a tree data structure representing hierarchical dependencies between CPU operations and GPU kernel executions. TreePerf builds on this tree structure to compute performance metrics at both the model and operation levels. It enables users to analyze, interpret, and optimize AI models by providing detailed performance insights essential for architectural design and performance optimization.

## Key Ideas

1. Tree Structure for GPU Execution Analysis: The hierarchical tree structure, generated by Trace2Tree, enables straightforward computation of GPU execution times. By linking CPU operations to their corresponding GPU kernel launches, it allows seamless aggregation of kernel execution times and performance metrics at various levels of granularity.
2. Performance Metrics from JSON Parsing: Metrics such as FLOPS and FLOPS/Byte are derived by extracting operation parameters from the JSON event data structure. These parameters are parsed and fed into performance models, enabling precise computation of performance metrics.

## Key Features

### 1. GPU timeline breakdown

Provides a high-level view of GPU activity, including busy time, idle time, communication time, etc.
Example output:

| type | time (ms) | percent |
|------|-----------|---------|
| … | … | … |
| idle_time | 4.717306 | 0.072283 |
| total_time | 6526.175517 | 100.000000 |

### 2. GPU compute time breakdown by CPU op

Analyze the performance breakdown by examining the lowest-level CPU operations (from the call stack perspective) and the time they "induce" on the GPU.
Unlike traditional methods that directly look at CPU and GPU durations, this feature provides insight into how CPU operations translate into GPU kernel launches, offering a more stable and interpretable abstraction of performance.
Example output showing top 5 ops sorted by the total GPU time each op induces:

| name | count | total GPU time (ms) | percent | cumulative percent |
|------|-------|---------------------|---------|---------------------|
| … | … | … | … | … |
| aten::copy_ | 4 | 69.96 | 1.11 | 95.15 |
| triton_poi_fused_add_fill_mul_sigmoid_silu_sub_0 | 8 | 43.19 | 0.69 | 95.84 |

### 3. Roofline metrics

Example output for aten::mm showing top 5 param combos sorted by the total GPU time each param combo induces:

| name | param: M | param: N | param: K | param: bias | FLOPS/Byte_first | TFLOPS/s_mean |
|------|----------|----------|----------|-------------|------------------|---------------|
| … | … | … | … | … | … | … |
| aten::mm | 73728 | 128256 | 8192 | False | 6972.01 | 628.10 |
| aten::mm | 8192 | 28672 | 73728 | False | 5864.73 | 599.95 |
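
As a sanity check on these numbers: for a GEMM, FLOPs are `2 * M * N * K` and bytes moved are approximately `(M*K + K*N + M*N) * bytes_per_element`. Assuming 2-byte elements (fp16/bf16), this reproduces the FLOPS/Byte_first value of the first row:

```python
# Arithmetic intensity of the first aten::mm row, assuming 2-byte elements.
M, N, K = 73728, 128256, 8192
flops = 2 * M * N * K                 # multiply-accumulate operations
nbytes = (M * K + K * N + M * N) * 2  # read A and B, write C, 2 bytes each
print(flops / nbytes)                 # ~6972.01, matching FLOPS/Byte_first above
```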

- Adding new operations is simple (Contributions are welcome!):
  - `perf_model.py`:
    - Parse operation shapes from the JSON trace
    - Write the performance model (FLOPS, bytes) or reuse an existing one
  - `torch_op_mapping.py`: Map the operation to the perf model here

For more details and a walkthrough, check out the `base_example.ipynb` notebook.
**Replace the profile path in `base_example.ipynb` with your own profile file and get insights instantly!**

Once you are familiar with the workflows, you can directly use or modify the **generate_perf_report.py** script to quickly generate an Excel report.

## How to Add a New Perf Model

To add a new performance model (perf model) to TreePerf, follow these steps. This guide uses an attention-based perf model as an example, since these are frequently added. Although this example uses PyTorch, perf models for JAX can be added in a similar fashion.

**Most important:** For attention and similar ops, the main change is usually in the `get_param_details` method. This method extracts parameters (such as input shapes, dropout, etc.) from the trace event, and the indices/positions of these parameters often differ between ops. Most attention variants share very similar function signatures (argument types and positions), so adding one usually comes down to adjusting those indices to match the argument order in your Kineto trace. You typically do NOT need to overload `flops` or `bytes` unless the computation itself changes.

### 1. Implement the Perf Model Class

- Go to `perf_model.py`.
- Create a new class for your operation, inheriting from the appropriate base class (`SDPA` for attention, `GEMM` for matrix multiplication, etc.).
- Implement required methods:
- `get_param_details`: **This is usually the only method you need to change.** Update the indices/positions to correctly extract parameters (e.g., input shapes, dropout) from the event, based on your op's argument order in the Kineto trace.
- `__init__`: Call `get_param_details` and store parsed parameters.
- `flops`, `bytes`: Usually do not need to be changed unless the computation logic is different.
- (Optional) `flops_bwd`, `bytes_bwd`: For backward pass metrics.

### Example: Adding a new attention-based perf model (with custom `get_param_details`)

```python
class my_attention_forward(SDPA):
    def get_param_details(self, event):
        # Adjust these indices to match your op's argument order in the Kineto trace
        q_shape = event['args'][0]['shape']  # index may differ
        k_shape = event['args'][1]['shape']
        v_shape = event['args'][2]['shape']
        dropout = event['args'][5]           # index may differ
        # ... parse other params as needed
        return {'q_shape': q_shape, 'k_shape': k_shape,
                'v_shape': v_shape, 'dropout': dropout}

    # flops() and bytes() are inherited from SDPA. Override them only if your
    # variant's computation actually differs; an empty override (`pass`) would
    # shadow the base-class implementation and break the computed metrics.
```
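
For a concrete anchor, take Tri Dao's flash-attention: its Python entry point is `flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False, ...)`. Assuming the traced op records its positional arguments in signature order (an assumption to verify against `event['args']` in your own trace), the Q/K/V shapes land at indices 0-2 and the dropout probability at index 3:

```python
# Hypothetical index mapping for flash_attn_func, assuming the trace stores
# positional arguments in signature order; confirm against event['args'] in
# your own Kineto trace before relying on these indices.
class flash_attn_forward(SDPA):
    def get_param_details(self, event):
        q_shape = event['args'][0]['shape']  # q: (batch, seqlen, nheads, headdim)
        k_shape = event['args'][1]['shape']  # k
        v_shape = event['args'][2]['shape']  # v
        dropout = event['args'][3]           # dropout_p
        return {'q_shape': q_shape, 'k_shape': k_shape,
                'v_shape': v_shape, 'dropout': dropout}
```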

### 2. Map the Operation Name to the Perf Model

- Go to `torch_op_mapping.py`.
- Add your operation name and class to `op_to_perf_model_class_map`. Note that this is the name by which this op appears in the PyTorch Kineto trace.

```python
op_to_perf_model_class_map['my_attention::forward'] = perf_model.my_attention_forward
```

### 3. Categorize the Operation (Optional)

- If your operation fits a new category, update `dict_base_class2category` and `dict_cat2names` accordingly.
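
The exact structure of these dictionaries is defined in `torch_op_mapping.py`. As a purely illustrative sketch (the keys and labels below are hypothetical, not the repo's actual entries), an update could look like:

```python
# Illustrative only; check torch_op_mapping.py for the real structure.
dict_base_class2category[perf_model.SDPA] = 'attention'  # base class -> category
dict_cat2names.setdefault('attention', []).append('my_attention::forward')  # category -> op names
```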

### 4. Test and Validate

- Ensure your perf model parses parameters correctly and computes metrics as expected.
- Use the example notebook or scripts to verify integration.
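
Before opening the notebook, a quick smoke test against a hand-built event can catch indexing mistakes early. The event layout below mirrors the example class above and is purely illustrative; the constructor is assumed to accept the event and call `get_param_details`, as described in step 1:

```python
# Minimal smoke test with a hand-built event (layout mirrors the example above).
fake_event = {
    'args': [
        {'shape': [1, 2048, 32, 128]},  # q
        {'shape': [1, 2048, 32, 128]},  # k
        {'shape': [1, 2048, 32, 128]},  # v
        None,
        None,
        0.0,                            # dropout, at index 5 as parsed above
    ]
}
model = my_attention_forward(fake_event)  # __init__ assumed to call get_param_details
print(model.flops(), model.bytes())       # inherited SDPA implementations
```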


---

## Tips

- Reuse parameter parsing logic from similar attention models.
- Follow the structure and naming conventions used in existing models.
- If your operation is a variant (e.g., backward, varlen), inherit from the closest existing perf model and override only necessary methods.

For more details, see the comments and examples in `perf_model.py` and `torch_op_mapping.py`.